Co-Clustering with Outlier Detection for Functional Data
Joint Research Centre, European Commission, Ispra, Italy
KU Leuven, Department of Mathematics, Celestijnenlaan 200B, 3001 Leuven, Belgium
Keywords: Co-clustering – Functional data – Multivariate – Trimming
Introduction
Co-clustering is a powerful technique for simultaneously grouping the rows and columns of a data matrix to uncover hidden patterns and structures. Unlike traditional clustering methods that treat rows and columns independently, co-clustering capitalizes on the interactions between these dimensions, offering significant advantages in applications such as microarray data analysis, text mining, and recommendation systems.
We explore the Latent Block Model (LBM), a versatile approach to co-clustering that models data through an underlying unobserved block structure, effectively managing various data types including continuous, binary, count and functional data (Slimen et al. [2018], Bouveyron et al. [2022]). The inherent adaptability of LBM makes it a valuable tool for data analysis and exploration. However, real-world data often deviate from theoretical assumptions, containing anomalies or outliers that can bias model estimation (García-Escudero et al. [2008]). These anomalies, while problematic, may also carry crucial information (Rousseeuw et al. [2019], Torti et al. [2021], Portela and Olsen [2023]).
Contribution
To address the challenges posed by outliers, we propose a cell-wise trimmed version of the Latent Block Model for multivariate normal data (Fibbi et al. [2024]). Each iteration includes a trimming step that discards anomalous cells, which makes parameter estimation more robust. The trimmed cells are recorded in a mask and are identified by their Mahalanobis distance from the estimated distribution of their respective clusters.
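The trimming step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `trimming_mask`, the array layout (an `n x p` matrix of `d`-variate cells), and the rule of trimming a global fraction `alpha` of all cells are hypothetical choices made for the example.

```python
import numpy as np

def trimming_mask(X, z, w, means, covs, alpha):
    """Flag the alpha fraction of cells farthest from their block.

    X     : (n, p, d) array, one d-variate observation per cell
    z, w  : integer row and column cluster labels, shapes (n,) and (p,)
    means : (K, L, d) block means; covs : (K, L, d, d) block covariances
    Returns a boolean (n, p) mask, True for kept cells (assumed convention).
    """
    n, p, _ = X.shape
    mu = means[z][:, w]                   # (n, p, d) mean of each cell's block
    prec = np.linalg.inv(covs)[z][:, w]   # (n, p, d, d) precision of each block
    diff = X - mu
    # Squared Mahalanobis distance of every cell to its own block.
    dist2 = np.einsum("ijk,ijkl,ijl->ij", diff, prec, diff)
    mask = np.ones((n, p), dtype=bool)
    n_trim = int(np.floor(alpha * n * p))
    if n_trim > 0:
        worst = np.argsort(dist2, axis=None)[-n_trim:]
        mask[np.unravel_index(worst, dist2.shape)] = False
    return mask
```

In an iterative scheme, such a mask would be recomputed after each parameter update, so that a cell trimmed at one iteration can re-enter at the next.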
The proposed method employs a modified Stochastic EM (SEM) algorithm, specifically the SEM-Gibbs algorithm, which is less prone to getting trapped in local optima of the likelihood and handles missing data effectively. In the M-step, the parameters are estimated robustly.
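To make the stochastic step concrete, the sketch below samples row labels conditionally on the current column labels, as one half of a Gibbs sweep would. It is an illustration only: the function name `sample_row_labels`, the array shapes, and the choice to let trimmed cells contribute nothing to the likelihood are assumptions of this example rather than details taken from the method itself.

```python
import numpy as np

def sample_row_labels(X, mask, w, pi, means, covs, rng):
    """Draw row labels given column labels w (one Gibbs half-step).

    X : (n, p, d) data; mask : (n, p) boolean, True for cells in use
    pi : (K,) row proportions; means : (K, L, d); covs : (K, L, d, d)
    """
    n, _, d = X.shape
    K = len(pi)
    prec = np.linalg.inv(covs)               # (K, L, d, d)
    _, logdet = np.linalg.slogdet(covs)      # (K, L)
    logp = np.tile(np.log(pi), (n, 1))       # (n, K) unnormalised log-probabilities
    for k in range(K):
        diff = X - means[k][w]               # (n, p, d)
        maha = np.einsum("ijk,jkl,ijl->ij", diff, prec[k][w], diff)
        cell_ll = -0.5 * (maha + logdet[k][w] + d * np.log(2 * np.pi))
        # Assumption: trimmed cells are skipped in the likelihood.
        logp[:, k] += np.where(mask, cell_ll, 0.0).sum(axis=1)
    logp -= logp.max(axis=1, keepdims=True)  # stabilise before exponentiating
    prob = np.exp(logp)
    prob /= prob.sum(axis=1, keepdims=True)
    return np.array([rng.choice(K, p=pr) for pr in prob])
```

Column labels would be drawn symmetrically given the row labels, and the parameters re-estimated from the sampled partition using only the untrimmed cells.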
We also address the model selection challenges inherent in co-clustering with a heuristic approach that combines a trimmed version of the ICL criterion, used to determine the number of clusters, with the G-statistic, used to choose the trimming level (Lomet [2012]). This two-step procedure supports robust cluster retrieval and outlier management.
Case study
Our simulation studies, based on synthetic data, confirm the method’s accuracy in recovering partitions and parameters, as well as its efficacy in outlier detection. We generated the synthetic data by adapting the MixSim framework (Melnykov et al. [2012, 2024]) to the co-clustering setting, modifying the functions MixSim.m and simdataset.m in the FSDA toolbox for MATLAB (Riani et al. [2012, 2015]). A statistically principled extension of MixSim to co-clustering remains, however, an open research problem.
The final contribution of this work is the application of the new framework to a set of functional data related to the energy market (Bernardi and Sangalli [2022]). The data are processed using functional principal component analysis to reduce each curve to three parameters that capture its main behaviours. Our aim is to determine whether this analysis can reveal anomalous variations in energy market pricing. The original datasets are publicly available on the website of the Italian energy service provider (GSE, “Gestore Servizi Energetici”).
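The dimension reduction step can be illustrated with a discretised functional PCA. The sketch assumes that all curves are observed on a common grid and that the three parameters are the leading principal component scores. Neither assumption is confirmed by the text, and the function name `fpca_scores` is hypothetical.

```python
import numpy as np

def fpca_scores(curves, n_components=3):
    """Discretised functional PCA via SVD.

    curves : (n_curves, n_grid) array, each row a curve on a common grid
    Returns the scores, the principal components, and the share of
    variance explained by each retained component.
    """
    mean_curve = curves.mean(axis=0)
    centered = curves - mean_curve
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    components = Vt[:n_components]
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return scores, components, explained
```

Each curve is then summarised by one row of `scores`, which could serve as the multivariate observation of the corresponding cell in the co-clustering model.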
References
- Bernardi and Sangalli [2022] M. S. Bernardi and L. M. Sangalli. Modeling Spatially Dependent Functional Data by Spatial Regression with Differential Regularization, chapter 11, pages 260–285. John Wiley & Sons, Ltd, 2022.
- Bouveyron et al. [2022] C. Bouveyron, J. Jacques, A. Schmutz, F. Simões, and S. Bottini. Co-clustering of multivariate functional data for the analysis of air pollution in the South of France. The Annals of Applied Statistics, 16(3):1400–1422, 2022.
- Fibbi et al. [2024] E. Fibbi, D. Perrotta, F. Torti, S. Van Aelst, and T. Verdonck. Co-clustering contaminated data: a robust model-based approach. Advances in Data Analysis and Classification, 18(1):121–161, 2024.
- García-Escudero et al. [2008] L. A. García-Escudero, A. Gordaliza, C. Matrán, and A. Mayo-Iscar. A general trimming approach to robust cluster analysis. The Annals of Statistics, 36(3):1324–1345, 2008.
- Lomet [2012] A. Lomet. Sélection de modèle pour la classification croisée de données continues. Thèse de doctorat, Compiègne, 2012.
- Melnykov et al. [2024] M. Melnykov, Y. Wang, Y. Melnykov, F. Torti, D. Perrotta, and M. Riani. On simulating skewed and cluster-weighted data for studying performance of clustering algorithms. Journal of Computational and Graphical Statistics, 33(1):303–309, 2024.
- Melnykov et al. [2012] V. Melnykov, W.C. Chen, and R. Maitra. MixSim: An R package for simulating data to study performance of clustering algorithms. Journal of Statistical Software, 51(12):1–25, 2012.
- Portela and Olsen [2023] C. Portela and K. Olsen. Study on implementation and monitoring of the EU sanctions regimes, including recommendations to reinforce the EU’s capacities to implement and monitor sanctions, October 2023.
- Riani et al. [2012] M. Riani, D. Perrotta, and F. Torti. FSDA: A MATLAB toolbox for robust analysis and interactive data exploration. Chemometrics and Intelligent Laboratory Systems, 116:17–32, 2012.
- Riani et al. [2015] M. Riani, A. Cerioli, D. Perrotta, and F. Torti. Simulating mixtures of multivariate data with fixed cluster overlap in FSDA library. Advances in Data Analysis and Classification, 9:461–481, 2015.
- Rousseeuw et al. [2019] P. J. Rousseeuw, D. Perrotta, M. Riani, and M. Hubert. Robust monitoring of time series with application to fraud detection. Econometrics and Statistics, 9:108–121, 2019.
- Slimen et al. [2018] Y. B. Slimen, S. Allio, and J. Jacques. Model-based co-clustering for functional data. Neurocomputing, 291:97–108, 2018.
- Torti et al. [2021] F. Torti, M. Riani, and G. Morelli. Semiautomatic robust regression clustering of international trade data. Statistical Methods & Applications, 30(3):863–894, 2021.