Fuzzy clustering with cellwise outlier detection
-
Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milano, Italy [giorgia.zaccaria@unimib.it]
-
Department of Statistics and Operational Research, University of Valladolid, Valladolid, Spain [lagarcia@uva.es]
-
Department of Statistics and Quantitative Methods, University of Milano-Bicocca, Milano, Italy [francesca.greselin@unimib.it]
-
Department of Statistics and Operational Research, University of Valladolid, Valladolid, Spain [agustin.mayo.iscar@uva.es]
Keywords: Clustering – Fuzzy approach – Constrained optimization – EM algorithm
Real data often contain outliers, i.e. values that deviate from the pattern of the majority of the data [Hampel et al., 1986]. The term usually refers to entire cases, that is, rows of a data matrix (casewise or rowwise outliers). Alqallaf et al. [2009] introduced a different paradigm in which individual cells of the data matrix are contaminated (cellwise outliers). In the single-population framework, several robust estimators have been proposed for the center and covariance matrix under both the casewise and cellwise paradigms. For the former, Rousseeuw [1984, 1985] introduced the Minimum Covariance Determinant (MCD) estimator, which estimates the parameters from a subset of cases, comprising at least half of the sample, chosen so that the determinant of the covariance matrix of these observations is minimized. More recently, Raymaekers and Rousseeuw [2023] extended this approach to the cellwise paradigm with the cellMCD estimator, computed using an Expectation-Maximization (EM) algorithm [Dempster et al., 1977]. The algorithm alternates between a C-step, where a certain fraction of cells per variable is flagged as outlying, and E- and M-steps, which treat the flagged cells as missing values, in line with the use of the EM algorithm for incomplete data.
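To make the casewise MCD idea concrete, the following is a minimal sketch of a concentration-step search for a low-determinant subset of cases. It is not the reference FAST-MCD implementation and not the cellMCD algorithm: the function names `c_step` and `mcd_sketch`, the default subset size `h = (n + p + 1) // 2`, and the use of random starting subsets of size `h` are assumptions made for illustration only.

```python
import numpy as np

def c_step(X, subset, h):
    """One concentration step: fit on `subset`, then keep the h closest cases."""
    mu = X[subset].mean(axis=0)
    sigma = np.cov(X[subset], rowvar=False)
    diff = X - mu
    # Squared Mahalanobis distance of every case to the current fit
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
    return np.sort(np.argsort(d2)[:h])

def mcd_sketch(X, h=None, n_starts=20, max_iter=100, seed=0):
    """Approximate MCD: iterate concentration steps from several random starts
    and keep the subset whose covariance has the smallest log-determinant."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if h is None:
        h = (n + p + 1) // 2  # assumed default: just over half of the cases
    best_logdet, best_subset = np.inf, None
    for _ in range(n_starts):
        subset = np.sort(rng.choice(n, size=h, replace=False))
        for _ in range(max_iter):
            new_subset = c_step(X, subset, h)
            if np.array_equal(new_subset, subset):
                break  # the subset no longer changes
            subset = new_subset
        logdet = np.linalg.slogdet(np.cov(X[subset], rowvar=False))[1]
        if logdet < best_logdet:
            best_logdet, best_subset = logdet, subset
    center = X[best_subset].mean(axis=0)
    scatter = np.cov(X[best_subset], rowvar=False)
    return center, scatter, best_subset
```

Calling `mcd_sketch(X)` on an `n × p` array returns the center, the covariance matrix, and the indices of the selected cases. A practical estimator would additionally rescale the covariance for consistency and guard against singular subsets, both of which this sketch omits.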
Outliers are particularly harmful in heterogeneous populations. In the fuzzy clustering literature, several methods have been developed to address casewise contamination. Fritz et al. [2013] proposed a robust fuzzy clustering approach called F-TCLUST, which is based on impartial trimming. The model is estimated via an EM algorithm with constrained optimization. However, the performance of F-TCLUST may deteriorate in the presence of cellwise contamination, especially when the number of variables is large, since a few outlying cells can then affect many (or even all) rows.
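A short calculation illustrates why cellwise contamination spreads to rows as the dimension grows. The sketch below assumes that each cell is contaminated independently with the same probability `eps`; this independence model and the function name `contaminated_row_fraction` are illustrative assumptions, not part of the proposal.

```python
def contaminated_row_fraction(eps: float, p: int) -> float:
    """Expected fraction of rows with at least one contaminated cell,
    assuming each of the p cells in a row is contaminated independently
    with probability eps."""
    return 1.0 - (1.0 - eps) ** p

# Fraction of affected rows for 5% cell contamination and growing dimension
for p in (5, 20, 100):
    print(p, round(contaminated_row_fraction(0.05, p), 3))
```

Under this assumption, 5% of contaminated cells already affect roughly 23% of the rows for 5 variables, 64% for 20 variables, and more than 99% for 100 variables, so a method that trims whole rows would have to discard most of the data.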
In this work, we introduce a robust fuzzy clustering approach for cellwise outlier detection. Unlike existing robust fuzzy clustering methods, our proposal relaxes the spherical assumption for the clusters while retaining the eigenvalue ratio constraint of F-TCLUST and accounting for contaminated cells in the data. The model is estimated using an EM algorithm, which includes an additional outlier detection step, followed by E- and M-steps that handle contaminated cells as missing information. The performance of the proposed methodology is evaluated through simulation studies and real data applications, demonstrating its effectiveness and advantages in scenarios with substantial cellwise contamination.
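As an illustration of what handling contaminated cells as missing information can look like, the sketch below replaces the flagged cells of one observation by their conditional mean given the unflagged cells, under a single Gaussian component with known parameters. This is the standard Gaussian conditional expectation used in EM for incomplete data, not the authors' algorithm; the function name `conditional_mean_impute` and its arguments are hypothetical.

```python
import numpy as np

def conditional_mean_impute(x, flagged, mu, sigma):
    """Replace flagged cells of x by E[x_flagged | x_unflagged] under N(mu, sigma)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    m = np.asarray(flagged, dtype=bool)  # True marks a flagged (missing) cell
    o = ~m                               # cells treated as observed
    out = x.copy()
    if not m.any():
        return out        # nothing flagged: keep the observation as it is
    if m.all():
        return mu.copy()  # everything flagged: fall back to the component mean
    s_oo = sigma[np.ix_(o, o)]
    s_mo = sigma[np.ix_(m, o)]
    # Conditional mean: mu_m + S_mo S_oo^{-1} (x_o - mu_o)
    out[m] = mu[m] + s_mo @ np.linalg.solve(s_oo, x[o] - mu[o])
    return out

# Hypothetical usage: the second cell is flagged as outlying
x_imputed = conditional_mean_impute(
    x=[2.0, 100.0],
    flagged=[False, True],
    mu=[0.0, 0.0],
    sigma=[[1.0, 0.8], [0.8, 1.0]],
)
```

In a clustering setting this computation would be carried out per component, and the covariance update in the M-step would also need the conditional covariance of the flagged cells; both are left out of this sketch.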
References
- Alqallaf et al. [2009]. Propagation of outliers in multivariate data. The Annals of Statistics 37 (1), pp. 311–331.
- Dempster et al. [1977]. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 39 (1), pp. 1–38.
- Fritz et al. [2013]. Robust constrained fuzzy clustering. Information Sciences 245, pp. 38–52.
- Hampel et al. [1986]. Robust statistics: the approach based on influence functions. John Wiley & Sons, New York.
- Raymaekers and Rousseeuw [2023]. The cellwise minimum covariance determinant estimator. Journal of the American Statistical Association 119 (548), pp. 2610–2621.
- Rousseeuw [1984]. Least median of squares regression. Journal of the American Statistical Association 79 (388), pp. 871–880.
- Rousseeuw [1985]. Multivariate estimation with high breakdown point. In Mathematical Statistics and Applications, W. G. et al. (Ed.), Vol. B, Akadémiai Kiadó, Budapest, pp. 283–297.