cellGMM: a Cellwise Robust Model-Based Clustering Approach

A. Mayo-Iscar (1), L.A. García-Escudero (1), F. Greselin (2) and G. Zaccaria (2)

  • (1) Department of Statistics and O.R. and IMUVA, University of Valladolid, Valladolid, Spain [lagarcia@uva.es, agustin.mayo.iscar@uva.es]

  • (2) Department of Statistics and Quantitative Methods, University of Milano Bicocca, Milan, Italy [francesca.greselin@unimib.it, giorgia.zaccaria@unimib.it]

Keywords: Model-based clustering – Trimming – Robustness – Cellwise outliers

1 Introduction

TCLUST (García-Escudero et al. [2008]) is a robust model-based clustering methodology able to resist the influence of outliers when estimating clusters that originate from several multivariate normal sources. Its robustness is achieved through the joint application of trimming and constraints. TCLUST can be regarded as a clustering generalization of the Minimum Covariance Determinant (MCD) methodology (Rousseeuw [1985]), which was designed for robust estimation of multivariate location and scatter. Both MCD and TCLUST are resistant to casewise contamination, in which a fixed proportion of rows in the data matrix exhibits arbitrary behavior. However, cellwise contamination (Alqallaf et al. [2009]) presents a greater challenge, as even a small proportion of outlying cells can affect a large proportion of rows, leading to the breakdown of casewise robust proposals. Raymaekers and Rousseeuw [2024] introduced cellMCD, an MCD extension capable of providing robust location and scatter estimation under cellwise contamination. There is a corresponding need for robust model-based clustering methods that are effective under cellwise contamination.
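The propagation effect described above can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that each cell is contaminated independently with the same probability; under that assumption a row with p variables contains at least one outlying cell with probability 1 - (1 - eps)^p. The function name and the example rates are hypothetical and are not taken from the cited papers.

```python
def row_contamination_probability(cell_rate: float, n_vars: int) -> float:
    """Probability that a row holds at least one contaminated cell.

    Assumption (illustrative only): cells are contaminated independently,
    each with probability `cell_rate`.
    """
    if not 0.0 <= cell_rate <= 1.0:
        raise ValueError("cell_rate must lie in [0, 1]")
    if n_vars < 1:
        raise ValueError("n_vars must be a positive integer")
    # A row is clean only if all of its n_vars cells are clean.
    return 1.0 - (1.0 - cell_rate) ** n_vars


if __name__ == "__main__":
    # With 5% contaminated cells, the share of affected rows grows quickly
    # with the number of variables.
    for n_vars in (5, 20, 100):
        share = row_contamination_probability(0.05, n_vars)
        print(f"p = {n_vars:3d}: expected share of affected rows = {share:.3f}")
```

Under this simplified model, a casewise method that tolerates fewer than half of the rows being contaminated can already be overwhelmed at moderate dimension, since well over half of the rows are affected when p = 20 and the cell rate is 5%.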

2 Our proposal

The successful performance of cellMCD in addressing cellwise contamination inspired the development of cellGMM (Zaccaria et al. [2024]), a cellwise version of the TCLUST methodology. The cellGMM approach modifies the maximum likelihood estimator of the finite mixture model by incorporating cellwise trimming, to mitigate the influence of cellwise outliers, and TCLUST constraints, to ensure a well-posed statistical procedure. This presentation will detail the cellGMM methodology and provide empirical evidence of its performance in identifying clusters under cellwise contamination when applied to artificial and real-world data.
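As an illustration of the kind of constraint involved, TCLUST bounds the ratio between the largest and smallest eigenvalues of the cluster scatter matrices by a constant c, which prevents degenerate, near-singular components. The sketch below only checks that condition and is restricted to 2 x 2 symmetric matrices so that it needs no external libraries; it is not the cellGMM implementation, and all function names are hypothetical.

```python
import math


def eigenvalues_sym_2x2(matrix):
    """Eigenvalues (smallest, largest) of a symmetric 2 x 2 matrix."""
    (a, b), (b2, d) = matrix
    if not math.isclose(b, b2):
        raise ValueError("matrix must be symmetric")
    centre = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return centre - radius, centre + radius


def eigenvalue_ratio(scatter_matrices):
    """Largest over smallest eigenvalue, pooled across all clusters."""
    eigenvalues = [
        value
        for matrix in scatter_matrices
        for value in eigenvalues_sym_2x2(matrix)
    ]
    smallest = min(eigenvalues)
    if smallest <= 0.0:
        raise ValueError("scatter matrices must be positive definite")
    return max(eigenvalues) / smallest


def satisfies_eigenvalue_constraint(scatter_matrices, c):
    """True when the pooled eigenvalue ratio does not exceed c (c >= 1)."""
    return eigenvalue_ratio(scatter_matrices) <= c


if __name__ == "__main__":
    # Two hypothetical cluster scatter matrices.
    scatters = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[4.0, 0.0], [0.0, 2.0]],
    ]
    print(eigenvalue_ratio(scatters))
    print(satisfies_eigenvalue_constraint(scatters, c=10.0))
```

In an actual estimation procedure the scatter matrices would be adjusted to meet the bound at each iteration rather than merely checked; that step is omitted here.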

References

  • Alqallaf et al. [2009] F. Alqallaf, S. Van Aelst, V. J. Yohai, and R. H. Zamar. Propagation of outliers in multivariate data. The Annals of Statistics, 37(1):311–331, 2009.
  • García-Escudero et al. [2008] L. A. García-Escudero, A. Gordaliza, C. Matrán, and A. Mayo-Iscar. A general trimming approach to robust cluster analysis. The Annals of Statistics, 36(3):1324–1345, 2008.
  • Raymaekers and Rousseeuw [2024] J. Raymaekers and P. J. Rousseeuw. The cellwise minimum covariance determinant estimator. Journal of the American Statistical Association, 119(548):2610–2621, 2024.
  • Rousseeuw [1985] P. J. Rousseeuw. Multivariate estimation with high breakdown point. In W. Grossmann et al., editors, Mathematical Statistics and Applications, volume B, pages 283–297. Akadémiai Kiadó, Budapest, 1985.
  • Zaccaria et al. [2024] G. Zaccaria, L. A. García-Escudero, F. Greselin, and A. Mayo-Iscar. Cellwise outlier detection in heterogeneous populations. arXiv preprint arXiv:2409.07881, 2024.