Cell-TRICC: a Model-Based Approach to Cellwise-Trimmed Co-Clustering

E. Fibbi1, D. Perrotta2, F. Torti2, S. Van Aelst1 and T. Verdonck1,3

  • 1 Department of Mathematics, KU Leuven, Leuven, Belgium [edoardo.fibbi@kuleuven.be, stefan.vanaelst@kuleuven.be]

  • 2 European Commission, Joint Research Centre, Ispra, Italy [domenico.perrotta@ec.europa.eu, francesca.torti@ec.europa.eu]

  • 3 University of Antwerp – imec, Antwerp, Belgium [tim.verdonck@uantwerpen.be]

Keywords: Co-clustering – Trimming – Robustness – Cellwise outliers – Count data

1 Introduction

Co-clustering is the simultaneous partitioning of the rows and columns of a data matrix. It is an unsupervised technique well suited to exploring high-dimensional data and extracting block patterns from them.

Real data often contain outliers: such anomalous values can impair standard co-clustering techniques, yet they may themselves be interesting pieces of information. In fact, most existing methods, whether they rely on maximum likelihood estimation or on the optimization of other cost functions, can be shown to be highly sensitive to outliers. At the same time, several applications call for simultaneous data summarization and anomaly detection. Despite this need and the wide applicability of co-clustering, very little literature is concerned with outlier-robust co-clustering approaches.

2 Our proposal

In this work we build on the well-known framework of Latent Block Models (LBMs, Govaert and Nadif [2003]), extending the application of impartial trimming in LBMs [Cuesta-Albertos et al., 1997, Fibbi et al., 2023] to the case of cellwise contamination [Alqallaf et al., 2009, Raymaekers and Rousseeuw, 2024]. We thus propose Cell-TRICC, a novel cellwise-trimmed co-clustering method. We focus on count data modeled by Poisson distributions, but the main ideas are general. Robustness is sought through cellwise trimming, implemented as an additional step in a Stochastic EM algorithm [Keribin et al., 2010]. The number of groups is selected through novel trimmed integrated complete likelihood (ICL) criteria, while a heuristic strategy is used to tune the trimming level.
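To make the idea of cellwise trimming more concrete, the following is a minimal sketch of what a single trimming step could look like for Poisson-distributed counts. It is not the Cell-TRICC algorithm itself: it assumes that row labels, column labels and block means are already given (whereas a Stochastic EM scheme would update them iteratively), it flags a fixed fraction alpha of cells with the lowest Poisson log-likelihood, and it then recomputes block means from the remaining cells. All function names, the toy data and the small positive floor eps are hypothetical choices made for illustration only.

```python
import math


def poisson_logpmf(x, mu):
    """Log of the Poisson probability mass function at count x, mean mu > 0."""
    return x * math.log(mu) - mu - math.lgamma(x + 1)


def cellwise_trim(X, row_labels, col_labels, block_means, alpha):
    """Flag the floor(alpha * n * p) cells with the lowest log-likelihood.

    Returns a mask (list of lists of bool); True means the cell is trimmed.
    """
    n, p = len(X), len(X[0])
    scores = []
    for i in range(n):
        for j in range(p):
            mu = block_means[row_labels[i]][col_labels[j]]
            scores.append((poisson_logpmf(X[i][j], mu), i, j))
    scores.sort()  # ascending: least likely cells come first
    n_trim = int(math.floor(alpha * n * p))
    mask = [[False] * p for _ in range(n)]
    for _, i, j in scores[:n_trim]:
        mask[i][j] = True
    return mask


def update_block_means(X, row_labels, col_labels, mask, n_row_groups,
                       n_col_groups, eps=1e-8):
    """Recompute each block mean using untrimmed cells only.

    eps is an illustrative floor (an assumption) that keeps means positive
    when a block is empty or contains only zeros.
    """
    sums = [[0.0] * n_col_groups for _ in range(n_row_groups)]
    counts = [[0] * n_col_groups for _ in range(n_row_groups)]
    for i, row in enumerate(X):
        for j, value in enumerate(row):
            if not mask[i][j]:
                k, l = row_labels[i], col_labels[j]
                sums[k][l] += value
                counts[k][l] += 1
    return [
        [max(sums[k][l] / counts[k][l], eps) if counts[k][l] > 0 else eps
         for l in range(n_col_groups)]
        for k in range(n_row_groups)
    ]


# Toy 5 x 4 count matrix with 2 row groups and 2 column groups (made-up data).
# The cell X[2][1] = 50 is an artificial cellwise outlier.
X = [
    [2, 3, 10, 11],
    [3, 2, 9, 12],
    [2, 50, 11, 10],
    [20, 21, 1, 2],
    [19, 22, 2, 1],
]
row_labels = [0, 0, 0, 1, 1]
col_labels = [0, 0, 1, 1]
block_means = [[2.5, 10.5], [20.5, 1.5]]

mask = cellwise_trim(X, row_labels, col_labels, block_means, alpha=0.05)
new_means = update_block_means(X, row_labels, col_labels, mask, 2, 2)
```

In this toy setting a single cell is trimmed, and the mean of the affected block is then estimated without the contaminated value. In a full procedure such a step would presumably alternate with updates of the labels and parameters, and the trimming level would need tuning, as the text above indicates.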

3 Application

The effectiveness of our approach is tested through simulations, and a fully-fledged application to a trade monitoring problem is presented. The real-data analysis shows the potential of the proposed methodology both as a data exploration tool and as an anomaly detection tool, and is further enriched by insights drawn from original visualizations and diagnostic plots.

References

  • Alqallaf et al. [2009] F. Alqallaf, S. Van Aelst, V. J. Yohai, and R. H. Zamar. Propagation of outliers in multivariate data. The Annals of Statistics, 37(1):311–331, 2009. doi: 10.1214/07-AOS588.
  • Cuesta-Albertos et al. [1997] J. A. Cuesta-Albertos, A. Gordaliza, and C. Matrán. Trimmed k-means: an attempt to robustify quantizers. The Annals of Statistics, 25(2):553–576, 1997.
  • Fibbi et al. [2023] E. Fibbi, D. Perrotta, F. Torti, S. Van Aelst, and T. Verdonck. Co-clustering contaminated data: a robust model-based approach. Advances in Data Analysis and Classification, 2023. doi: 10.1007/s11634-023-00549-3.
  • Govaert and Nadif [2003] G. Govaert and M. Nadif. Clustering with block mixture models. Pattern Recognition, 36(2):463–473, 2003. Biometrics.
  • Keribin et al. [2010] C. Keribin, G. Govaert, and G. Celeux. Estimation d’un modèle à blocs latents par l’algorithme SEM. In 42èmes Journées de Statistique, 2010.
  • Raymaekers and Rousseeuw [2024] J. Raymaekers and P. J. Rousseeuw. Challenges of cellwise outliers. Econometrics and Statistics, 2024. ISSN 2452-3062. doi: 10.1016/j.ecosta.2024.02.002.