Trimmed factorial k-means: a simulation study

M. Farnè¹

¹ Department of Statistical Sciences, University of Bologna, Bologna, Italy [matteo.farne@unibo.it]

Keywords: Robust clustering – Dimension reduction – Factorial k-means – Trimming

1 Introduction

Clustering multivariate data in high dimensions is a challenging task [Bouveyron and Brunet-Saumard, 2014]. First, agglomerative hierarchical methods like Ward’s method [Ward, 1963] may quickly become computationally intractable as the dimension p increases. Second, partitioning methods like the k-means algorithm [MacQueen, 1967] may be very unstable in high dimensions, due to numerical instability and multicollinearity. Third, dimension reduction is imperative to avoid the curse of dimensionality and to obtain interpretable partitions [Ding, 2009]. Fourth, when the sample size n is also large, robust methods are often needed [García-Escudero et al., 2010], due to the likely presence of outliers that can distort the resulting partition. For these reasons, the need arises to develop clustering procedures that combine robustness and dimension reduction.

The earliest dimension reduction tool is principal component analysis (PCA) [Hotelling, 1933], a geometric procedure that represents a high-dimensional dataset by means of a much lower-dimensional one. As data dimensions grew, it became evident [Hubert and Arabie, 1994] that existing clustering methods could be applied to the first extracted principal components. However, it also became apparent that the directions identified by PCA may not be optimal for clustering purposes, as they neglect the cluster structure present in the sample. This fact was pointed out in [De Soete and Carroll, 1994] as the motivation for reduced k-means (RKM), a method to cluster objects in a low-dimensional space. The strong consistency of RKM was proved for independent and identically distributed data in [Terada, 2014].

RKM, in turn, may still mask the most relevant directions for clustering purposes when the directions explaining most of the variance are far from the directions that best separate the objects. For this reason, [Vichi and Kiers, 2001] proposed factorial k-means (FKM), a method that identifies the latent space in which the projected objects are most distinct. The strong consistency of FKM was proved in [Terada, 2015]; the proof builds on the original strong consistency proof for k-means clustering in [Pollard, 1981].
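In the notation of [Vichi and Kiers, 2001], with $X$ the $n \times p$ data matrix, $U$ the $n \times k$ binary cluster membership matrix, $\bar{Y}$ the $k \times q$ matrix of centroids in the reduced space, and $A$ a $p \times q$ columnwise orthonormal loading matrix, the FKM objective can be written as

$$\min_{U, A, \bar{Y}} \; \lVert XA - U\bar{Y} \rVert^2 \quad \text{subject to } A^\top A = I_q,$$

so that the subspace and the partition are chosen jointly to minimize the within-cluster deviance of the projected points.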

Given these premises, in this paper we present a robust version of factorial k-means, in which outliers are iteratively removed via a trimming procedure in the low-dimensional space, thus simultaneously identifying radial outliers and producing better shaped clusters. This is obtained by minimising the trimmed least squares criterion in the low-dimensional space. The resulting method is called trimmed factorial k-means (TFKM); the algorithm is detailed in [Farnè, 2023]. In this paper, we propose a new simulation study showing the effectiveness of TFKM against competitors such as the trimmed k-means of [Cuesta-Albertos et al., 1997] (via the R package trimcluster by Christian Hennig), and a robust tandem procedure which first estimates the robust principal components as proposed in [Hubert et al., 2005] (via the R package robpca), and then applies the trimmed k-means of [Cuesta-Albertos et al., 1997]. Importantly, we also study a new criterion to select the latent rank, the number of clusters and the proportion of outliers, based on a sequence of F-tests on the within-cluster deviance in the reduced space. We then present an illustrative example involving Internet cookies data.
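The exact TFKM algorithm is given in [Farnè, 2023]; the following is only a minimal Python sketch of an alternating scheme of the kind described above, under stated assumptions. The function name `tfkm_sketch` and all implementation choices (random initialization, update order, empty-cluster handling) are illustrative and not the published algorithm.

```python
import numpy as np

def tfkm_sketch(X, k, q, alpha, n_iter=50, seed=0):
    """Illustrative sketch of trimmed factorial k-means.

    X: (n, p) data; k clusters; q latent dimensions;
    alpha: fraction of points trimmed as outliers.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = int(np.floor(n * (1 - alpha)))                 # number of retained points
    A = np.linalg.qr(rng.standard_normal((p, q)))[0]   # random orthonormal loadings
    C = (X @ A)[rng.choice(n, size=k, replace=False)]  # initial latent centroids
    for _ in range(n_iter):
        Z = X @ A                                      # project into the latent space
        d2 = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                     # nearest-centroid assignment
        keep = np.argsort(d2.min(axis=1))[:m]          # trim the alpha share farthest away
        Xk, lk = X[keep], labels[keep]
        # loadings step (as in FKM): directions of smallest within-cluster
        # scatter of the retained points
        R = Xk.copy()
        for g in range(k):
            mask = lk == g
            if mask.any():
                R[mask] -= Xk[mask].mean(axis=0)
        A = np.linalg.eigh(R.T @ R)[1][:, :q]          # eigh: ascending eigenvalues
        # centroid step, in the updated latent space
        Zk = Xk @ A
        C = np.vstack([Zk[lk == g].mean(axis=0) if (lk == g).any()
                       else Zk[rng.integers(m)] for g in range(k)])
    Z = X @ A
    d2 = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    flagged = np.argsort(d2.min(axis=1))[m:]           # points flagged as outliers
    return labels, flagged, A
```

The loadings step takes the eigenvectors of the within-cluster scatter matrix with smallest eigenvalues, which is the FKM solution for A given a partition; the trimming step retains the (1 − alpha) share of points closest to their centroid in the latent space, in the spirit of trimmed k-means.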

References

  • C. Bouveyron and C. Brunet-Saumard (2014) Model-based clustering of high-dimensional data: a review. Computational Statistics and Data Analysis 71, pp. 52–78.
  • J. Cuesta-Albertos, A. Gordaliza, and C. Matrán (1997) Trimmed k-means: an attempt to robustify quantizers. Ann Stat 25, pp. 553–576.
  • G. De Soete and J. Carroll (1994) K-means clustering in a low-dimensional Euclidean space. In New Approaches In Classification And Data Analysis, E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy (Eds.), pp. 212–219.
  • C. Ding (2009) Dimension reduction techniques for clustering. In Encyclopedia of Database Systems, L. Liu and M.T. Özsu (Eds.).
  • M. Farnè (2023) Trimmed factorial k-means. In Book Of Abstracts And Short Papers, pp. 148–151. ISBN 978-88-9193-563-2.
  • L. A. García-Escudero, A. Gordaliza, C. Matrán, and A. Mayo-Iscar (2010) A review of robust clustering methods. Adv Data Anal Classif 4, pp. 89–109.
  • H. Hotelling (1933) Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24 (6), pp. 417–441.
  • L. Hubert and P. Arabie (1994) The analysis of proximity matrices through sums of matrices having anti-Robinson forms. Br J Math Stat Psychol 47, pp. 1–40.
  • M. Hubert, P. Rousseeuw, and K. Vanden Branden (2005) ROBPCA: a new approach to robust principal component analysis. Technometrics 47, pp. 64–79.
  • J. MacQueen (1967) Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Math. Statist. Probability, pp. 281–297.
  • D. Pollard (1981) Strong consistency of k-means clustering. Ann Stat 9, pp. 135–140.
  • Y. Terada (2014) Strong consistency of reduced k-means clustering. Scand Stat Theory Appl 41, pp. 913–931.
  • Y. Terada (2015) Strong consistency of factorial k-means clustering. Ann Inst Stat Math 67, pp. 335–357.
  • M. Vichi and H. Kiers (2001) Factorial k-means analysis for two-way data. Comput Stat Data Anal 37, pp. 49–64.
  • J. Ward (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58, pp. 236–244.