
Sparse Feature Group K-Means

A. W. Diallo${}^{a}$, M. Ouattara${}^{b}$, N. Niang${}^{c}$ and M. Bousso${}^{d}$

${}^{a,d}$ Université Iba Der Thiam, Thies, Senegal

${}^{b}$ Université Polytechnique de San Pedro, San Pedro, Côte d’Ivoire

${}^{c}$ CEDRIC – CNAM, Paris, France

We address the problem of clustering high-dimensional multiblock data, where features are organized into homogeneous blocks representing different views of the data. Traditional subspace clustering methods like Entropy Weighting K-Means (EWKM) [1] and Feature Group K-Means (FGKM) [2] assign continuous weights to features or feature blocks, requiring tedious post-hoc analysis to identify cluster-relevant elements based on weight magnitudes. Sparse clustering methods such as Sparse K-Means [3] and Sparse Subspace K-Means [4] offer automatic feature selection but do not exploit the block structure inherent in multiblock data.

To address this limitation, we propose SFGKM (Sparse Feature Group K-Means), which extends Sparse Subspace K-Means (SSKM) [4] to incorporate multiblock data structure. SFGKM optimizes a single criterion with lasso-type penalties at two levels, the individual feature and the feature block, so that one procedure simultaneously performs observation clustering, cluster-specific feature selection, and cluster-specific block selection. The method automatically sets the weights of irrelevant features and blocks to zero while assigning large weights to the elements that characterize each cluster, eliminating manual weight analysis.

The optimization follows an alternating maximization scheme that iteratively updates cluster assignments, feature weights within blocks, and block weights using soft-thresholding operators until convergence. Experimental validation is conducted on synthetic datasets with varying complexity and noise patterns, as well as on real-world multiblock datasets covering diverse domains and block structures. Results show competitive performance on synthetic data, where SFGKM correctly identifies cluster-specific relevant blocks and automatically zeros out noise features. On real-world data, SFGKM achieves superior clustering performance compared to existing methods, demonstrating that explicitly modeling block structure is essential when views must be treated as coherent units rather than collections of independent features.
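As an illustration of the soft-thresholding step mentioned above, the following is a minimal sketch of a sparse weight update in the style of Sparse K-Means [3], not the SFGKM update itself. The function names `soft_threshold` and `sparse_weights`, the `scores` input (assumed to be per-feature relevance scores such as between-cluster dispersions), and the `l1_bound` parameter are assumptions introduced for this example.

```python
import numpy as np


def soft_threshold(x, delta):
    """Soft-thresholding operator: sign(x) * max(|x| - delta, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - delta, 0.0)


def sparse_weights(scores, l1_bound, n_iter=60):
    """Hypothetical helper: unit-L2-norm weights with L1 norm <= l1_bound.

    Follows the Sparse K-Means style update of [3]; assumes l1_bound >= 1.
    The threshold delta is found by bisection.
    """
    a = np.maximum(np.asarray(scores, dtype=float), 0.0)
    norm = np.linalg.norm(a)
    if norm == 0.0:
        return a  # no informative feature: all weights stay at zero
    w = a / norm
    if w.sum() <= l1_bound:
        return w  # the L1 constraint is already satisfied, no thresholding

    lo, hi = 0.0, a.max()
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        w = soft_threshold(a, mid)
        norm = np.linalg.norm(w)
        if norm == 0.0 or w.sum() / norm <= l1_bound:
            hi = mid  # threshold is large enough, try a smaller one
        else:
            lo = mid  # threshold is too small, weights not sparse enough

    w = soft_threshold(a, hi)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0.0 else w


# Two informative features and two near-noise features.
weights = sparse_weights([5.0, 4.0, 0.1, 0.05], l1_bound=1.2)
```

In this toy call, the two low-score features receive a weight of exactly zero, while the two informative ones keep positive weights. A block-level update could apply the same operator to block scores, but how SFGKM couples the two levels is not specified here.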

Keywords: Sparse Clustering, Soft Subspace Clustering, Feature Selection, Multiblock Data.

References

[1] L. Jing, M. Ng, J. Huang (2007). An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Transactions on Knowledge and Data Engineering, 1026–1041.
[2] X. Chen, Y. Ye, X. Xu, J. Z. Huang (2012). A feature group weighting method for subspace clustering of high-dimensional data. Pattern Recognition, 434–446.
[3] D. M. Witten, R. Tibshirani (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 713–726.
[4] A. W. Diallo, N. Niang, M. Ouattara (2021). Sparse subspace k-means. In ICDMW 2021, 678–685.