Cellwise-robust grouped GMM for smoothed covariance estimation
- Patricia Puchhammer, Institute of Statistics and Mathematical Methods in Economics, TU Wien, Vienna, Austria [patricia.puchhammer@tuwien.ac.at]
- Ines Wilms, Department of Quantitative Economics, Maastricht University, Maastricht, The Netherlands [i.wilms@maastrichtuniversity.nl]
- Peter Filzmoser, Institute of Statistics and Mathematical Methods in Economics, TU Wien, Vienna, Austria [peter.filzmoser@tuwien.ac.at]
Keywords: Multi-Group/Multi-Source Data – Gaussian Mixture Models – Cellwise Outliers
Many robust covariance estimation techniques assume one coherent data set while accounting for rowwise outliers or outlying cells. Yet many applications involve multiple related data sets from different sources, or non-coherent data sets whose observations can naturally be divided into several more coherent groups with shared global patterns. One example is spatial data, where groups can be based on, e.g., spatial proximity. Existing methods for covariance estimation can be applied either globally or to each group individually. The global approach can lead to misleading conclusions, since the data points are not coherent, and it allows no conclusions for individual groups. The groupwise approach yields results that are representative for each group, but it ignores information from other groups or sources that could be leveraged to take shared patterns into account. Thus, grouped data should be analyzed jointly, and the estimated covariance matrices should reflect shared patterns.
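To this end, we model the data with Gaussian mixture models (GMMs). Let $X_1, \ldots, X_N$ be data sets from $N$ groups, where $X_g$ consists of $n_g$ independent observations $x_{g,i}$, for $i = 1, \ldots, n_g$, of the same $p$ variables. Assume that each observation from group $g$ originates from a Gaussian mixture
$$x_{g,i} \sim \sum_{k=1}^{N} \pi_{g,k}\, \varphi(\cdot\,; \mu_k, \Sigma_k), \qquad \pi_{g,k} \geq 0, \quad \sum_{k=1}^{N} \pi_{g,k} = 1,$$
for $g = 1, \ldots, N$, where $\varphi(\cdot\,; \mu_k, \Sigma_k)$ denotes the multivariate normal density with mean $\mu_k$ and covariance matrix $\Sigma_k$, for $k = 1, \ldots, N$. Based on the assumption that each individual group is coherent, assume that the weight $\pi_{g,g}$ of group $g$'s own distribution is bounded from below. Each group thus has a main distribution but can share patterns by allowing the transition to other distributions. The resulting covariance matrix of group $g$ is the covariance matrix of distribution $g$ smoothed over the covariance matrices of the $N$ distributions with smoothing weights $\pi_{g,k}$,
$$\operatorname{Cov}(x_{g,i}) = \sum_{k=1}^{N} \pi_{g,k}\, \Sigma_k + \sum_{k=1}^{N} \pi_{g,k}\, (\mu_k - \bar{\mu}_g)(\mu_k - \bar{\mu}_g)^\top,$$
where $\bar{\mu}_g = \sum_{k=1}^{N} \pi_{g,k}\, \mu_k$ is the expected value of an observation from group $g$.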
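The smoothing step can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it uses the general fact that the covariance of a mixture equals the weighted average of the component covariances plus the weighted spread of the component means around the mixture mean. The function name `smoothed_covariance` and all example values are hypothetical.

```python
import numpy as np


def smoothed_covariance(weights, means, covs):
    """Covariance matrix of a Gaussian mixture for one group.

    weights: shape (N,), smoothing weights of the group (assumed to sum to 1)
    means:   shape (N, p), means of the N distributions
    covs:    shape (N, p, p), covariance matrices of the N distributions
    """
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    # Expected value of an observation from the group.
    group_mean = weights @ means
    # Weighted average of the component covariance matrices.
    within = np.einsum("k,kij->ij", weights, covs)
    # Weighted spread of the component means around the group mean.
    centered = means - group_mean
    between = np.einsum("k,ki,kj->ij", weights, centered, centered)
    return within + between


# Hypothetical example: two groups, two variables, viewed from group 1,
# which puts weight 0.8 on its own distribution and 0.2 on the other one.
weights_g = [0.8, 0.2]
means = [[0.0, 0.0], [1.0, 0.0]]
covs = [np.eye(2), 2.0 * np.eye(2)]
cov_g = smoothed_covariance(weights_g, means, covs)
print(cov_g)
```

With all weight on a group's own distribution, the function returns that distribution's covariance matrix unchanged, so the amount of smoothing is governed entirely by the off-diagonal weights.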
Similar to the approaches of Raymaekers and Rousseeuw [2023] and Zaccaria et al. [2024], cellwise robustness is obtained through a missing-value approach in the likelihood: outlying cells are set to missing, as indicated by one matrix per group consisting of binary vectors, one for each observation. The objective function is minimized subject to constraints. It involves constants, the subset of observed variables of each observation, regularizing factors and matrices for high-dimensional settings, and a minimal number of cells that must remain observed per variable and group.
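The missing-value idea can be sketched in code: the mixture density of an observation is evaluated only on the cells flagged as observed, using the corresponding sub-vectors of the means and sub-matrices of the covariances. This is an illustrative sketch under assumptions, not the authors' objective function. The names `gaussian_logpdf` and `observed_mixture_loglik` and the example values are hypothetical, and the constants, regularization, and constraints of the full objective are omitted.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate normal distribution at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.size * np.log(2.0 * np.pi) + logdet + maha)


def observed_mixture_loglik(x, w, weights, means, covs):
    """Mixture log-likelihood of x restricted to cells with w == 1."""
    obs = np.flatnonzero(w)
    if obs.size == 0:
        # No observed cells: the marginal density is 1, so its log is 0.
        return 0.0
    weights = np.asarray(weights, dtype=float)
    logs = np.array([
        gaussian_logpdf(x[obs], np.asarray(mu)[obs], np.asarray(S)[np.ix_(obs, obs)])
        for mu, S in zip(means, covs)
    ])
    # Numerically stable log of the weighted sum of component densities.
    shift = logs.max()
    return shift + np.log(np.sum(weights * np.exp(logs - shift)))


# Hypothetical example: the second cell of x is a gross outlier.
weights_g = [0.8, 0.2]
means = [np.zeros(2), np.array([1.0, 0.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
x = np.array([0.0, 100.0])
ll_full = observed_mixture_loglik(x, np.array([1, 1]), weights_g, means, covs)
ll_masked = observed_mixture_loglik(x, np.array([1, 0]), weights_g, means, covs)
print(ll_full, ll_masked)
```

Flagging the outlying cell as missing raises the likelihood contribution of the observation considerably. Without a cost for each flagged cell and a lower bound on the number of observed cells, flagging everything would be optimal, which is why an objective of this kind needs such terms.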
Theoretical results include the breakdown point for an adaptation of the ideal setting described in Hennig [2004] and Cuesta-Albertos et al. [2008] to the paradigm of cellwise outliers for multi-group data.
The proposed methodology, implemented via an EM-type algorithm, performs well in simulations compared to the cellMCD [Raymaekers and Rousseeuw, 2023] and other methods for both low- and high-dimensional grouped data, and its potential is illustrated on Austrian weather data.
Acknowledgements: Co-funded by the European Union (SEMACRET, Grant Agreement no. 101057741) and UKRI (UK Research and Innovation). Ines Wilms is supported by a grant from the Dutch Research Council (NWO), research program Vidi under the grant number VI.Vidi.211.032.
References
- Cuesta-Albertos et al. [2008] Juan Cuesta-Albertos, Carlos Matrán, and Agustín Mayo-Íscar. Robust estimation in the normal mixture model based on robust clustering. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(4):779–802, 2008.
- Hennig [2004] Christian Hennig. Breakdown points for maximum likelihood estimators of location–scale mixtures. The Annals of Statistics, 32(4):1313–1340, 2004.
- Raymaekers and Rousseeuw [2023] Jakob Raymaekers and Peter J Rousseeuw. The cellwise minimum covariance determinant estimator. Journal of the American Statistical Association, pages 1–12, 2023.
- Zaccaria et al. [2024] Giorgia Zaccaria, Luis A García-Escudero, Francesca Greselin, and Agustín Mayo-Íscar. Cellwise outlier detection in heterogeneous populations. arXiv preprint arXiv:2409.07881, 2024.