Cellwise-robust grouped GMM for smoothed covariance estimation

P. Puchhammer¹, I. Wilms², and P. Filzmoser³

  • ¹ Institute of Statistics and Mathematical Methods in Economics, TU Wien, Vienna, Austria [patricia.puchhammer@tuwien.ac.at]
  • ² Department of Quantitative Economics, Maastricht University, Maastricht, The Netherlands [i.wilms@maastrichtuniversity.nl]
  • ³ Institute of Statistics and Mathematical Methods in Economics, TU Wien, Vienna, Austria [peter.filzmoser@tuwien.ac.at]

Keywords: Multi-Group/Multi-Source Data – Gaussian Mixture Models – Cellwise Outliers

Many robust covariance estimation techniques assume a single coherent data set while accounting for rowwise outliers or outlying cells. Yet many applications involve multiple related data sets from different sources, or non-coherent data sets whose observations can naturally be divided into several more coherent groups with shared global patterns. One example is spatial data, where groups can be based on, e.g., spatial proximity. Existing methods for covariance estimation can be applied either globally or to each group individually. The global approach can lead to misleading conclusions, since the data points are not coherent, and it permits no conclusions about individual groups. The groupwise approach provides results that are representative of each group, but it ignores additional information from other groups or sources that could be leveraged to take shared patterns into account. Thus, grouped data should be analyzed jointly, and the estimated covariance matrices should reflect shared patterns.

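To this end we choose to model the data based on Gaussian Mixture Models (GMM). Let $\boldsymbol{X}_1, \boldsymbol{X}_2, \ldots, \boldsymbol{X}_N$ be data sets from $N$ groups, where $\boldsymbol{X}_g = (\boldsymbol{x}_{g,1}, \ldots, \boldsymbol{x}_{g,n_g})^\top \in \mathbb{R}^{n_g \times p}$ consists of $n_g$ independent observations, for $g = 1, \ldots, N$, of the same $p$ variables. Assume that each observation $\boldsymbol{x}_{g,i}$ from group $g$ originates from a Gaussian mixture

$$\boldsymbol{x}_{g,i} \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \quad \text{with probability } \pi_{g,k} \geq 0$$

for $k = 1, \ldots, N$, and where $\varphi(\boldsymbol{x}_{g,i}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ is defined as the multivariate normal density for $\boldsymbol{x}_{g,i}$. Based on the assumption that each individual group is coherent, assume $\pi_{g,g} \geq 0.5$, thus each group has a main distribution but can share patterns by allowing the transition to other distributions. The resulting covariance matrix of a group $g$ is the one from distribution $g$ smoothed over the covariance matrices from the distributions with smoothing weights $\pi_{g,k}$,

$$\operatorname{Cov}[\boldsymbol{x}_g] = \sum_{k=1}^{N} \pi_{g,k} \boldsymbol{\Sigma}_k + \sum_{k=1}^{N} \pi_{g,k} \left(\boldsymbol{\mu}_k - \mathbb{E}[\boldsymbol{x}_g]\right)\left(\boldsymbol{\mu}_k - \mathbb{E}[\boldsymbol{x}_g]\right)^\top,$$

where $\mathbb{E}[\boldsymbol{x}_g] = \sum_{k=1}^{N} \pi_{g,k} \boldsymbol{\mu}_k$ is the expected value of an observation $\boldsymbol{x}_g$ from group $g$.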
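The smoothed covariance of a group can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function name `smoothed_covariance`, the array layout, and the toy parameter values are assumptions made for illustration only.

```python
import numpy as np

def smoothed_covariance(pi_g, mus, Sigmas):
    """Mean and covariance of one group under the mixture model.

    pi_g   : (N,) mixture weights of group g (non-negative, summing to 1)
    mus    : (N, p) component means mu_k
    Sigmas : (N, p, p) component covariance matrices Sigma_k
    """
    pi_g = np.asarray(pi_g, dtype=float)
    mus = np.asarray(mus, dtype=float)
    Sigmas = np.asarray(Sigmas, dtype=float)

    # E[x_g] = sum_k pi_{g,k} mu_k
    mean_g = pi_g @ mus
    # First term: weighted average of the component covariances
    within = np.einsum("k,kij->ij", pi_g, Sigmas)
    # Second term: weighted outer products of the mean deviations
    dev = mus - mean_g
    between = np.einsum("k,ki,kj->ij", pi_g, dev, dev)
    return mean_g, within + between

# Toy values (hypothetical): N = 2 distributions, p = 2 variables,
# group 1 mostly follows its own distribution (pi_{1,1} = 0.8 >= 0.5).
pi_g = np.array([0.8, 0.2])
mus = np.array([[0.0, 0.0], [2.0, 0.0]])
Sigmas = np.stack([np.eye(2), 2.0 * np.eye(2)])

mean_g, cov_g = smoothed_covariance(pi_g, mus, Sigmas)
print(mean_g)
print(cov_g)
```

In this toy setting the second term only affects the first variable, because the two component means differ only in that coordinate.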

Similar to the approach of Raymaekers and Rousseeuw [2023] and Zaccaria et al. [2024], cellwise robustness is included by using a missing value approach in the likelihood, in which outlying cells are treated as missing values. These cells are indicated by the matrices $\boldsymbol{W} = (\boldsymbol{W}_g)_{g=1}^{N}$ consisting of binary vectors $\boldsymbol{w}_{g,i}$, $i = 1, \ldots, n_g$. The objective function to minimize is

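$$-2 \sum_{g=1}^{N} \sum_{i=1}^{n_g} \left[ \ln\left( \sum_{k=1}^{N} \pi_{g,k}\, \varphi\left(\boldsymbol{x}_{g,i}^{(\boldsymbol{w}_{g,i})}; \boldsymbol{\mu}_k^{(\boldsymbol{w}_{g,i})}, \boldsymbol{\Sigma}_{\mathrm{reg},k}^{(\boldsymbol{w}_{g,i})}\right) \right) + \sum_{j=1}^{p} q_{g,ij} (1 - w_{g,ij}) \right]$$

subject to

$$
\begin{aligned}
\boldsymbol{\Sigma}_{\mathrm{reg},k} &= (1 - \rho_k) \boldsymbol{\Sigma}_k + \rho_k \boldsymbol{T}_k \\
\|\boldsymbol{W}_{g,\cdot j}\|_0 &\geq h_g, \quad j = 1, \ldots, p, \; g = 1, \ldots, N \\
\sum_{k=1}^{N} \pi_{g,k} &= 1, \quad g = 1, \ldots, N \\
\pi_{g,g} &\geq \alpha \geq 0.5,
\end{aligned}
$$

where $q_{g,ij}$ are constants, the superscript $(\boldsymbol{w})$ denotes the subset of observed variables, $\rho_k$ and $\boldsymbol{T}_k$ are regularizing factors and matrices for high-dimensional settings, and $h_g$ is the minimal number of cells observed per variable and group $g$.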
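To make the missing value approach concrete, the sketch below evaluates the log mixture density of a single observation on its observed cells only, using regularized covariance matrices. It is an illustrative assumption rather than the authors' code: the names `regularize`, `gaussian_logpdf`, and `observed_mixture_loglik` are hypothetical, cells with indicator 1 are taken as observed, and the penalty term and the constraints are left out.

```python
import numpy as np

def regularize(Sigma, T, rho):
    """Sigma_reg = (1 - rho) * Sigma + rho * T."""
    return (1.0 - rho) * Sigma + rho * T

def gaussian_logpdf(x, mu, Sigma):
    """Log density of N(mu, Sigma) at x, computed via a Cholesky factor."""
    diff = x - mu
    L = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(L, diff)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (x.shape[0] * np.log(2.0 * np.pi) + logdet + z @ z)

def observed_mixture_loglik(x, w, pi_g, mus, Sigmas_reg):
    """ln sum_k pi_{g,k} phi(x^(w); mu_k^(w), Sigma_reg,k^(w)).

    w is a binary vector: 1 marks an observed cell, 0 a flagged cell.
    Flagged cells are marginalized out by subsetting mean and covariance.
    """
    obs = np.asarray(w, dtype=bool)
    if not obs.any():
        # No observed cells: the weights sum to 1, so the log density is 0.
        return 0.0
    log_terms = [
        np.log(pi_k) + gaussian_logpdf(x[obs], mu_k[obs], S_k[np.ix_(obs, obs)])
        for pi_k, mu_k, S_k in zip(pi_g, mus, Sigmas_reg)
        if pi_k > 0
    ]
    # Numerically stable log-sum-exp over the mixture components
    return float(np.logaddexp.reduce(np.array(log_terms)))

# Toy values (hypothetical): two distributions, three variables.
mus = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
Sigmas = np.stack([np.eye(3), 2.0 * np.eye(3)])
T = np.eye(3)  # assumed regularization target
Sigmas_reg = np.stack([regularize(S, T, rho=0.1) for S in Sigmas])

x = np.array([0.2, 8.0, -0.1])  # second cell looks outlying
pi_g = np.array([0.7, 0.3])
ll_all = observed_mixture_loglik(x, np.array([1, 1, 1]), pi_g, mus, Sigmas_reg)
ll_flagged = observed_mixture_loglik(x, np.array([1, 0, 1]), pi_g, mus, Sigmas_reg)
print(ll_all, ll_flagged)
```

Flagging the suspicious second cell raises the log density of the observation considerably, which is the effect the penalty constants $q_{g,ij}$ have to counterbalance so that not too many cells are flagged.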

Theoretical results include the breakdown point for an adaptation of the ideal setting described in Hennig [2004] and Cuesta-Albertos et al. [2008] to the paradigm of cellwise outliers for multi-group data. The proposed methodology, implemented via an EM-type algorithm, provides good simulation results compared to the cellMCD [Raymaekers and Rousseeuw, 2023] and other methods in settings of low- as well as high-dimensional grouped data, and its potential is illustrated on Austrian weather data.

Acknowledgements: Co-funded by the European Union (SEMACRET, Grant Agreement no. 101057741) and UKRI (UK Research and Innovation). Ines Wilms is supported by a grant from the Dutch Research Council (NWO), research program Vidi under the grant number VI.Vidi.211.032.

References

  • Cuesta-Albertos et al. [2008] Juan Cuesta-Albertos, Carlos Matrán, and Agustín Mayo-Íscar. Robust estimation in the normal mixture model based on robust clustering. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(4):779–802, 2008.
  • Hennig [2004] Christian Hennig. Breakdown points for maximum likelihood estimators of location–scale mixtures. The Annals of Statistics, 32(4):1313–1340, 2004.
  • Raymaekers and Rousseeuw [2023] Jakob Raymaekers and Peter J Rousseeuw. The cellwise minimum covariance determinant estimator. Journal of the American Statistical Association, pages 1–12, 2023.
  • Zaccaria et al. [2024] Giorgia Zaccaria, Luis A García-Escudero, Francesca Greselin, and Agustín Mayo-Íscar. Cellwise outlier detection in heterogeneous populations. arXiv preprint arXiv:2409.07881, 2024.