Cellwise robust Gaussian mixture model for multi-group data with label noise

P. Puchhammer$^{a}$, I. Wilms$^{b}$ and P. Filzmoser$^{a}$

$^{a}$Technische Universität Wien, $^{b}$Maastricht University

Do expert-defined or diagnostically-labeled data groups align with clusters inferred through statistical modeling? If not, where do discrepancies between predefined labels and model-based groupings occur and why? We introduce the multi-group Gaussian mixture model (MG-GMM), the first model developed to investigate these questions. It incorporates prior group information while allowing flexibility to reassign observations to alternative groups based on data-driven evidence.

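To this end, we model the data with Gaussian mixture models. Let $\boldsymbol{X}_1, \boldsymbol{X}_2, \ldots, \boldsymbol{X}_N$ be given data sets from $N$ groups, where $\boldsymbol{X}_g = (\boldsymbol{x}_{g,1}, \ldots, \boldsymbol{x}_{g,n_g})^\top \in \mathbb{R}^{n_g \times p}$ consists of $n_g$ independent observations, for $g = 1, \ldots, N$, of the same $p$ variables. Assume that each observation $\boldsymbol{x}_{g,i}$ from group $g$ originates from a Gaussian mixture

$$\boldsymbol{x}_{g,i} \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \quad \text{with probability } \pi_{g,k} \geq 0$$

for $k = 1, \ldots, N$, where $\varphi(\boldsymbol{x}_{g,i}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ denotes the multivariate normal density of $\boldsymbol{x}_{g,i}$. Since each individual group is assumed to be coherent, we assume $\pi_{g,g} \geq 0.5$. Thus each group $g$ has a main distribution $\mathcal{N}(\boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g)$. However, the flexibility of the mixture model still allows data-driven reassignment of observations that are outlying in their original group.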
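To make the reassignment idea concrete, the following minimal sketch computes, for a single observation labeled as group g, the posterior probability of each mixture component via Bayes' rule with the group-specific mixing weights. It is only an illustration of the plain mixture model stated above, not the cellwise robust EM-type estimator: the parameters are assumed known, groups are indexed from 0, and the function names and the toy values for the mixing weights, means, and covariances are hypothetical.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log-density of the multivariate normal N(mean, cov) evaluated at x."""
    p = mean.shape[0]
    chol = np.linalg.cholesky(cov)            # cov = chol @ chol.T
    z = np.linalg.solve(chol, x - mean)       # whitened residual
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (p * np.log(2.0 * np.pi) + log_det + z @ z)


def responsibilities(x, g, pis, means, covs):
    """Posterior probability that x, labeled as group g, stems from component k.

    pis[g, k] plays the role of the mixing weight pi_{g,k}; each row sums to 1.
    """
    log_w = np.array([
        np.log(pis[g, k]) + gaussian_logpdf(x, means[k], covs[k])
        if pis[g, k] > 0 else -np.inf
        for k in range(len(means))
    ])
    log_w -= log_w.max()                      # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()


# Hypothetical toy setting: N = 2 groups, p = 2 variables.
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2), np.eye(2)]
pis = np.array([[0.8, 0.2],                   # diagonal entries are >= 0.5
                [0.1, 0.9]])

# Both observations carry the label of group 0.
r_typical = responsibilities(np.array([0.1, -0.2]), 0, pis, means, covs)
r_shifted = responsibilities(np.array([5.0, 4.8]), 0, pis, means, covs)

print("typical observation:", r_typical)
print("shifted observation:", r_shifted)
```

In this toy setting, the observation near the mean of group 0 keeps its label, whereas the observation lying close to the mean of group 1 receives almost all of its posterior mass from component 1 despite the larger prior weight on its labeled group.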

Moreover, through a penalized likelihood approach, our model is robust against cellwise outliers that may obscure or distort the underlying group structure. The proposed methodology, implemented via an EM-type algorithm, performs well in simulations, and its potential is illustrated on wine data.

Acknowledgements: Co-funded by the European Union (SEMACRET, Grant Agreement no. 101057741) and UKRI (UK Research and Innovation). Ines Wilms is supported by a grant from the Dutch Research Council (NWO), research program Vidi under the grant number VI.Vidi.211.032.

Keywords: Gaussian mixture models, cellwise outliers, labeled data