A unified approach to outlier identification for mixed type data
-
Department of Statistical Sciences “Paolo Fortunati”, University of Bologna, Italy [christian.hennig@unibo.it]
-
Department of Mathematics, Imperial College London [efthymios.costa17@imperial.ac.uk]
Keywords: MCD – latent variables – regularisation
We present an approach for identifying outliers in data with continuous as well as categorical variables. Ordinal variables can be integrated in a straightforward manner but are not treated here.
Providing a unified definition of outliers for mixed type data is hard, because categorical variables imply a different geometry from continuous variables, and Euclidean intuition may not carry over to them.
Our approach is based on robust Mahalanobis distances and the outlier concept introduced in Becker and Gather [1999], where outliers are defined as observations that lie in low probability regions relative to a multivariate Gaussian distribution. The robust Mahalanobis distances are computed from the fast minimum covariance determinant (MCD) estimator (Rousseeuw and Van Driessen [1999]).
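To make the idea of robust Mahalanobis distances concrete, here is a minimal sketch in two dimensions. It is not the fast MCD algorithm: for a toy data set it simply enumerates all subsets of size h and keeps the one whose covariance matrix has the smallest determinant. The data, the subset size h = 6, the 0.975 level, and all function names are assumptions made for illustration, and no consistency or small-sample correction is applied.

```python
from itertools import combinations
from math import log
from statistics import fmean


def mean_cov_2d(points):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = fmean(p[0] for p in points)
    my = fmean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def exact_mcd_2d(points, h):
    """Brute-force MCD: the h-subset with the smallest covariance determinant."""
    best = None
    for subset in combinations(points, h):
        centre, (sxx, sxy, syy) = mean_cov_2d(subset)
        det = sxx * syy - sxy ** 2
        if best is None or det < best[0]:
            best = (det, centre, (sxx, sxy, syy))
    return best[1], best[2]


def squared_mahalanobis(point, centre, cov):
    """Squared Mahalanobis distance using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det


# Toy data (made up): seven points near the line y = x, one far away from it.
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8),
        (4.0, 4.1), (5.0, 5.2), (1.5, 1.0), (2.0, 9.0)]

centre, cov = exact_mcd_2d(data, h=6)
d2 = [squared_mahalanobis(p, centre, cov) for p in data]

# With two dimensions the chi-square quantile is available in closed form:
# q = -2 * log(1 - level). Points beyond it fall in a low probability region.
cutoff = -2.0 * log(1.0 - 0.975)
flagged = [i for i, d in enumerate(d2) if d > cutoff]
```

Because the raw MCD covariance is not rescaled here, the cutoff is only indicative; in practice a library implementation of fast MCD with its correction factors would be the natural choice.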
In order to unify the contribution of continuous and categorical variables to the robust Mahalanobis distance and the definition of outliers, categorical variables are modelled as collections of dummy variables obtained by thresholding a latent multivariate Gaussian distribution. Unlike other latent variable approaches, we do not assume that the categorical variables are generated from a latent Gaussian space with strongly reduced dimensionality. Instead we assume that the latent space has a Gaussian dimension for every dummy variable.
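The thresholding idea can be sketched for a single dummy variable as follows. The convention that the dummy equals 1 when the latent value exceeds the threshold, the target proportion of 0.2, and the function names are assumptions chosen for illustration only.

```python
import random
from statistics import NormalDist


def threshold_for_proportion(p):
    """Cut-point tau such that P(Z > tau) = p for a standard Gaussian Z."""
    return NormalDist().inv_cdf(1.0 - p)


def dummy_from_latent(z, tau):
    """Dummy variable obtained by thresholding one latent Gaussian value."""
    return 1 if z > tau else 0


rng = random.Random(0)  # fixed seed so the sketch is reproducible
tau = threshold_for_proportion(0.2)
latent = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
dummies = [dummy_from_latent(z, tau) for z in latent]

# The share of ones should be close to the target proportion 0.2.
share = sum(dummies) / len(dummies)
```

Read in the other direction, an observed category proportion determines the threshold on the latent scale, which is the starting point for estimating latent correlations.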
We use polychoric and polyserial correlation (Drasgow [1986]) to estimate the covariance matrix of the underlying multivariate Gaussian distribution.
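For two binary variables the polychoric correlation reduces to the tetrachoric correlation, which can be sketched with the standard library alone. The sketch below fixes the thresholds from the observed margins and then solves for the latent correlation that reproduces the observed joint proportion, using Simpson's rule and bisection. The function names, the integration settings, and the example counts are assumptions; a table with an empty margin is not handled, and polyserial correlation is not covered.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def upper_orthant(a, b, rho, steps=400, upper=8.0):
    """P(Z1 > a, Z2 > b) for a standard bivariate Gaussian with correlation rho.

    Integrates phi(x) * P(Z2 > b | Z1 = x) over x in [a, upper] by Simpson's rule.
    """
    if a >= upper:
        return 0.0
    scale = sqrt(1.0 - rho * rho)

    def integrand(x):
        return STD.pdf(x) * STD.cdf((rho * x - b) / scale)

    width = (upper - a) / steps
    total = integrand(a) + integrand(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * integrand(a + i * width)
    return total * width / 3.0


def tetrachoric(n11, n10, n01, n00):
    """Latent correlation for a 2x2 table; n11 counts rows where both dummies are 1."""
    n = n11 + n10 + n01 + n00
    a = STD.inv_cdf(1.0 - (n11 + n10) / n)  # threshold of the first variable
    b = STD.inv_cdf(1.0 - (n11 + n01) / n)  # threshold of the second variable
    target = n11 / n
    lo, hi = -0.999, 0.999
    for _ in range(60):  # the orthant probability increases with rho
        mid = 0.5 * (lo + hi)
        if upper_orthant(a, b, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


r_dependent = tetrachoric(200, 100, 100, 200)
r_independent = tetrachoric(100, 100, 100, 100)
```

With both thresholds at zero the joint probability is 1/4 + arcsin(rho) / (2 pi), so the first table (joint proportion 1/3) corresponds to a latent correlation of 0.5 and the second to 0.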
There are, however, issues with singularity. The k dummy variables encoding a categorical variable with k categories have only k - 1 degrees of freedom, because they always sum to one. Furthermore, the robust covariance matrix should be computed based on an outlier-free subset of the data. This will often imply that certain categories of a categorical variable are not represented at all in the subset. It also implies that breakdown points of up to 1/2 are not desirable, because just leaving out a category from the “outlier free” subset produces a dummy variable that takes only a single value and causes a zero determinant of the covariance matrix. These problems will be treated by regularisation of the covariance matrix.
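Both sources of singularity can be seen in a small numerical sketch. The source does not specify the form of the regularisation, so the shrinkage towards the identity matrix used here, the weight 0.1, the labels, and the function names are purely illustrative assumptions.

```python
def one_hot(labels, levels):
    """Dummy coding: one 0/1 column per level, exactly one 1 per row."""
    return [[1.0 if lab == lev else 0.0 for lev in levels] for lab in labels]


def covariance(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def shrink_to_identity(cov, lam):
    """(1 - lam) * cov + lam * I, an illustrative ridge-type regularisation."""
    p = len(cov)
    return [[(1.0 - lam) * cov[i][j] + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]


levels = ["a", "b", "c"]
labels = ["a", "b", "c", "a", "b", "a", "c", "b"]  # made-up data

# 1) The dummies of one variable sum to one, so their covariance is singular.
cov_full = covariance(one_hot(labels, levels))
det_full = det3(cov_full)

# 2) A subset that happens to contain no "c" gives a constant dummy column.
subset = [lab for lab in labels if lab != "c"]
cov_subset = covariance(one_hot(subset, levels))

# Regularisation restores a positive determinant in both cases.
det_reg_full = det3(shrink_to_identity(cov_full, 0.1))
det_reg_subset = det3(shrink_to_identity(cov_subset, 0.1))
```

Shrinking towards the identity rather than towards the diagonal of the covariance matrix is a deliberate choice in this sketch: a category missing from the subset has zero variance, so a diagonal target would leave the matrix singular.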
Furthermore, the multivariate Gaussian distribution cannot fully capture the dependence structure between dummy variables, because a variable that takes one out of more than two categories implies a dependence of higher order than bivariate between its dummy variables. We will propose a generative model based on a multivariate latent Gaussian distribution (relative to which outliers are defined) that generates admissible collections of dummy variables, i.e., collections encoding one and only one category of a categorical variable.
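The admissibility problem can be illustrated by simulation. Thresholding each latent Gaussian dimension separately often produces rows with no category or with several categories switched on. The argmax rule shown next to it is only a hypothetical stand-in that always yields exactly one category; it is not claimed to be the generative model proposed here. Independent latent dimensions, three categories, and the function names are further assumptions of the sketch.

```python
import random
from statistics import NormalDist

K = 3  # number of categories (assumed for illustration)
TAU = NormalDist().inv_cdf(1.0 - 1.0 / K)  # each dummy is 1 with probability 1/K


def threshold_rule(z):
    """Threshold every latent dimension separately (may be inadmissible)."""
    return [1 if zj > TAU else 0 for zj in z]


def argmax_rule(z):
    """Hypothetical alternative: switch on only the largest latent dimension."""
    winner = max(range(len(z)), key=lambda j: z[j])
    return [1 if j == winner else 0 for j in range(len(z))]


def admissible(row):
    """A dummy collection is admissible if it encodes exactly one category."""
    return sum(row) == 1


rng = random.Random(1)  # fixed seed so the sketch is reproducible
latent = [[rng.gauss(0.0, 1.0) for _ in range(K)] for _ in range(1000)]

n_bad_threshold = sum(not admissible(threshold_rule(z)) for z in latent)
n_bad_argmax = sum(not admissible(argmax_rule(z)) for z in latent)
```

Under these assumptions a separately thresholded row is admissible with probability 3 * (1/3) * (2/3)^2 = 4/9, so more than half of the simulated rows are expected to be inadmissible, whereas the argmax rule produces none.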
References
- Becker and Gather [1999] C. Becker and U. Gather. The masking breakdown point of multivariate outlier identification rules. Journal of the American Statistical Association, 94:947–955, 1999.
- Drasgow [1986] F. Drasgow. Polychoric and polyserial correlations. In S. Kotz and N. Johnson, editors, The Encyclopedia of Statistics, volume 7, pages 68–74. Wiley, New York, 1986.
- Rousseeuw and Van Driessen [1999] P. Rousseeuw and K. Van Driessen. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41:212–223, 1999.