Outlier detection under the multivariate Student-t distribution

L. Barabesi1, A. Cerioli2, L.A. Garcia-Escudero3, A. Mayo-Iscar4, D. Perrotta5 and F. Torti6
  • 1 Department of Economics and Statistics, University of Siena, Siena, Italy [lucio.barabesi@unisi.it]
  • 2 Department of Economics, University of Parma, Parma, Italy [andrea.cerioli@unipr.it]
  • 3 Department of Statistics, University of Valladolid, Valladolid, Spain [lagarcia@uva.es]
  • 4 Department of Statistics, University of Valladolid, Valladolid, Spain [agustin.mayo.iscar@uva.es]
  • 5 European Commission, Joint Research Centre, Ispra, Italy [domenico.perrotta@ec.europa.eu]
  • 6 European Commission, Joint Research Centre, Ispra, Italy [francesca.torti@ec.europa.eu]

Keywords: Contamination rate – Generalized radius process – Multivariate Student-t distribution – Outlier detection – Robust distance

Abstract

It is well known that trimmed estimators of multivariate scatter, such as the Minimum Covariance Determinant (MCD) estimator, are not consistent unless an appropriate scaling factor is applied to them (Hubert et al. [2008]). This factor is widely recommended and applied when uncontaminated data are assumed to come from a multivariate normal model, while difficulties arise under more complex models (see, e.g., Schreurs et al. [2008]). Barabesi et al. [2023] address the problem in a heavy-tailed scenario, where uncontaminated data come from a multivariate Student-t distribution. Specifically, they derive a simple computational formula for the consistency factor of the MCD estimator and show that it reduces to an even simpler analytic expression in the bivariate case. They also develop a robust Monte Carlo procedure for estimating the usually unknown number of degrees of freedom of the assumed (and possibly contaminated) multivariate Student-t model, which is a necessary ingredient for obtaining the required factor.
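To make the role of the consistency factor concrete, the following minimal sketch approximates it by simulation. It is not the computational formula of Barabesi et al. [2023]. It assumes that the clean data follow a Student-t model with identity scatter, that the target of estimation is the scatter matrix (not the covariance matrix), and that trimming keeps the fraction alpha of observations with the smallest squared radii. The function name mc_consistency_factor and all of its parameters are hypothetical.

```python
import numpy as np

def mc_consistency_factor(p, nu, alpha, n_draws=200_000, seed=0):
    """Monte Carlo approximation of the consistency factor of an
    alpha-trimmed scatter estimate under a p-variate Student-t model.

    Assumptions (not taken from the source): identity scatter, the scatter
    matrix as target, and population-level trimming on the squared radius.
    """
    rng = np.random.default_rng(seed)
    # Squared radius of a t_nu vector with identity scatter:
    # chi2_p / (chi2_nu / nu), which is p times an F(p, nu) variable.
    r2 = rng.chisquare(p, n_draws) / (rng.chisquare(nu, n_draws) / nu)
    # Keep the fraction alpha of draws with the smallest squared radii.
    q = np.quantile(r2, alpha)
    kept = r2[r2 <= q]
    # By spherical symmetry the trimmed scatter is (mean(kept) / p) * I,
    # so the factor restoring the identity is p / mean(kept).
    return p / kept.mean()

if __name__ == "__main__":
    # Heavy-tailed case versus a near-normal case (very large nu).
    print(mc_consistency_factor(p=2, nu=5, alpha=0.5))
    print(mc_consistency_factor(p=2, nu=1e6, alpha=0.5))
```

For very large nu the output should approach the familiar normal-theory factor, alpha divided by the chi-square probability P(χ²_{p+2} ≤ χ²_{p,alpha}), which gives a quick sanity check on the simulation.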

In this work we advance the results of Barabesi et al. [2023] by building on the generalized radius process of Garcia-Escudero and Gordaliza [2005]. Our first purpose is the development of a semi-automatic algorithm for jointly inferring both the number of degrees of freedom of the multivariate Student-t model and the unknown contamination rate. Based on the computed estimates and on the asymptotic null distribution of the robust MCD-distances under the Student-t model, we then obtain a simultaneous rule for identifying multivariate outliers through Monte Carlo estimation and nonparametric interpolation of the tail quantiles of the corresponding generalized radius process.
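The idea of a simultaneous rule calibrated by Monte Carlo can be illustrated with a deliberately simplified sketch. It does not implement the generalized radius process or the MCD: it assumes that location, scatter, and degrees of freedom are known, so that each squared distance is distributed as p times an F(p, nu) variable, and it simply estimates the (1 - gamma) quantile of the largest of n such distances. The function name mc_simultaneous_cutoff and its parameters are hypothetical.

```python
import numpy as np

def mc_simultaneous_cutoff(n, p, nu, gamma, n_rep=4000, seed=0):
    """Monte Carlo cutoff for the largest of n squared distances under a
    clean p-variate Student-t model with nu degrees of freedom.

    Simplifying assumption (not from the source): distances are computed
    with the true location and scatter, so each one equals p * F(p, nu).
    """
    rng = np.random.default_rng(seed)
    # One row per simulated clean sample of size n.
    d2 = p * rng.f(p, nu, size=(n_rep, n))
    # An observation is flagged if its squared distance exceeds the
    # (1 - gamma) quantile of the sample maximum, which keeps the chance
    # of any false alarm in a clean sample close to gamma.
    return np.quantile(d2.max(axis=1), 1 - gamma)

if __name__ == "__main__":
    print(mc_simultaneous_cutoff(n=200, p=2, nu=5, gamma=0.1))
```

In practice the parameters would be replaced by robust estimates, and the distribution of the resulting distances would differ from the idealized one used here, which is why the simulation step matters.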

References

  • Barabesi et al. [2023] L. Barabesi, A. Cerioli, L.A. Garcia-Escudero, and A. Mayo-Iscar. Consistency factor for the MCD estimator at the Student-t distribution. Statistics and Computing, 33(132), 2023.
  • Garcia-Escudero and Gordaliza [2005] L.A. Garcia-Escudero and A. Gordaliza. Generalized radius processes for elliptically contoured distributions. Journal of the American Statistical Association, 100:1036–1045, 2005.
  • Hubert et al. [2008] M. Hubert, P.J. Rousseeuw, and S. Van Aelst. High-breakdown robust multivariate methods. Statistical Science, 23:92–119, 2008.
  • Hubert et al. [2018] M. Hubert, M. Debruyne, and P. Rousseeuw. Minimum covariance determinant and extensions. WIREs Computational Statistics, 10(e1421), 2018.
  • Schreurs et al. [2008] J. Schreurs, I. Vranckx, M. Hubert, J. Suykens, and P. Rousseeuw. Outlier detection in non-elliptical data by kernel MRCD. Statistics and Computing, 31(66), 2008.