Some insights into depth estimators for location and scatter in the multivariate setting

Jorge Adrover¹, Marcelo Ruiz²

  • 1 Facultad de Matemática, Astronomía, Física y Computación, Universidad Nacional de Córdoba, CIEM and CONICET, Córdoba, Argentina [jorge.adrover@unc.edu.ar]
  • 2 Facultad de Ciencias Exactas, Físico-Químicas y Naturales, Universidad Nacional de Río Cuarto, Río Cuarto, Argentina [mruiz@exa.unrc.edu.ar]

Keywords: statistical depth – multivariate location and scatter – maximum bias – breakdown point

Abstract

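Tukey [1975] introduced the concept of depth for $p$-dimensional observations in order to obtain a multivariate analog of the median. More precisely, if $\mathbf{x} \sim P$ and $\boldsymbol{\theta} \in \mathbb{R}^p$, the depth of $\boldsymbol{\theta}$ is defined to be

$$\mathcal{D}_T(\boldsymbol{\theta}, P) = \inf_{\|\boldsymbol{\lambda}\| = 1,\ \boldsymbol{\lambda} \in \mathbb{R}^p} P\left(\boldsymbol{\lambda}^t \mathbf{x} \le \boldsymbol{\lambda}^t \boldsymbol{\theta}\right)$$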

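and the Tukey median is taken to be $\hat{\boldsymbol{\theta}}_T = \arg\max_{\boldsymbol{\theta} \in \mathbb{R}^p} \mathcal{D}_T(\boldsymbol{\theta}, P)$. Chen et al. [2018] and Paindaveine and Van Bever [2018] also dealt with the concept of depth for multivariate scatter. Given an initial robust multivariate location estimator $\hat{\mathbf{v}} \in \mathbb{R}^p$, Paindaveine and Van Bever [2018] defined the depth of a symmetric positive definite matrix $\Gamma \in \mathbb{R}^{p \times p}$ as

$$D_{CL}(\Gamma, P) = \inf_{\mathbf{u} \in \mathcal{S}^{p-1}} \min\left\{ P\left(|\mathbf{u}^t (X - \hat{\mathbf{v}})|^2 \le \mathbf{u}^t \Gamma \mathbf{u}\right),\ P\left(|\mathbf{u}^t (X - \hat{\mathbf{v}})|^2 \ge \mathbf{u}^t \Gamma \mathbf{u}\right) \right\} \tag{1}$$

and the deepest estimator as $\hat{\Gamma}_L = \arg\max_{\Gamma \succ \mathbf{0}} D_{CL}(\Gamma, P)$. Chen et al. [2018] and Paindaveine and Van Bever [2018] agree on the definition when the location is known, since $\hat{\mathbf{v}}$ then need not be included in (1). If $p = 1$, a joint location-scale depth can be derived from either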

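$$D_{LS1}(\mu, \sigma, P) = \min\left\{ \inf_{\lambda} P\left(\left[\,|y - \mu| \le |y - \lambda|\,\right]\right),\ \inf_{\gamma > 0} P\left(\left[\,\left|\left|\frac{y - \mu}{\sigma}\right| - 1\right| \le \left|\left|\frac{y - \mu}{\gamma}\right| - 1\right|\,\right]\right) \right\}$$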

or

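$$D_{LS2}(\mu, \sigma, P) = \inf_{\gamma, \lambda} P\left(\left[\,|y - \mu| \le |y - \lambda|\,\right] \cap \left[\,\left|\left|\frac{y - \mu}{\sigma}\right| - 1\right| \le \left|\left|\frac{y - \mu}{\gamma}\right| - 1\right|\,\right]\right),$$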

yielding the median as the deepest location estimator and the median absolute deviation around the median as the deepest scale estimator in the case of the depth $D_{LS1}$.
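
For empirical data, the infimum over directions in the halfspace depth and in the scatter depth (1) can be approximated by a minimum over randomly drawn unit vectors. The following Python sketch illustrates this idea; it is not the procedure used by the cited authors. The function names, the number of directions `n_dir`, and the use of the coordinatewise median as location estimate are assumptions made for illustration, and a random search can only overestimate the exact empirical depth.

```python
import math
import random
import statistics


def unit_vector(p, rng):
    """Draw a direction uniformly on the unit sphere S^{p-1}."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(p)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tukey_depth(theta, data, n_dir=500, seed=0):
    """Approximate empirical halfspace depth of theta.

    The infimum over all unit directions is replaced by a minimum over
    n_dir random directions u (and their opposites -u), so the result is
    an upper bound on the exact empirical depth.
    """
    rng = random.Random(seed)
    n = len(data)
    depth = 1.0
    for _ in range(n_dir):
        u = unit_vector(len(theta), rng)
        t = dot(u, theta)
        proj = [dot(u, x) for x in data]
        below = sum(1 for v in proj if v <= t) / n  # direction u
        above = sum(1 for v in proj if v >= t) / n  # direction -u
        depth = min(depth, below, above)
    return depth


def scatter_depth(gamma, data, center, n_dir=500, seed=0):
    """Approximate empirical version of the scatter depth in (1).

    gamma is a p x p matrix given as a list of rows; center plays the
    role of the location estimate.
    """
    rng = random.Random(seed)
    n = len(data)
    depth = 1.0
    for _ in range(n_dir):
        u = unit_vector(len(center), rng)
        q = dot(u, [dot(row, u) for row in gamma])  # u^t Gamma u
        sq = [dot(u, [xi - ci for xi, ci in zip(x, center)]) ** 2 for x in data]
        inside = sum(1 for s in sq if s <= q) / n
        outside = sum(1 for s in sq if s >= q) / n
        depth = min(depth, inside, outside)
    return depth


# Illustration on simulated bivariate standard normal data (hypothetical setup).
rng = random.Random(1)
data = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(400)]
center = [statistics.median(col) for col in zip(*data)]  # assumed location estimate

depth_center = tukey_depth(center, data)
depth_far = tukey_depth([10.0, 10.0], data)

# For N(0, I), P(|u^t X|^2 <= q) = 1/2 at q = (Phi^{-1}(3/4))^2.
q = statistics.NormalDist().inv_cdf(0.75) ** 2
depth_mid = scatter_depth([[q, 0.0], [0.0, q]], data, center)
depth_huge = scatter_depth([[100.0, 0.0], [0.0, 100.0]], data, center)
```

In this illustration a central point and a suitably scaled matrix receive depths close to 1/2, whereas a point far from the data cloud and a grossly inflated matrix receive depth 0.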

Chen et al. [2018] introduced a unified way to study the statistical convergence rate and robustness jointly. Given $\delta \in (0, 1/2)$, let $\alpha = 1 - 2\delta$ and $\varepsilon \in [0, 1/2)$. Let $\mathcal{P}_\varepsilon(F_0)$ be the $\varepsilon$-contamination neighborhood of $F_0 = N(\boldsymbol{\theta}, \Sigma)$. Take $\mathcal{F}(M)$ as the set of symmetric positive definite matrices $\Sigma$ whose largest eigenvalue $\lambda_1(\Sigma)$ is less than a constant $M > 0$.

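Chen et al. [2018] showed that, for $\varepsilon \in [0, \varepsilon']$, $\varepsilon' < 1/3$ and $(p + \log(1/\delta))/n$ sufficiently small, there exists a constant $C > 0$ (depending on $\varepsilon'$ but independent of $p$, $n$, $\varepsilon$) such that

$$\inf_{\boldsymbol{\theta},\ \Sigma \in \mathcal{F}(M),\ P \in \mathcal{P}_\varepsilon(F_0)} P\left( \|\hat{\boldsymbol{\theta}}_T - \boldsymbol{\theta}\|^2 \le C \left( \max\left\{ \frac{p}{n}, \varepsilon^2 \right\} + \frac{\log(1/\delta)}{n} \right) \right) \ge \alpha \tag{2}$$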

The constant $C$ in (2) is in fact affected by the asymptotic maximum bias of the Tukey median. Chen and Tyler [2002] studied this maximum bias, $B_T(\varepsilon, \Sigma) = \sqrt{\lambda_1(\Sigma)}\, \Phi^{-1}\left( \frac{1 + \varepsilon}{2(1 - \varepsilon)} \right)$, over the $\varepsilon$-contamination neighborhood with $\varepsilon \in [0, 1/3)$.
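
The expression for $B_T$ is straightforward to evaluate numerically. The Python sketch below does so with the standard normal quantile function from the standard library. The function name and the argument `lambda_max` (standing for the largest eigenvalue $\lambda_1(\Sigma)$) are illustrative choices, and the sketch assumes that the bias scales with the square root of $\lambda_1(\Sigma)$, as the affine equivariance of the Tukey median suggests.

```python
import math
from statistics import NormalDist


def tukey_median_max_bias(eps, lambda_max=1.0):
    """Evaluate sqrt(lambda_max) * Phi^{-1}((1 + eps) / (2 (1 - eps))).

    Assumption: the bias is measured in Euclidean norm, hence the square
    root of the largest eigenvalue lambda_max of Sigma.
    """
    if not 0.0 <= eps < 1.0 / 3.0:
        raise ValueError("the formula is stated for eps in [0, 1/3)")
    level = (1.0 + eps) / (2.0 * (1.0 - eps))
    return math.sqrt(lambda_max) * NormalDist().inv_cdf(level)


# Bias at the identity scatter for a few contamination levels.
bias_grid = {eps: tukey_median_max_bias(eps) for eps in (0.0, 0.05, 0.1, 0.2, 0.3)}
```

The bias is 0 at $\varepsilon = 0$ and grows without bound as $\varepsilon$ approaches 1/3, since the quantile level $(1 + \varepsilon)/(2(1 - \varepsilon))$ then tends to 1.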

However, the bound (2) can be derived in a more illuminating manner by explicitly incorporating the maximum bias, as the maximum bias governs the behavior of the estimator when the sample size is sufficiently large. Without enlarging the bound in (2), we obtain a more informative inequality,

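$$\inf_{\boldsymbol{\theta},\ \Sigma \in \mathcal{F}(M),\ P \in \mathcal{P}_\varepsilon(F_0)} P\left( \|\hat{\boldsymbol{\theta}}_T - \boldsymbol{\theta}\|^2 \le \tilde{C} \left( \max\left\{ \frac{p}{n}, B_T^2(\varepsilon, I) \right\} + \frac{\log(1/\delta)}{n} \right) \right) \ge \alpha.$$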
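
To see how the two inequalities differ, one can compare the bracketed terms, $\max\{p/n, \varepsilon^2\}$ and $\max\{p/n, B_T^2(\varepsilon, I)\}$, each with $\log(1/\delta)/n$ added. The Python sketch below does this for hypothetical values of $p$, $n$ and $\delta$. The constants $C$ and $\tilde{C}$ are not specified in the text and are deliberately left out, so the numbers compare only the shape of the two terms, not the bounds themselves.

```python
import math
from statistics import NormalDist


def bias_identity(eps):
    """B_T(eps, I) = Phi^{-1}((1 + eps) / (2 (1 - eps))), for eps in [0, 1/3)."""
    return NormalDist().inv_cdf((1.0 + eps) / (2.0 * (1.0 - eps)))


def rate_term(p, n, delta, contamination_term):
    """Bracketed term max{p/n, contamination_term} + log(1/delta)/n (no constant)."""
    return max(p / n, contamination_term) + math.log(1.0 / delta) / n


# Hypothetical setting: p, n and delta are chosen only for illustration.
p, n, delta = 10, 100, 0.05
rows = []
for eps in (0.05, 0.2, 0.3, 0.33):
    with_eps = rate_term(p, n, delta, eps ** 2)
    with_bias = rate_term(p, n, delta, bias_identity(eps) ** 2)
    rows.append((eps, with_eps, with_bias))

for eps, with_eps, with_bias in rows:
    print(f"eps={eps:.2f}  eps^2 version={with_eps:.4f}  bias version={with_bias:.4f}")
```

For small $\varepsilon$ both terms are dominated by $p/n$ and coincide; as $\varepsilon$ approaches 1/3 the term built on $\varepsilon^2$ stays bounded while the one built on $B_T^2(\varepsilon, I)$ diverges.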

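Chen et al. [2018] also obtained an analogous bound for the deepest dispersion estimator in the case of known location. To this end, they considered the $\varepsilon$-contamination neighborhood with central model $N_p(\mathbf{0}, \Sigma)$. If $\beta > 0$ is such that $\Phi(\beta) = 3/4$, $\hat{\Sigma} = \hat{\Gamma}/\beta$ and $\|A\|_{op}$ denotes the norm of the matrix $A$ given by its largest singular value, Chen et al. [2018] showed that

$$\inf_{\Sigma \in \mathcal{F}(M),\ P \in \mathcal{P}_\varepsilon(F_0)} P\left( \|\hat{\Sigma} - \Sigma\|_{op}^2 \le C \left( \max\left\{ \frac{p}{n}, \varepsilon^2 \right\} + \frac{\log(1/\delta)}{n} \right) \right) \ge \alpha. \tag{3}$$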

Similarly to the analysis given for the Tukey median, we can obtain a more accurate inequality,

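$$\inf_{\Sigma \in \mathcal{F}(M),\ P \in \mathcal{P}_\varepsilon(F_0)} P\left( \|\hat{\Sigma} - \Sigma\|_{op}^2 \le C^* \left( \max\left\{ \frac{p}{n}, B_E^2(\varepsilon) \right\} + \frac{\log(1/\delta)}{n} \right) \right) \ge \alpha,$$

with $B_E(\varepsilon) = \left[ \frac{1}{\beta} \Phi^{-1}\left( \frac{3 - \varepsilon}{4(1 - \varepsilon)} \right) - 1 \right]$, without enlarging the original bound (3) and highlighting a possible maximum bias for the largest eigenvalue.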

For $\varepsilon = 1/3$, $B_E(1/3) = \infty$, which suggests that the asymptotic breakdown point is 1/3. Indeed, we prove that the asymptotic breakdown point is $\varepsilon^*(\hat{\Gamma}) = 1/3$.
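
The divergence of $B_E$ at $\varepsilon = 1/3$ can be checked numerically. The Python sketch below evaluates the expression for $B_E(\varepsilon)$ as stated above, with $\beta$ obtained from $\Phi(\beta) = 3/4$; the function name and the grid of contamination levels are illustrative assumptions.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()
BETA = STD_NORMAL.inv_cdf(0.75)  # beta solving Phi(beta) = 3/4


def scatter_max_bias(eps):
    """Evaluate (1 / beta) * Phi^{-1}((3 - eps) / (4 (1 - eps))) - 1."""
    if not 0.0 <= eps < 1.0 / 3.0:
        raise ValueError("the expression is finite only for eps in [0, 1/3)")
    level = (3.0 - eps) / (4.0 * (1.0 - eps))
    if level >= 1.0:  # guard against rounding when eps is extremely close to 1/3
        return math.inf
    return STD_NORMAL.inv_cdf(level) / BETA - 1.0


# The quantile level tends to 1 as eps approaches 1/3, so the value keeps growing.
grid = (0.0, 0.1, 0.2, 0.3, 0.33, 0.3333)
bias_values = [scatter_max_bias(eps) for eps in grid]
```

The value is 0 at $\varepsilon = 0$ and increases along the grid, in line with a breakdown point of 1/3.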

Even though the formulation given by Chen et al. [2018] seems promising for constructing robust confidence regions, simulations show that the bounds are too large to yield regions of reasonable size.

References

  • Chen et al. [2018] M. Chen, C. Gao, and Z. Ren. Robust covariance and scatter matrix estimation under Huber’s contamination model. The Annals of Statistics, 46(5):1932–1960, 2018.
  • Chen and Tyler [2002] Z. Chen and D. E. Tyler. The influence function and maximum bias of Tukey’s median. The Annals of Statistics, 30(6):1737–1759, 2002.
  • Paindaveine and Van Bever [2018] D. Paindaveine and G. Van Bever. Halfspace depths for scatter, concentration and shape matrices. The Annals of Statistics, 46(6B):3276–3307, 2018.
  • Tukey [1975] J. W. Tukey. Mathematics and the picturing of data. In R. James, editor, Proceedings of the International Congress of Mathematicians, volume 2, pages 523–531, Vancouver, 1975.