Robust Sure Independence Screening for Ultra-High Dimensional Mixed Models

A. Ghosh1 and M. Thoresen2
  • 1 Indian Statistical Institute, Kolkata, India [abhik.ghosh@isical.ac.in]
  • 2 University of Oslo, Oslo, Norway [magne.thoresen@medisin.uio.no]

Keywords: Robustness – Density power divergence – M-estimator – Sure screening property

Recent technological advances have produced many ultra-high dimensional datasets in which the number of features is extremely large compared to the sample size (more precisely, it grows at an exponential rate in the sample size). To analyze such data, we first need an initial feature screening step that reduces the pool of all available variables to a manageable size for further detailed investigation. Such a procedure must have the sure screening property, which ensures that, with probability tending to one, no important variables (i.e., those truly related to the target response) are lost during the initial screening. The idea of sure independence screening (SIS) was originally developed by Fan and Lv (2008) for linear regression models and subsequently extended to generalized linear models by Fan and Song (2010). However, as in the low-dimensional case, SIS procedures based on the Pearson correlation or on maximum likelihood estimates of the marginal regression coefficients are severely non-robust under data contamination. Since hidden noise and other types of contamination (e.g., outliers) are quite common in ultra-high dimensional data, particularly in the medical sciences, a robust SIS procedure is of high practical importance. In fact, a robust SIS was already proposed in the discussion published along with the original paper by Fan and Lv; subsequently, many parametric and non-parametric robust SIS methods have been developed and studied in detail for various linear and generalized linear regression models.
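For readers unfamiliar with the classical (non-robust) procedure, the following is a minimal sketch of correlation-based SIS for a linear model: features are ranked by the absolute marginal Pearson correlation with the response and the top d are kept. The function name `sis_screen`, the default d = floor(n / log n), and the simulated data are assumptions made for illustration only; this is not the robust procedure proposed in this work.

```python
import numpy as np

def sis_screen(X, y, d=None):
    """Classical SIS sketch: keep the d features with the largest
    absolute marginal Pearson correlation with the response y."""
    n, p = X.shape
    if d is None:
        d = int(n / np.log(n))  # assumed default size of the screened set
    Xc = X - X.mean(axis=0)     # center each feature column
    yc = y - y.mean()           # center the response
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = (Xc.T @ yc) / denom  # marginal correlations, one per feature
    order = np.argsort(-np.abs(corr))
    return np.sort(order[:d])   # indices of the screened features

# Hypothetical simulated data: 3 important features out of p = 1000.
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 3.0
y = X @ beta + rng.standard_normal(n)

selected = sis_screen(X, y)
```

Because the ranking relies on sample means and cross-products, a few gross outliers in `y` or in a column of `X` can change the ordering substantially, which is the lack of robustness discussed above.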

However, in many situations involving longitudinal or repeated-measurement data and clustered studies, we need to include a few suitable random effects along with a set of candidate fixed effects. This leads to the so-called mixed effects models (or simply mixed models) with ultra-high dimensional fixed effects variables and some pre-specified random effects; our objective is then to screen for the most important fixed effects variables so as to reduce the model to a computationally tractable size for further analysis. There have been some recent attempts to develop SIS for such ultra-high dimensional mixed models, but all of them use maximum likelihood estimates as their building block and are hence highly non-robust against data contamination. To the best of the authors' knowledge, no work in the literature so far discusses a robust SIS procedure for such mixed models.
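As a point of reference, one common way to write a linear mixed model is given below. The notation and the Gaussian assumptions are illustrative and are not taken from this abstract: y_i is the response vector of the i-th cluster or subject, X_i its fixed effects design matrix with p columns (p ultra-high dimensional), Z_i its design matrix for the pre-specified random effects, and Ψ and σ² are variance parameters.

```latex
\[
  y_i = X_i \beta + Z_i b_i + \varepsilon_i, \qquad
  b_i \sim N(0, \Psi), \quad
  \varepsilon_i \sim N(0, \sigma^2 I_{n_i}), \qquad i = 1, \ldots, m .
\]
```

In this notation, screening the fixed effects amounts to identifying which components of β are nonzero.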

In this work, we fill this gap in the literature on ultra-high dimensional mixed models by developing a suitable parametric robust SIS procedure for screening fixed effects variables. We prove the sure screening property of the proposed procedure by first showing that the robust estimators used in its construction are uniformly consistent with an exponential rate of convergence, and subsequently proving that none of the important variables is excluded from the set of screened features. The claimed robustness of our proposal under possible data contamination is supported by extensive empirical illustrations. Finally, we also give some indications of how the proposed robust SIS can be extended to conditional screening of fixed effects in such ultra-high dimensional mixed effects models.
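In symbols, the sure screening property is usually stated as follows. The notation is assumed here for illustration: 𝓜* denotes the index set of the truly important fixed effects variables and 𝓜̂ the set retained by the screening procedure based on a sample of size n.

```latex
\[
  P\bigl(\mathcal{M}_{*} \subseteq \widehat{\mathcal{M}}\bigr) \longrightarrow 1
  \quad \text{as } n \to \infty .
\]
```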

References

  • Fan and Lv [2008] J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(5):849–911, 2008.
  • Fan and Song [2010] J. Fan and R. Song. Sure independence screening in generalized linear models with NP-dimensionality. Annals of Statistics, 38(6):3567–3604, 2010.