Robust dimension reduction
- Department of Mathematics, Universidad del Litoral, Santa Fe, Argentina [anbergesio@gmail.com]
- Department of Mathematics, Universidad de Buenos Aires, Buenos Aires, Argentina [meszre@dm.uba.ar]
- Department of Mathematics, Universidad de Buenos Aires, Buenos Aires, Argentina and CONICET, Argentina [victoryohai@gmail.com]
Keywords: sufficient reduction, principal fitted components, reduced rank regression
Abstract
Non-parametric regression is a very flexible procedure to establish the relationship between a response variable and several covariables. Non-parametric regression techniques fit a model of the form $Y = g(X) + \varepsilon$ without assuming any predetermined parametric form for $g$. However, the number of observations required for non-parametric regression increases exponentially with the number of covariables, and this number may be larger than what is generally available. This is usually known as the curse of dimensionality.
Cook [2007] defined the concept of sufficient reduction statistics to overcome this problem. Suppose that $Y$ is the response and $X \in \mathbb{R}^p$ is the vector of covariables. Let $R:\mathbb{R}^p \to \mathbb{R}^d$ with $d < p$. Then $R(X)$ is a sufficient reduction statistic of size $d$ for $X$ if $Y$ is independent of $X$ given $R(X)$. Therefore the vector $R(X)$ contains all the information that $X$ has about $Y$. This implies that $X$ can be replaced by $R(X)$ in the non-parametric regression. To obtain a sufficient reduction statistic, Cook [2007] introduced the principal fitted components (PFC) model. The PFC model assumes that
$$X = \mu + \Gamma \beta f(y) + \varepsilon,$$
where $\Gamma$ is a $p \times d$ matrix with $\Gamma^{\top}\Gamma = I_d$, $\beta$ is a $d \times r$ matrix where $d \le \min(p, r)$, and $f(y) = (f_1(y), \dots, f_r(y))^{\top}$ is a vector of known functions of $y$. The reduction depends on $\Gamma$ only through $\mathcal{S} = \operatorname{span}(\Gamma)$, where $\mathcal{S}$ is the subspace of dimension $d$ generated by the columns of $\Gamma$. For example, $f(y) = (y, y^2, \dots, y^r)^{\top}$. Calling $A = \Gamma\beta$, we can also write $X = \mu + A f(y) + \varepsilon$, where $A$ is $p \times r$ and $\operatorname{rank}(A) = d$. This implies that $X$ follows a reduced rank multiple regression model of rank $d$ with regressor equal to $f(y)$. Therefore the proposed estimators can also be used to estimate these models.
Cook [2007] and Cook and Forzani [2008] proved that under the PFC model, where $\varepsilon$ is $N_p(0, \Delta)$, a sufficient reduction statistic of dimension $d$ for $X$ is given by $R(X) = \Gamma^{\top}\Delta^{-1}X$.
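As a rough numerical illustration (not part of the abstract), the following Python sketch simulates data from a PFC-type model $X = \mu + \Gamma\beta f(y) + \varepsilon$ and forms the reduction $R(X) = \Gamma^{\top}\Delta^{-1}X$; all dimensions and parameter values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r, d = 500, 6, 3, 1  # sample size, covariables, basis size, reduction size

# Illustrative true parameters (made-up values)
y = rng.uniform(-2, 2, size=n)
f = np.column_stack([y, y**2, y**3])                  # f(y_i), an r-vector per case
Gamma, _ = np.linalg.qr(rng.standard_normal((p, d)))  # p x d with Gamma'Gamma = I_d
beta = rng.standard_normal((d, r))
Delta = np.diag(rng.uniform(0.5, 1.5, size=p))        # error covariance
mu = np.zeros(p)

# X_i = mu + Gamma beta f(y_i) + eps_i, with eps_i ~ N_p(0, Delta)
eps = rng.multivariate_normal(np.zeros(p), Delta, size=n)
X = mu + f @ beta.T @ Gamma.T + eps

# Sufficient reduction R(X) = Gamma' Delta^{-1} X: d-dimensional, here d = 1,
# so the non-parametric regression of y on R(X) becomes univariate.
R = X @ np.linalg.inv(Delta) @ Gamma                  # n x d
print(R.shape)
```

In practice $\Gamma$ and $\Delta$ are unknown and must be estimated, which is the subject of the estimators below.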
Let $(y_1, x_1), \dots, (y_n, x_n)$ be a random sample of the PFC model and consider the Mahalanobis distances between the predicted and observed values of $x_i$,
$$m_i(\mu, \Gamma, \beta, \Delta) = \left[(x_i - \mu - \Gamma\beta f(y_i))^{\top}\Delta^{-1}(x_i - \mu - \Gamma\beta f(y_i))\right]^{1/2}.$$
Then the maximum likelihood estimator is given by
$$(\hat\mu, \hat\Gamma, \hat\beta, \hat\Delta) = \arg\min\ \log\det(\Delta) + \frac{1}{n}\sum_{i=1}^{n} m_i^2(\mu, \Gamma, \beta, \Delta),$$
subject to $\Gamma^{\top}\Gamma = I_d$.
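Since the PFC model can be written as a reduced rank regression of $X$ on $f(y)$, a simple special case of the likelihood fit can be sketched numerically. The fragment below assumes isotropic errors ($\Delta = \sigma^2 I$), under which the rank-$d$ maximum likelihood fit reduces to ordinary least squares followed by an Eckart–Young truncation of the fitted matrix; all data and dimensions are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, r, d = 400, 5, 3, 2

# Simulate a rank-d reduced rank regression of X on f(y)
y = rng.uniform(-2, 2, size=n)
F = np.column_stack([y, y**2, y**3])                  # regressors f(y_i), n x r
A_true = rng.standard_normal((p, d)) @ rng.standard_normal((d, r))  # rank-d coefficient
X = F @ A_true.T + 0.1 * rng.standard_normal((n, p))

# Center both sides (profiles out the intercept mu), then ordinary least squares
Xc = X - X.mean(axis=0)
Fc = F - F.mean(axis=0)
B_ols, *_ = np.linalg.lstsq(Fc, Xc, rcond=None)       # r x p, unrestricted fit

# Rank-d truncation of the fitted matrix: best rank-d fit in Frobenius norm,
# which is the Gaussian ML solution when Delta = sigma^2 I
fitted = Fc @ B_ols
U, s, Vt = np.linalg.svd(fitted, full_matrices=False)
V_d = Vt[:d].T                                        # p x d, estimates span(Gamma)
B_rr = B_ols @ V_d @ V_d.T                            # rank-d coefficient estimate

print(np.linalg.matrix_rank(B_rr))
```

For general $\Delta$ the criterion above must be minimized jointly over all parameters; this sketch only illustrates the rank constraint.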
As most ML estimators that assume normal errors, the ML estimator of the PFC model is very sensitive to a few outliers; even a single outlier may take the ML estimator beyond any limit. To overcome this problem, a class of robust estimators for the PFC model based on a $\tau$-scale estimator (see Yohai and Zamar [1988]) is proposed. The $\tau$-scale estimators are highly robust and may have breakdown point 0.5. Let $\tau_n$ be a $\tau$-scale estimator, and let $c_p$ be its asymptotic value for a sample of the $\chi$ distribution with $p$ degrees of freedom. Then a class of robust estimators for the PFC model (called $\tau$-PFC estimators) is given by
$$(\hat\mu, \hat\Gamma, \hat\beta, \hat\Delta) = \arg\min\ \log\det(\Delta) + \frac{p}{c_p^2}\,\tau_n^2\big(m_1(\mu, \Gamma, \beta, \Delta), \dots, m_n(\mu, \Gamma, \beta, \Delta)\big),$$
subject to $\Gamma^{\top}\Gamma = I_d$.
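To illustrate the building block of this criterion, the following sketch computes a $\tau$-scale in the style of Yohai and Zamar [1988]: an M-scale with a bounded bisquare $\rho_1$ (high breakdown point) combined with a second bisquare $\rho_2$ (tunable efficiency). The tuning constants are commonly used values and the normalization is illustrative, not taken from the abstract.

```python
import numpy as np

def rho_bisquare(u, c):
    """Tukey bisquare rho function, bounded at 1."""
    v = np.minimum(np.abs(u) / c, 1.0)
    return 1.0 - (1.0 - v**2) ** 3

def m_scale(u, c=1.548, b=0.5, n_iter=100):
    """M-scale s solving mean(rho(u/s)) = b; these constants give
    50% breakdown point and consistency at the normal distribution."""
    s = np.median(np.abs(u)) / 0.6745  # normalized MAD as starting value
    for _ in range(n_iter):
        s = s * np.sqrt(np.mean(rho_bisquare(u / s, c)) / b)
    return s

def tau_scale(u, c1=1.548, c2=6.08, b=0.5):
    """tau-scale: an M-scale refined by a second, efficiency-oriented rho.
    Bounded rho's keep the scale stable under gross outliers."""
    s = m_scale(u, c=c1, b=b)
    return s * np.sqrt(np.mean(rho_bisquare(u / s, c2)) / b)

rng = np.random.default_rng(2)
clean = rng.standard_normal(1000)
contaminated = np.concatenate([clean, np.full(100, 50.0)])  # ~9% gross outliers
print(tau_scale(clean), tau_scale(contaminated))
```

Unlike the sample standard deviation, which explodes under this contamination, the $\tau$-scale of the contaminated sample stays close to that of the clean one; in the $\tau$-PFC criterion it plays the role of the average of the squared Mahalanobis distances.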
These estimators are strongly consistent under general conditions. A computational procedure based on iteratively weighted maximum likelihood estimators, where the weights penalize outlying observations, is given. A procedure based on cross validation to determine $d$, the dimension of the sufficient reduction statistic, is also proposed. A Monte Carlo study shows that the $\tau$-PFC estimators are simultaneously highly efficient and robust. We also give an example with real data where the non-parametric model using the sufficient reduction statistic obtained with the proposed robust procedure works better than using the one obtained with the maximum likelihood estimator.
References
- Cook [2007] R. D. Cook. Fisher lecture: Dimension reduction in regression. Statistical Science, 22(2):1–26, 2007.
- Cook and Forzani [2008] R. D. Cook and L. Forzani. Principal fitted components for dimension reduction in regression. Statistical Science, 23(4):485–501, 2008.
- Yohai and Zamar [1988] V. J. Yohai and R. H. Zamar. High breakdown-point estimates of regression by means of the minimization of an efficient scale. Journal of the American Statistical Association, 83(402):406–413, 1988.