Robust Multi-Model Subset Selection
Anthony Christidis
Department of Biomedical Informatics, Harvard Medical School [anthony.christidis@stat.ubc.ca]
Gabriela V. Cohen Freue
Department of Statistics, University of British Columbia, Vancouver, Canada [gcohen@stat.ubc.ca]
Keywords: Robust methods – High-dimensional data – Ensemble methods – Multi-model optimization – Regularization methods – Variable selection.
1 Introduction
The rapid growth of digital technologies has led to an explosive increase in data, revolutionizing how real-world phenomena are modelled and predicted. Large volumes of data of different types, formats, and structures can be rapidly collected, generated, and integrated. As a result, modern datasets often contain a large number of variables (in columns), some of which are irrelevant or redundant, and that number typically exceeds the number of observations (in rows). Along with more sophisticated data-collection processes comes the inclusion of outlying observations, also known as data contamination. Outlying observations can be challenging to handle and can adversely affect subsequent analyses, particularly in complex high-dimensional datasets. Although outliers are not always undesired anomalies and may carry valuable insights, only methods that are robust to outliers can accurately identify them and resist their influence.
Regularized regression methods and stepwise selection methods have been proposed to model datasets with many predictors relative to the number of samples, enabling the selection of an optimal subset of predictors for building interpretable predictive models. However, depending on the loss function used, these methods may be very sensitive to outliers, which can degrade both their variable selection and their prediction performance. To address this problem, many row-wise robust procedures have been proposed [Maronna, 2011, Alfons et al., 2013, Smucler and Yohai, 2017, Cohen Freue et al., 2019, Maronna et al., 2019, Kepplinger, 2023, Thompson, 2022], as well as, more recently, cell-wise robust procedures [Filzmoser et al., 2020, Bottmer et al., 2022, Su et al., 2024], among others.
Ensemble methods can be used to generate and aggregate multiple diverse models, and often outperform single-model methods in high-dimensional prediction tasks. Some notable examples of these ensemble methods include random forests (RF) [Breiman, 2001], and random generalized linear models (RGLM) [Song et al., 2013]. More recently, Christidis et al. [2020] and Christidis et al. [2024] proposed methods that generate ensembles comprised of a small number of sparse and diverse models learned directly from the data without any form of randomization or heuristics. However, both the ensembles and the individual models that comprise them are not robust and are thus very sensitive to outliers.
2 Contribution and Results
In this talk, we introduce a robust multi-model subset selection (RMSS) method that combines stepwise selection with robust objective functions to generate robust ensembles. The resulting ensembles comprise a small number of sparse, robust, and diverse models in a regression setting. We develop a tailored computing algorithm with attractive convergence properties that builds the ensembles by leveraging recent developments in optimization. The levels of sparsity, diversity, and robustness within each model are driven directly by the data.
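RMSS couples robust objective functions with stepwise search across several models simultaneously; the actual algorithm is not reproduced here. As a rough, single-model illustration of the underlying idea only, the following sketch runs a greedy forward search under a least-trimmed-squares criterion (computed with concentration steps). All function names and the tuning choices below are hypothetical, not part of the authors' implementation.

```python
import numpy as np

def trimmed_rss(A, y, h, n_csteps=10):
    """Least-trimmed-squares criterion via concentration (C-)steps:
    alternately fit OLS to the h best-fitting observations and
    re-select those h observations."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # start from full-sample OLS
    for _ in range(n_csteps):
        idx = np.argsort((y - A @ beta) ** 2)[:h]
        beta, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
    return np.sum(np.sort((y - A @ beta) ** 2)[:h])

def robust_forward_select(X, y, t, alpha=0.75):
    """Greedy forward selection of t predictors under the trimmed loss;
    alpha is the fraction of observations trusted as clean."""
    n, p = X.shape
    h = int(alpha * n)
    selected = []
    for _ in range(t):
        candidates = [j for j in range(p) if j not in selected]
        best = min(candidates,
                   key=lambda j: trimmed_rss(X[:, selected + [j]], y, h))
        selected.append(best)
    return selected

# Toy data: only the first two of ten predictors matter,
# and the first 10 responses are grossly contaminated.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=100)
y[:10] += 50  # gross outliers
print(robust_forward_select(X, y, t=2))  # typically recovers the true support
```

An RMSS-style ensemble would additionally fit several such models at once, with a diversity penalty discouraging them from sharing predictors; that coupling is omitted here for brevity.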
We establish the finite-sample breakdown point of the ensembles and of the models that comprise them. Using extensive simulation studies and artificially contaminated real datasets from bioinformatics and cheminformatics, we show that the ensembles generated by RMSS generally outperform single-model sparse and robust methods in high-dimensional prediction tasks. The flexibility offered by our method can be particularly appealing for practitioners collecting and analyzing high-dimensional complex data. The implementation of our robust multi-model stepwise selection algorithm is available on CRAN [R Core Team, 2022] in the R package robStepSplitReg [Christidis and Cohen Freue, 2023b]. RMSS ensembles can be fit using the R package RMSS [Christidis and Cohen Freue, 2023a], which recovers robust best subset selection (RBSS) as a special case. The source code of RMSS is written in C++, and multithreading via OpenMP [Chandra et al., 2001] is available in the package to further speed up computations.
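For context, the finite-sample breakdown point referred to above is the standard replacement notion (see, e.g., Maronna et al., 2019): the smallest fraction of observations that, when replaced by arbitrary values, can carry the estimator beyond all bounds.

```latex
% Replacement finite-sample breakdown point of an estimator \hat{\beta}
% at a sample Z of n observations:
\varepsilon^*_n\big(\hat{\beta}; Z\big)
  = \min\left\{ \frac{m}{n} \,:\,
      \sup_{Z_m} \big\| \hat{\beta}(Z_m) \big\| = \infty \right\}
```

where $Z_m$ ranges over all samples obtained from $Z$ by replacing $m$ of its observations with arbitrary values.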
To the best of our knowledge, this is the first robust ensemble method proposed in the literature. We are currently extending our methodology to be resistant to cell-wise contamination, and some preliminary results will be presented in this talk. With the growing emphasis on interpretable statistical and machine learning algorithms in the literature and in real data applications, our proposal may pave the way for the development of other robust ensemble methods. A potential bottleneck in this area of research is the high computational cost of such methods; thus, new optimization tools will be needed to render such ensemble methods feasible in practice.
References
- Maronna [2011] Maronna, R. A. (2011). Robust ridge regression for high-dimensional data. Technometrics 53(1), 44–53.
- Maronna et al. [2019] Maronna, R. A., R. D. Martin, V. J. Yohai, and M. Salibián-Barrera (2019). Robust statistics: theory and methods (with R). John Wiley & Sons.
- Alfons et al. [2013] Alfons, A., C. Croux, and S. Gelper (2013). Sparse least trimmed squares regression for analyzing high-dimensional large data sets. The Annals of Applied Statistics 7(1), 226–248.
- Christidis and Cohen Freue [2023a] Christidis, A. and G. V. Cohen Freue (2023a). RMSS: Robust Multi-Model Subset Selection. R package version 1.1.1.
- Christidis and Cohen Freue [2023b] Christidis, A. and G. V. Cohen Freue (2023b). robStepSplitReg: Robust Stepwise Split Regularized Regression. R package version 1.1.0.
- Christidis et al. [2024] Christidis, A.-A., S. Van Aelst, and R. Zamar (2024). Multi-model subset selection. Computational Statistics & Data Analysis. In press.
- Christidis et al. [2020] Christidis, A.-A., L. Lakshmanan, E. Smucler, and R. Zamar (2020). Split regularized regression. Technometrics 62(3), 330–338.
- Cohen Freue et al. [2019] Cohen Freue, G. V., D. Kepplinger, M. Salibián-Barrera, and E. Smucler (2019). Robust elastic net estimators for variable selection and identification of proteomic biomarkers. The Annals of Applied Statistics 13(4), 2065–2090.
- Smucler and Yohai [2017] Smucler, E. and V. J. Yohai (2017). Robust and sparse estimators for linear regression models. Computational Statistics & Data Analysis 111, 116–130.
- Kepplinger [2023] Kepplinger, D. (2023). Robust variable selection and estimation via adaptive elastic net s-estimators for linear regression. Computational Statistics & Data Analysis 183, 107730.
- Thompson [2022] Thompson, R. (2022). Robust subset selection. Computational Statistics & Data Analysis, 107415.
- Breiman [2001] Breiman, L. (2001, October). Random forests. Machine Learning 45(1), 5–32.
- Song et al. [2013] Song, L., P. Langfelder, and S. Horvath (2013). Random generalized linear model: a highly accurate and interpretable ensemble predictor. BMC Bioinformatics 14(1), 5.
- R Core Team [2022] R Core Team (2022). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
- Chandra et al. [2001] Chandra, R., L. Dagum, D. Kohr, R. Menon, D. Maydan, and J. McDonald (2001). Parallel programming in OpenMP. Morgan Kaufmann.
- Su et al. [2024] Su, P., G. Tarr, S. Muller, and S. Wang (2024). CR-Lasso: Robust cellwise regularized sparse regression. Computational Statistics & Data Analysis 197, 107971.
- Filzmoser et al. [2020] Filzmoser, P., S. Höppner, I. Ortner, S. Serneels, and T. Verdonck (2020). Cellwise robust M regression. Computational Statistics & Data Analysis 147, 106944.
- Bottmer et al. [2022] Bottmer, L., C. Croux, and I. Wilms (2022). Sparse regression for large data sets with outliers. European Journal of Operational Research 297, 782–794.