Challenges and Robust Alternatives to Random Forests for Genomic Prediction in Breeding Studies

V.M. Lourenço¹, M.T. Braga², J.O. Ogutu³ and H.-P. Piepho³
Abstract

The presence of data contamination, such as errors or outliers, can severely impact statistical models, including Random Forests, particularly in high-dimensional genomic prediction and selection studies. While contamination can affect both response and covariate levels, this work focuses on response contamination, evaluating the robustness of the classical Random Forest method through simulations. Using a synthetic animal breeding dataset, we assess various contamination scenarios and explore robust adaptations to improve model resilience. Our study highlights the implications of data contamination in genomic prediction and aims to develop a robust Random Forest counterpart for routine use in breeding studies.

  • 1 Center for Mathematics and Applications (NOVAMath) & Department of Mathematics, NOVA FCT, Caparica, Portugal [vmml@fct.unl.pt]
  • 2 Timestamp - Sistemas de Informação, S.A., Portugal [miguelbraga40@gmail.com]
  • 3 Biostatistics Unit, University of Hohenheim, Stuttgart, Germany [jogutu2007@gmail.com; hans-peter.piepho@uni-hohenheim.de]

Keywords: Random Forests – Machine Learning – Robust Modelling

Overview

The analysis of real data is often vulnerable to violations of the underlying model assumptions, a problem exacerbated by data anomalies such as errors or outliers. In linear regression, even a single outlier can violate the normality assumption, compromising parameter estimation and any inference that depends on it (Huber [1981]). Machine learning (ML) methods, including Random Forests, are likewise not immune to data contamination in the form of outliers, noise, or missing data. The literature has recognized the need for robust statistical techniques to address this issue, particularly in high-dimensional data analysis involving variable selection and prediction (Brence and Brown [2006], Roy and Larocque [2012], Li and Martin [2017]). Such deficiencies can seriously undermine the performance of ML models by introducing biased estimates and predictions, which may lead to erroneous insights. This concern is especially pronounced in breeding programs, where biases in predictive accuracy can directly affect the outcomes of prediction and selection (Estaghvirou et al. [2014], Lourenço et al. [2017]).
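The sensitivity of least squares to a single outlier can be illustrated with a minimal sketch. The toy data, the size of the shift (50 units), and the helper name `ols_fit` are assumptions made purely for illustration and are not taken from the study.

```python
def ols_fit(x, y):
    """Closed-form simple least-squares fit; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical toy data lying exactly on the line y = 1 + 2x.
x = [float(i) for i in range(10)]
y_clean = [1.0 + 2.0 * xi for xi in x]

# Contaminate a single response (the last one) with a large shift.
y_contaminated = list(y_clean)
y_contaminated[-1] += 50.0

clean_intercept, clean_slope = ols_fit(x, y_clean)
cont_intercept, cont_slope = ols_fit(x, y_contaminated)

print(f"clean fit:        intercept={clean_intercept:.3f}, slope={clean_slope:.3f}")
print(f"contaminated fit: intercept={cont_intercept:.3f}, slope={cont_slope:.3f}")
```

One altered response out of ten is enough to pull both coefficients well away from the values that generated the remaining nine points.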

While data contamination can occur at both the response (output) and covariate (feature) levels, this work focuses primarily on the former. To address these challenges, we will evaluate the performance of the classical Random Forest method through simulations, incorporating robust techniques designed to enhance its resilience to data contamination. Specifically, we will use a synthetic animal breeding dataset from the literature (Lourenço et al. [2024]) and introduce a range of plausible contamination scenarios. The study aims to clarify the implications of data contamination in genomic prediction and selection for breeding studies, and to propose robust adaptations of Random Forests that mitigate the challenges posed by certain types of contamination. Ultimately, we hope to develop a robust counterpart to the Random Forest algorithm that can be used routinely alongside it in genomic prediction and selection studies.
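As a rough sketch of what a response contamination scenario might look like, the snippet below shifts a chosen fraction of simulated phenotypes by a multiple of their standard deviation. The contamination mechanism, the rates (0%, 5%, 10%), the shift of five standard deviations, and the function name `contaminate_response` are all assumptions for illustration; the scenarios actually used in the study are not specified here.

```python
import random
import statistics


def contaminate_response(y, rate, shift_sd=5.0, seed=0):
    """Return a contaminated copy of y and the sorted contaminated indices.

    A fraction `rate` of the responses is shifted upward by `shift_sd`
    sample standard deviations; the input list is left unchanged.
    """
    rng = random.Random(seed)
    n_bad = round(rate * len(y))
    bad_idx = sorted(rng.sample(range(len(y)), n_bad))
    sd = statistics.stdev(y)
    y_new = list(y)
    for i in bad_idx:
        y_new[i] += shift_sd * sd
    return y_new, bad_idx


# Hypothetical phenotypes standing in for the responses of a breeding dataset.
data_rng = random.Random(1)
phenotypes = [data_rng.gauss(0.0, 1.0) for _ in range(200)]

results = {}
for rate in (0.0, 0.05, 0.10):
    y_cont, idx = contaminate_response(phenotypes, rate)
    results[rate] = (y_cont, idx)
    print(
        f"rate={rate:.2f}  contaminated={len(idx):3d}  "
        f"mean={statistics.mean(y_cont):.3f}  "
        f"median={statistics.median(y_cont):.3f}"
    )
```

Comparing the mean and median of the contaminated responses gives some intuition for why aggregation rules less sensitive to extreme values are a natural candidate when adapting a forest, although this sketch does not implement any particular robust Random Forest.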

References

  • Brence and Brown [2006] M. J. R. Brence and D. E. Brown. Improving the robust random forest regression algorithm. Systems and Information Engineering Technical Papers, pages 1–18, 2006.
  • Estaghvirou et al. [2014] S. B. O. Estaghvirou, J. O. Ogutu, and H.-P. Piepho. Influence of outliers on accuracy estimation in genomic prediction in plant breeding. G3: Genes, Genomes, Genetics, 4(12):2317–2328, 2014.
  • Huber [1981] P. J. Huber. Robust Statistics. Wiley & Sons, New York, 1981.
  • Li and Martin [2017] A. H. Li and A. Martin. Forest-type regression with general losses and robust forest. In: International Conference on Machine Learning, pages 2091–2100, 2017.
  • Lourenço et al. [2017] V. M. Lourenço, P. C. Rodrigues, A. M. Pires, and H.-P. Piepho. A robust DF-REML framework for variance components estimation in genetic studies. Bioinformatics, 33(22):3584–3594, 2017.
  • Lourenço et al. [2024] V. M. Lourenço, J. O. Ogutu, R. A. P. Rodrigues, A. Posekany, and H.-P. Piepho. Genomic prediction using machine learning: A comparison of the performance of regularized regression, ensemble, instance-based and deep learning methods on synthetic and empirical data. BMC Genomics, 25(1):152, 2024.
  • Roy and Larocque [2012] M.-H. Roy and D. Larocque. Robustness of random forests for regression. Journal of Nonparametric Statistics, 24(4):993–1006, 2012.