Genevera Allen

(Department of Statistics, Center for Theoretical Neuroscience, Zuckerman Institute, Irving Institute, Columbia University, New York, NY, USA)

Fast and Powerful Minipatch Ensemble Learning for Discovery and Inference

Enormous quantities of data are collected in many industries and disciplines; these data hold the key to solving critical societal and scientific problems. Yet fitting models to make discoveries from such massive data often poses both computational and statistical challenges. In this talk, we propose a new ensemble learning strategy primed for fast, distributed, and memory-efficient computation that also has many statistical advantages. Inspired by random forests, stability selection, and stochastic optimization, we build ensembles from tiny subsamples of both observations and features that we term minipatches.
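
To fix ideas, here is a minimal Python sketch of the minipatch idea (an illustration only, not the authors' implementation; the decision-tree base learner and the patch sizes are arbitrary choices): draw many tiny random subsamples of both observations and features, fit a weak learner on each, and average the predictions.

    # Illustrative minipatch ensemble (a sketch, not the authors' code).
    # Each minipatch is a tiny random subsample of observations AND features.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    n, p = 1000, 200
    X = rng.normal(size=(n, p))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=n)

    n_patches, n_obs, n_feat = 200, 50, 10      # hypothetical minipatch sizes
    models, feat_sets = [], []
    for _ in range(n_patches):
        rows = rng.choice(n, size=n_obs, replace=False)
        cols = rng.choice(p, size=n_feat, replace=False)
        m = DecisionTreeRegressor(max_depth=3).fit(X[np.ix_(rows, cols)], y[rows])
        models.append(m)
        feat_sets.append(cols)

    # Ensemble prediction: average over all minipatch learners.
    X_new = rng.normal(size=(5, p))
    pred = np.mean([m.predict(X_new[:, cols]) for m, cols in zip(models, feat_sets)], axis=0)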

While minipatch learning can easily be applied to prediction tasks, much as random forests are, this talk focuses on using minipatch ensemble approaches in unconventional ways: for making data-driven discoveries and for statistical inference. Specifically, we will discuss using this ensemble strategy for feature selection, clustering, and graph learning, as well as for distribution-free and model-agnostic inference for both predictions and important features. Through large-scale real-data examples from neuroscience, genomics, and biomedicine, we illustrate the computational and statistical advantages of our minipatch ensemble learning approaches.
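
As a rough illustration of the discovery side, the sketch below scores features by how often a simple sparse learner selects them across minipatches; the lasso selector, its penalty, and the patch sizes are placeholder choices, not the specific procedures discussed in the talk.

    # Illustrative minipatch feature-selection frequencies (a sketch only).
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n, p = 500, 100
    X = rng.normal(size=(n, p))
    y = 3 * X[:, 0] - 2 * X[:, 5] + rng.normal(size=n)

    n_patches, n_obs, n_feat = 300, 60, 20
    selected = np.zeros(p)      # times a feature was selected
    sampled = np.zeros(p)       # times a feature appeared in a minipatch
    for _ in range(n_patches):
        rows = rng.choice(n, size=n_obs, replace=False)
        cols = rng.choice(p, size=n_feat, replace=False)
        coef = Lasso(alpha=0.1).fit(X[np.ix_(rows, cols)], y[rows]).coef_
        sampled[cols] += 1
        selected[cols[np.abs(coef) > 1e-8]] += 1

    # Stability score: selection frequency given that the feature was sampled.
    score = np.divide(selected, sampled, out=np.zeros(p), where=sampled > 0)
    print(np.argsort(score)[::-1][:5])      # top-ranked features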

Keywords

  • Ensemble Learning
  • Double Subsampling
  • Conformal Inference
  • Feature Importance Inference
  • Graphical Models
  • Clustering

References

  • L. Gan, L. Zheng, and G.I. Allen, Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles, (Submitted), arXiv:2206.02088, 2024+.
  • T. Yao, M. Wang, and G.I. Allen, Fast and Accurate Graph Learning for Huge Data via Minipatch Ensembles, (Submitted), arXiv:2110.12067, 2024+.
  • L. Gan and G.I. Allen, Fast and Interpretable Consensus Clustering via Minipatch Learning, PLoS Computational Biology, 18:10, e1010577, 2022.
   

Luis Angel Garcia-Escudero

(Departamento de Estadística e I.O. and IMUVA, University of Valladolid)

Robust clustering in (moderately) high dimensional cases

Outliers can negatively impact Cluster Analysis. One might view outliers as separate clusters, leading to the idea that simply increasing the number of clusters, \(K\), could be a natural way to manage them. However, this approach is often not the best strategy and can even be completely impractical. Consequently, several robust clustering techniques have been developed to address this issue. These techniques are also useful for highlighting potentially relevant anomalies in data, especially when dealing with datasets that may naturally include different subpopulations. In this talk, we will focus exclusively on robust clustering methods based on trimming (see García-Escudero and Mayo-Iscar, 2024, for a recent review). Among these methods, TCLUST (García-Escudero et al., 2008) stands out as a prominent approach, as it extends the well-known MCD method (Rousseeuw, 1985) for Cluster Analysis by incorporating both trimming and eigenvalue-ratio constraints.
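
To illustrate the role of trimming in its simplest form, the following Python sketch implements a toy trimmed k-means (spherical clusters only, with no scatter matrices or eigenvalue-ratio constraints, so far simpler than TCLUST): at each concentration step the proportion \(\alpha\) of points farthest from their nearest center is discarded before the centers are updated.

    # Toy trimmed k-means: a minimal instance of trimming-based robust clustering
    # (much simpler than TCLUST; shown only to illustrate the trimming idea).
    import numpy as np

    def trimmed_kmeans(X, K, alpha=0.1, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        keep = int(np.ceil((1 - alpha) * n))        # points retained after trimming
        centers = X[rng.choice(n, size=K, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            nearest = d.min(axis=1)
            kept = np.argsort(nearest)[:keep]       # trim the alpha farthest points
            labels = d.argmin(axis=1)
            for k in range(K):
                pts = X[kept][labels[kept] == k]
                if len(pts) > 0:
                    centers[k] = pts.mean(axis=0)
        labels = np.where(np.isin(np.arange(n), kept), labels, -1)   # -1 = trimmed
        return centers, labels

    # Example: two clusters plus scattered outliers.
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2)),
                   rng.uniform(-10, 15, (20, 2))])
    centers, labels = trimmed_kmeans(X, K=2, alpha=0.1)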

TCLUST, along with the algorithms and packages used for its implementation, is known to be quite reliable when dealing with low-dimensional data. However, in Statistics, it is increasingly common to encounter problems arising from higher dimensionality, where outliers still occur. Detecting outliers while mitigating their harmful effects becomes more challenging. For instance, it is evident that the performance of TCLUST deteriorates significantly as dimensionality increases. This presentation will discuss these challenges, as well as some promising initial solutions for tackling this problem, at least in the case of moderately high dimensionality. The main difficulty in using TCLUST in high dimensions is the large number of parameters that arise when handling the \(K\) scatter matrices for the fitted components. Constraining the maximum ratio between the eigenvalues of these scatter matrices is a reasonable way to “regularize” the TCLUST objective function and has been shown to be useful in practice. However, this regularization unfortunately limits the detectable clusters to spherical clusters with the same dispersion, which can be overly restrictive.
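
What the eigenvalue-ratio constraint does can be sketched crudely as follows (an illustration only: actual TCLUST algorithms apply an optimal truncation of the eigenvalues, not this naive clipping): pool the eigenvalues of the \(K\) scatter matrices and force the largest-to-smallest ratio to stay below a constant \(c\).

    # Naive eigenvalue-ratio constraint (illustration only; TCLUST implementations
    # use an optimal truncation of the eigenvalues rather than simple clipping).
    import numpy as np

    def constrain_scatter(S_list, c=10.0):
        """Clip eigenvalues of the K scatter matrices so the max/min ratio is <= c."""
        eigs = [np.linalg.eigh(S) for S in S_list]
        top = max(w.max() for w, _ in eigs)      # largest eigenvalue over all clusters
        out = []
        for w, V in eigs:
            w_clipped = np.clip(w, top / c, top)
            out.append(V @ np.diag(w_clipped) @ V.T)
        return out

    # Example: two scatter matrices with a huge dispersion imbalance.
    S1 = np.diag([100.0, 1.0])
    S2 = np.diag([0.01, 0.5])
    S1c, S2c = constrain_scatter([S1, S2], c=20.0)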

An alternative approach, as dimensionality increases, is to assume that the different clusters are grouped around \(K\) subspaces of dimension lower than the ambient space. This approach is employed in the Robust Linear Grouping method (García-Escudero et al., 2009), which can be viewed as a simultaneous clustering and dimensionality reduction technique. However, Robust Linear Grouping does not take into account the information related to the specific “coordinates” of the projection of the observations onto the \(K\) approximating subspaces, as its objective function considers only orthogonal errors.
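
The distinction between orthogonal errors and within-subspace coordinates can be made concrete with a small sketch (an illustration only): project points onto a \(q\)-dimensional principal subspace and split each point into its coordinates inside the subspace, which Robust Linear Grouping ignores, and its orthogonal residual, the only term entering its objective.

    # Sketch: orthogonal error vs. within-subspace coordinates for one cluster.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
    mu = X.mean(axis=0)
    q = 2                                                     # subspace dimension
    U = np.linalg.svd(X - mu, full_matrices=False)[2][:q].T   # basis of the subspace

    scores = (X - mu) @ U                    # coordinates inside the subspace
    residual = (X - mu) - scores @ U.T       # orthogonal error component
    orth_err = np.linalg.norm(residual, axis=1)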

To find a compromise between TCLUST and Robust Linear Grouping, leveraging the dimensionality reduction power of Robust Linear Grouping and the ability of TCLUST to model the projections of observations onto the \(K\) approximating subspaces, we consider a robust extension of the HDD method (Bouveyron et al., 2007) through trimming and suitable constraints. An algorithm for implementing this methodology will be introduced, and its application will be illustrated with examples.

When dealing with increasing dimensionality in robust clustering based on trimming, it is essential to consider additional non-trivial aspects. One such issue is the proper initialization of the concentration steps typically applied at the algorithmic level. While using random initializations is feasible in principle, a large number of random initializations would be needed to ensure a reliable starting point, highlighting the need for improved initialization schemes. Another important consideration is the possibility of incorporating cellwise trimming rather than just casewise trimming, as trimming entire rows of the data matrix may discard too much valuable information. Some proposals to address these two key issues will be presented in the talk. Finally, it is important to emphasize that we are not attempting to solve the problem of handling extremely high-dimensional cases (limiting ourselves to moderately high dimensions). The problem of extremely high-dimensional cases is complex even without contamination, and making certain assumptions about sparsity may become essential in such situations. Some interesting approaches in this direction, such as those by Kondo et al. (2016) and Brodinová et al. (2019), will be briefly discussed.

Keywords

  • Robust clustering
  • Trimming
  • Model-based clustering
  • Cellwise contamination

References

  • Bouveyron, C., Girard, S., and Schmid, C. (2007). High-Dimensional Data Clustering. Computational Statistics and Data Analysis, 52, 502-519.
  • Brodinová, S., Filzmoser, P., Ortner, T., Breiteneder, C., and Rohm, M. (2019). Robust and sparse \(K\)-means clustering for high-dimensional data. Advances in Data Analysis and Classification, 13, 905-932.
  • García-Escudero, L.A., Gordaliza, A., Matrán, C., and Mayo-Iscar, A. (2008). A general trimming approach to robust cluster analysis. Annals of Statistics, 36, 1324-1345.
  • García-Escudero, L.A., Gordaliza, A., San Martín, R., Van Aelst, S., and Zamar, R. (2009). Robust linear clustering. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 71, 301-318.
  • García-Escudero, L.A., and Mayo-Iscar, A. (2024). Robust clustering based on trimming. Wiley Interdisciplinary Reviews. Computational Statistics, 16, e1658.
  • Kondo, Y., Salibian-Barrera, M., and Zamar, R. (2016). RSKC: An R package for a robust and sparse \(k\)-means clustering algorithm. Journal of Statistical Software, 72, 1-26.
  • Rousseeuw, P. (1985). Multivariate estimation with high breakdown point. In W. Grossmann, G. Pflug, I. Vincze, and W. Wertz (Eds.), Mathematical statistics and applications (Vol. B, pp. 283–297). Dordrecht: Reidel.
   

Marc Hallin

(Department of Mathematics, Université libre de Bruxelles, Belgium, and Czech Academy of Sciences, Prague, Czech Republic)

Directional Nonlinear Principal and Independent Components: a measure transportation approach

Traditional Principal and Independent Component Analysis (PCA and ICA) are inherently linear and bidirectional: principal directions, in both cases, are linear combinations defined up to their signs. While this approach is perfectly justified in a linear and symmetric context – essentially, under Gaussian or elliptical symmetry assumptions – a more flexible, nonlinear, and directional approach is more appropriate under more general distributions. Measure transportation offers the ideal tool for such an extension. Inspired by the measure-transportation-based concepts of Monge-Kantorovich depth and center-outward distribution functions introduced in Chernozhukov et al. (2017) and Hallin et al. (2021), we propose new, nonlinear and directional, notions of principal and independent components (grounded in monotone transports to the uniform over the unit ball \(\mathbb{S}_d\) and to the uniform over the unit cube \([-1, 1]^d\), respectively). Principal directions, in our approach, are curves originating from a (data-driven) central set (instead of running through some origin) and maximizing the dispersion of appropriate one-sided curvilinear projections; the underlying transports are not necessarily continuous at the center, making one-sidedness a natural feature. Contrary to the classical linear ones, our nonlinear independent components, under absolute continuity assumptions, always exist.
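
For readers who want a computational handle on the center-outward construction, the following sketch (dimension 2, an illustration only, ignoring refinements such as points assigned to the center) computes the empirical center-outward distribution function as an optimal assignment of the sample to a regular grid over the unit disk; the assigned grid points yield center-outward ranks (radii) and signs (directions).

    # Empirical center-outward distribution function in dimension 2, computed as
    # an optimal L2 assignment of the sample to a regular grid on the unit disk.
    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(0)
    n_r, n_s = 10, 20                           # numbers of radii and directions
    n = n_r * n_s
    X = rng.multivariate_normal([0, 0], [[2, 1], [1, 1]], size=n)

    # Regular grid over the unit disk: n_s equispaced directions, n_r radii.
    radii = np.arange(1, n_r + 1) / (n_r + 1)
    angles = 2 * np.pi * np.arange(n_s) / n_s
    grid = np.array([[r * np.cos(a), r * np.sin(a)] for r in radii for a in angles])

    # Optimal coupling minimizing the total squared distance sample-to-grid.
    cost = cdist(X, grid, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)
    F_plus = np.empty_like(X)
    F_plus[rows] = grid[cols]                   # center-outward value of each X[i]

    ranks = np.linalg.norm(F_plus, axis=1)      # center-outward ranks (radii)
    signs = F_plus / ranks[:, None]             # center-outward signs (directions)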

Keywords

  • Nonlinear principal components
  • Nonlinear independent components
  • Measure transportation
  • Dimension reduction

References

  • Chernozhukov, V., Galichon, A., Hallin, M. and Henry, M. (2017). Monge-Kantorovich depth, quantiles, ranks and signs. Annals of Statistics 45, 223-256.

  • Hallin, M., del Barrio, E., Cuesta-Albertos, J. and Matrán, C. (2021). Center-outward distribution functions, quantiles, ranks, and signs in \({\mathbb R}^d\). Annals of Statistics 49, 1139-1165.

   

Johannes Lederer

(Department of Mathematics, University of Hamburg)

Data Science, Where Statistics Meets Optimization

Modern data science spans computer science, mathematics, and applications. Hence, these different fields need to support and nourish each other in order to reach the full potential of data science. This talk brings the roles of statistics and optimization into sharp focus. We will discuss two examples: we start with deep learning, where mathematical statistics can lead to a more profound understanding of computer-science pipelines; we then turn to extremes, where efficient computing algorithms can lead to mathematical models for contemporary data. You will walk away from this talk with a clear understanding of how statistics and optimization can work together to improve data science.

Keywords

  • Data science
  • Deep learning
  • Extreme-value theory
  • Optimization

References

  • Lederer, Johannes. Fundamentals of high-dimensional statistics. Springer Texts in Statistics, 2022.
  • Taheri, Mahsa, Néhémy Lim, and Johannes Lederer. Balancing Statistical and Computational Precision: A General Theory and Applications to Sparse Regression. IEEE Transactions on Information Theory 69.1 (2022): 316-333.
  • Taheri, Mahsa, Fang Xie, and Johannes Lederer. Statistical Guarantees for Approximate Stationary Points of Simple Neural Networks. arXiv preprint arXiv:2205.04491 (2022).
  • Lederer, Johannes, and Marco Oesting. Extremes in high dimensions: Methods and scalable algorithms. arXiv preprint arXiv:2303.04258 (2023).

Valentin Todorov

(United Nations Industrial Development Organization (retired), Vienna, Austria)

Fortifying Statistical Analyses: Software Tools for Robust Methods

The practical deployment and success of robust methods are inconceivable without reliable and user-friendly software. This necessity was recognized early on, leading to the development of the first robust statistical software on platforms such as SAS, S-Plus, and MATLAB. This talk provides an overview of key software ecosystems, highlighting their features, use cases, and suitability for various audiences. Currently, two MATLAB toolboxes for robust statistics are popular: LIBRA, developed by the research groups in robust statistics of the Katholieke Universiteit Leuven (Department of Mathematics) and the University of Antwerp (Department of Mathematics and Computer Science), and FSDA, a joint effort by the University of Parma and the Joint Research Centre (JRC) of the European Commission. However, the R programming environment, a free software platform for statistical computing and graphics, has emerged as a viable alternative, offering developers and users extensive capabilities for creating and applying robust methods.

Many researchers have significantly contributed to making robust statistical methods accessible. On CRAN alone, over 700 R packages include the terms “robust” or “outlier” in their names, titles, or descriptions. This abundance of options can be overwhelming for both beginners and experienced users. To address this, we review the 25 most significant R packages for various tasks, briefly describing their functionalities. We also explore several key topics in robust statistics, presenting methodologies, implementations in R, and applications to real-world data. Particular attention is given to robust methods and algorithms suited for high-dimensional data.

While robust methods have long been available in R and MATLAB, Python users have only recently gained access to a comprehensive package – RobPy – that offers such methods within a cohesive framework. However, comparable development in Julia remains limited. Despite the progress in robust statistical software, challenges persist, including computational efficiency, ease of use, integration with big data frameworks, and compatibility with machine learning systems.
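
As a small illustration of the kind of robust building block now available to Python users (here scikit-learn's MinCovDet, a Fast-MCD implementation, shown instead of RobPy itself), the snippet below estimates a robust covariance matrix and flags outliers via robust Mahalanobis distances.

    # Robust covariance estimation and outlier flagging with scikit-learn's
    # MinCovDet (Fast-MCD); an illustration, independent of the RobPy package.
    import numpy as np
    from scipy.stats import chi2
    from sklearn.covariance import MinCovDet

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=200)
    X[:10] += 8                                  # contaminate 5% of the rows

    mcd = MinCovDet(random_state=0).fit(X)
    d2 = mcd.mahalanobis(X)                      # squared robust Mahalanobis distances
    outliers = d2 > chi2.ppf(0.975, df=X.shape[1])
    print(outliers.sum(), "flagged outliers")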

The future undoubtedly holds exciting advancements for R, MATLAB, Python, and Julia, promising to enrich the statistical community with even more powerful and versatile tools.

Keywords

  • Robustness
  • Software
  • R
  • MATLAB
  • Python
  • Julia

References

  • A.C. Atkinson, M. Riani, A. Corbellini, D. Perrotta, and V. Todorov. Robust Statistics through the Monitoring Approach: Applications in Regression. Springer-Verlag, Heidelberg (2025). In press.

  • S. Leyder, J. Raymaekers, P.J. Rousseeuw, T. Servotte, and T. Verdonck. RobPy: A Python package for robust statistical methods, arXiv preprint arXiv:2411.01954 (2024).

  • M. Riani, D. Perrotta, and F. Torti. FSDA: A MATLAB toolbox for robust analysis and interactive data exploration. Chemometrics and Intelligent Laboratory Systems, 116:17-32 (2012).

  • V. Todorov. The R package ecosystem for robust statistics. Wiley Interdisciplinary Reviews: Computational Statistics, 16(6):e70007 (2024).