Plenary Speakers

Genevera Allen

(Department of Statistics, Center for Theoretical Neuroscience, Zuckerman Institute, Irving Institute, Columbia University, New York, NY USA)
Personal website

Fast and Powerful Minipatch Ensemble Learning for Discovery and Inference

Enormous quantities of data are collected in many industries and disciplines; this data holds the key to solving critical societal and scientific problems. Yet, fitting models to make discoveries from this huge data often poses both computational and statistical challenges. In this talk, we propose a new ensemble learning strategy primed for fast, distributed, and memory-efficient computation that also has many statistical advantages. Inspired by random forests, stability selection, and stochastic optimization, we propose to build ensembles based on tiny subsamples of both observations and features that we term minipatches.
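
The double-subsampling idea can be conveyed with a short sketch (a rough illustration only, not the speaker's implementation; the function name, patch sizes, and use of NumPy are assumptions made here for concreteness):

```python
# Minimal sketch of minipatch subsampling: each minipatch is a tiny random
# subset of both observations (rows) and features (columns).
# Names and patch sizes are illustrative, not taken from the cited papers.
import numpy as np

def draw_minipatches(X, y, n_patches=100, n_obs=50, n_feat=10, seed=None):
    """Yield (rows, cols, X_patch, y_patch) for each random minipatch."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    for _ in range(n_patches):
        rows = rng.choice(n, size=min(n_obs, n), replace=False)
        cols = rng.choice(p, size=min(n_feat, p), replace=False)
        yield rows, cols, X[np.ix_(rows, cols)], y[rows]
```

Because every base learner sees only a tiny fraction of the rows and columns, the ensemble members can be fit cheaply, in parallel, and with a small memory footprint.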

While minipatch learning can easily be applied to prediction tasks similarly to random forests, this talk focuses on using minipatch ensemble approaches in unconventional ways: making data-driven discoveries and for statistical inference. Specifically, we will discuss using this ensemble strategy for feature selection, clustering, and graph learning as well as for distribution-free and model-agnostic inference for both predictions and important features. Through huge real data examples from neuroscience, genomics and biomedicine, we illustrate the computational and statistical advantages of our minipatch ensemble learning approaches.
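
For the feature-selection use case, one plausible (and heavily simplified) aggregation, in the spirit of the stability-selection inspiration mentioned above, is to record how often each feature receives a nonzero coefficient on the minipatches that contain it. The lasso base selector and the numerical tolerances below are illustrative assumptions, not the speaker's procedure:

```python
# Illustrative only: per-feature selection frequency across minipatches.
# The lasso base selector, alpha, and the 1e-8 tolerance are assumptions.
import numpy as np
from sklearn.linear_model import Lasso

def minipatch_selection_frequency(X, y, n_patches=200, n_obs=50, n_feat=10,
                                  alpha=0.1, seed=None):
    """Fraction of minipatches containing feature j on which j is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    hits = np.zeros(p)
    shown = np.zeros(p)
    for _ in range(n_patches):
        rows = rng.choice(n, size=min(n_obs, n), replace=False)
        cols = rng.choice(p, size=min(n_feat, p), replace=False)
        fit = Lasso(alpha=alpha).fit(X[np.ix_(rows, cols)], y[rows])
        shown[cols] += 1
        hits[cols[np.abs(fit.coef_) > 1e-8]] += 1
    return np.divide(hits, shown, out=np.zeros(p), where=shown > 0)
```

Features that are selected on a large fraction of the minipatches that contain them are natural candidates for the downstream inference discussed in the talk.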

Keywords

  • Ensemble Learning
  • Double Subsampling
  • Conformal Inference
  • Feature Importance Inference
  • Graphical Models
  • Clustering

References

  • L. Gan, L. Zheng, and G.I. Allen, Model-Agnostic Confidence Intervals for Feature Importance: A Fast and Powerful Approach Using Minipatch Ensembles, (Submitted), arXiv:2206.02088, 2024+.
  • T. Yao, M. Wang, and G.I. Allen, Fast and Accurate Graph Learning for Huge Data via Minipatch Ensembles, (Submitted), arXiv:2110.12067, 2024+.
  • L. Gan and G.I. Allen, Fast and Interpretable Consensus Clustering via Minipatch Learning, PLoS Computational Biology, 18:10, e1010577, 2022.
   

Luis Angel Garcia-Escudero

(Departamento de Estadística e I.O. and IMUVA, University of Valladolid)
Personal website

Robust clustering in (moderately) high dimensional cases

Outliers can negatively impact Cluster Analysis. One might view outliers as separate clusters, leading to the idea that simply increasing the number of clusters, \(K\), could be a natural way to manage them. However, this approach is often not the best strategy and can even be completely impractical. Consequently, several robust clustering techniques have been developed to address this issue. These techniques are also useful for highlighting potentially relevant anomalies in data, especially when dealing with datasets that may naturally include different subpopulations. In this talk, we will focus exclusively on robust clustering methods based on trimming (see García-Escudero and Mayo-Iscar, 2024, for a recent review). Among these methods, TCLUST (García-Escudero et al., 2008) stands out as a prominent approach, as it extends the well-known MCD method (Rousseeuw, 1985) to Cluster Analysis by incorporating both trimming and eigenvalue-ratio constraints.
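
As a rough illustration of the trimming mechanism, the sketch below shows one "concentration step" for trimmed \(k\)-means, i.e. the spherical special case. Full TCLUST additionally estimates cluster weights and scatter matrices under eigenvalue-ratio constraints, and in practice dedicated packages such as the tclust R package implement the complete method; the function name and the trimming level below are illustrative only.

```python
# Simplified concentration step for trimmed k-means (spherical special case of
# trimming-based clustering): assign, discard the worst-fitting fraction,
# re-estimate. This only illustrates the trimming mechanism, not full TCLUST.
import numpy as np

def trimmed_kmeans_cstep(X, centers, trim=0.1):
    """One concentration step: assign points, trim the worst fraction, re-estimate centers."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
    labels = d2.argmin(axis=1)
    best = d2.min(axis=1)
    keep = best <= np.quantile(best, 1.0 - trim)                    # drop the trim fraction
    new_centers = np.vstack([
        X[keep & (labels == k)].mean(axis=0)
        if np.any(keep & (labels == k)) else centers[k]
        for k in range(centers.shape[0])
    ])
    return labels, keep, new_centers
```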

TCLUST, along with the algorithms and packages used for its implementation, is known to be quite reliable when dealing with low-dimensional data. However, in Statistics, it is increasingly common to encounter problems arising from higher dimensionality, where outliers still occur. Detecting outliers while mitigating their harmful effects becomes more challenging. For instance, it is evident that the performance of TCLUST deteriorates significantly as dimensionality increases. This presentation will discuss these challenges, as well as some promising initial solutions for tackling this problem, at least in the case of moderately high dimensionality. The main difficulty in using TCLUST in high dimensions is the large number of parameters that arise when handling the \(K\) scatter matrices for the fitted components. Constraining the maximum ratio between the eigenvalues of these scatter matrices is a reasonable way to “regularize” the TCLUST objective function and has been shown to be useful in practice. However, this regularization unfortunately limits the detectable clusters to spherical clusters with the same dispersion, which can be overly restrictive.
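
The eigenvalue-ratio constraint itself can be made concrete with a crude sketch: clip the eigenvalues of the \(K\) scatter matrices so that their overall max/min ratio is at most a chosen constant \(c\). The actual algorithm solves an optimal-truncation problem rather than the naive clipping shown here; this code is only meant to convey what the constraint does.

```python
# Crude illustration of the eigenvalue-ratio constraint: clip all eigenvalues
# of the K scatter matrices into [m, c*m]. The real algorithm uses an optimal
# truncation; here m is chosen so that the largest eigenvalue is kept fixed.
import numpy as np

def enforce_eigenratio(scatters, c=10.0):
    """Return the K scatter matrices with eigenvalues clipped so that max/min <= c."""
    eigdecs = [np.linalg.eigh(S) for S in scatters]
    all_vals = np.concatenate([vals for vals, _ in eigdecs])
    m = all_vals.max() / c
    return [vecs @ np.diag(np.clip(vals, m, c * m)) @ vecs.T
            for vals, vecs in eigdecs]
```

Taking \(c = 1\) forces all clipped eigenvalues to coincide, which is exactly the spherical, equal-dispersion situation described above.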

An alternative approach, as dimensionality increases, is to assume that the different clusters are grouped around \(K\) subspaces of dimension lower than the ambient space. This approach is employed in the Robust Linear Grouping method (García-Escudero et al., 2009), which can be viewed as a simultaneous clustering and dimensionality reduction technique. However, Robust Linear Grouping does not take into account the information related to the specific “coordinates” of the projection of the observations onto the \(K\) approximating subspaces, as its objective function considers only orthogonal errors.
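
The orthogonal errors that Robust Linear Grouping trims and minimizes can be computed as below for a single cluster, under the simplifying assumption that the approximating subspace is fitted by ordinary PCA; the full method alternates assignment, trimming, and subspace refitting.

```python
# Squared orthogonal distances of observations to a q-dimensional affine
# subspace fitted by PCA: the residuals that Robust Linear Grouping works with.
# Single-cluster sketch only; not the full alternating algorithm.
import numpy as np

def orthogonal_errors(X, q):
    """Squared orthogonal distances of the rows of X to the best-fit q-dimensional affine subspace."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    B = Vt[:q].T                        # (p, q) orthonormal basis of the subspace
    residual = Xc - Xc @ B @ B.T        # component orthogonal to the subspace
    return (residual ** 2).sum(axis=1)
```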

To find a compromise between TCLUST and Robust Linear Grouping, leveraging the dimensionality reduction power of Robust Linear Grouping and the ability of TCLUST to model the projections of observations onto the \(K\) approximating subspaces, we consider a robust extension of the HDDC method (Bouveyron et al., 2007) through trimming and suitable constraints. An algorithm for implementing this methodology will be introduced, and its application will be illustrated with examples.

When dealing with increasing dimensionality in robust clustering based on trimming, it is essential to consider additional non-trivial aspects. One such issue is the proper initialization of the concentration steps typically applied at the algorithmic level. While using random initializations is feasible in principle, a large number of random initializations would be needed to ensure a reliable starting point, highlighting the need for improved initialization schemes. Another important consideration is the possibility of incorporating cellwise trimming rather than just casewise trimming, as trimming entire rows of the data matrix may discard too much valuable information. Some proposals to address these two key issues will be presented in the talk. Finally, it is important to emphasize that we are not attempting to solve the problem of handling extremely high-dimensional cases (limiting ourselves to moderately high dimensions). The problem of extremely high-dimensional cases is complex even without contamination, and making certain assumptions about sparsity may become essential in such situations. Some interesting approaches in this direction, such as those by Kondo et al. (2016) and Brodinová et al. (2019), will be briefly discussed.
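
The contrast between casewise and cellwise trimming can be illustrated with a toy sketch: flag individual suspicious cells via a robust z-score instead of discarding entire rows. Real cellwise procedures are considerably more refined; the cut-offs below are arbitrary choices made only for illustration.

```python
# Toy contrast between cellwise and casewise trimming, using a robust
# per-column z-score (median/MAD). Cellwise trimming masks individual entries;
# casewise trimming discards whole rows. Cut-offs are illustrative.
import numpy as np

def casewise_and_cellwise_flags(X, cell_cut=3.0, case_frac=0.1):
    """Flag suspicious cells (cellwise) and suspicious whole rows (casewise)."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) * 1.4826 + 1e-12       # robust scale per column
    z = np.abs(X - med) / mad                                        # robust z-score, cell by cell
    cell_flags = z > cell_cut                                        # mask individual cells only
    row_score = z.max(axis=1)
    case_flags = row_score >= np.quantile(row_score, 1 - case_frac)  # discard entire rows
    return cell_flags, case_flags
```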

Keywords

  • Robust clustering
  • Trimming
  • Model-based clustering
  • Cellwise contamination

References

  • Bouveyron, C., Girard, S., and Schmid, C. (2007). High-Dimensional Data Clustering. Computational Statistics and Data Analysis, 52, 502-519.
  • Brodinová, S., Filzmoser, P., Ortner, T., Breiteneder, C., and Rohm, M. (2019). Robust and sparse \(k\)-means clustering for high-dimensional data. Advances in Data Analysis and Classification, 13, 905-932.
  • García-Escudero, L.A., Gordaliza, A., Matrán, C., and Mayo-Iscar, A. (2008). A general trimming approach to robust cluster analysis. Annals of Statistics, 36, 1324-1345.
  • García-Escudero, L.A., Gordaliza, A., San Martín, R., van Aelst, S., and Zamar, R. (2009). Robust linear clustering. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 71, 301-318.
  • García-Escudero, L.A., and Mayo-Iscar, A. (2024). Robust clustering based on trimming. Wiley Interdisciplinary Reviews: Computational Statistics, 16, e1658.
  • Kondo, Y., Salibian-Barrera, M., and Zamar, R. (2016). RSKC: An R package for a robust and sparse \(k\)-means clustering algorithm. Journal of Statistical Software, 72, 1-26.
  • Rousseeuw, P. (1985). Multivariate estimation with high breakdown point. In W. Grossmann, G. Pflug, I. Vincze, and W. Wertz (Eds.), Mathematical statistics and applications (Vol. B, pp. 283–297). Dordrecht: Reidel.
   

Marc Hallin

(Université libre de Bruxelles)
Personal website

TBA

Keywords

  • TBA

References

  • TBA
   

Johannes Lederer

(Department of Mathematics, University of Hamburg)
Personal website

TBA

Keywords

  • TBA

References

  • TBA