A robust approach to generalized canonical correlation analysis based on scatter matrices

N. Kudraszow (1), A. Vahnovan (2), J. Ferrario (3) and M. V. Fasano (4)

  • (1) Centro de Matemática de La Plata, Universidad Nacional de La Plata and CONICET, La Plata, Argentina [nkudraszow@mate.unlp.edu.ar]
  • (2) Centro de Matemática de La Plata, Universidad Nacional de La Plata, La Plata, Argentina [avahnovan@mate.unlp.edu.ar]
  • (3) Centro de Matemática de La Plata, Universidad Nacional de La Plata, La Plata, Argentina [jferrario@mate.unlp.edu.ar]
  • (4) Centro de Matemática de La Plata, Universidad Nacional de La Plata, La Plata, Argentina [vicky@mate.unlp.edu.ar]

Keywords: Robustness – Dimension reduction – Atypical data detection – Scatter matrices

Generalized Canonical Correlation Analysis (GCCA) is a powerful tool for analyzing and understanding linear relationships between multiple sets of variables. The first proposals for GCCA are due to Horst [1961] and Carroll [1968], but Kettenring [1971] has become the standard reference. Kettenring’s contribution presents an integrated formulation with five different ways of measuring the degree of association between sets, all of which reduce to classical Canonical Correlation Analysis when there are only two random vectors. With more than two sets, however, the five criteria can give different results, since each one identifies a distinct type of linear relationship between the sets.
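The two-set equivalence can be checked numerically. The sketch below assumes the five criteria are the ones usually attributed to Kettenring [1971] (SUMCOR, SSQCOR, MAXVAR, MINVAR, GENVAR), each written as a function of the correlation matrix Phi of the canonical variates. These names, and the helper functions, are not taken from this abstract and serve only as illustration.

```python
import numpy as np

def kettenring_criteria(Phi):
    """Five association measures computed from the correlation matrix Phi
    of the canonical variates (one variate per set of variables)."""
    eig = np.linalg.eigvalsh(Phi)  # eigenvalues in ascending order
    return {
        "SUMCOR": Phi.sum(),           # sum of correlations (to be maximized)
        "SSQCOR": (Phi ** 2).sum(),    # sum of squared correlations (maximized)
        "MAXVAR": eig[-1],             # largest eigenvalue (maximized)
        "MINVAR": eig[0],              # smallest eigenvalue (minimized)
        "GENVAR": np.linalg.det(Phi),  # generalized variance (minimized)
    }

def two_set_criteria(rho):
    """Criteria when there are only two sets, whose variates have correlation rho."""
    Phi = np.array([[1.0, rho], [rho, 1.0]])
    return kettenring_criteria(Phi)

crit = two_set_criteria(0.5)
for name, value in crit.items():
    print(f"{name}: {value:.3f}")
```

With two sets and rho >= 0 the criteria equal 2 + 2 rho, 2 + 2 rho^2, 1 + rho, 1 - rho and 1 - rho^2, so optimizing any of them amounts to maximizing rho, which is the classical canonical correlation problem. With three or more sets, Phi has several free off-diagonal entries and the five criteria no longer need to share an optimum.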

Tenenhaus and Tenenhaus [2011] introduced a regularized version of GCCA, called RGCCA, which provides a unified statistical framework for analyzing multiple sets of variables. RGCCA not only includes the methods proposed by Kettenring as particular cases, but also encompasses a wide range of models that account for a network of connections between the sets of variables (called blocks); see Girka et al. [2023] for a complete list. Although the constrained optimization problem that defines RGCCA has no analytical solution, these authors implemented a monotonically convergent algorithm. They proposed estimating the solution through a plug-in approach that replaces the covariance matrices with their sample versions. Because high multicollinearity within blocks or high-dimensional settings can make the sample covariance matrix ill-conditioned, they introduced a ridge-type regularization into the problem.
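To illustrate why a ridge-type correction helps, here is a minimal sketch, assuming the regularization takes the shrinkage form tau * I + (1 - tau) * S for a within-block sample covariance S. The function name `ridge_shrink` and the value of `tau` are hypothetical choices made for this example, not part of the RGCCA software.

```python
import numpy as np

def ridge_shrink(S, tau):
    """Convex combination of S and the identity matrix.
    tau in [0, 1] is a tuning parameter (hypothetical name)."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    p = S.shape[0]
    return tau * np.eye(p) + (1.0 - tau) * S

rng = np.random.default_rng(0)
n, p = 10, 20                      # fewer observations than variables
X = rng.standard_normal((n, p))
S = np.cov(X, rowvar=False)        # rank at most n - 1, hence singular

S_reg = ridge_shrink(S, tau=0.2)

rank_S = np.linalg.matrix_rank(S)
min_eig_reg = np.linalg.eigvalsh(S_reg).min()
print("rank of S:", rank_S, "of", p)
print("smallest eigenvalue after shrinkage:", min_eig_reg)
```

Each eigenvalue lambda of S is mapped to tau + (1 - tau) * lambda, so for tau > 0 the shrunken matrix is invertible even when S is singular.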

All the aforementioned authors studied the problem of maximizing functions based on the sample Pearson correlation or the sample covariance, which are known to be sensitive to atypical observations. Consequently, the resulting GCCA estimates are not robust. A functional version of GCCA based on scatter matrices will be presented; for an appropriate choice of the scatter matrix, it leads to robust and Fisher-consistent estimators. For cases where the scatter matrices are ill-conditioned, a modification based on an estimate of the precision matrix will be introduced. The good finite-sample performance of the proposed methods will be illustrated by a simulation study involving clean and contaminated samples. A procedure for identifying influential observations will also be illustrated through an application to a real data set.
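The plug-in idea can be sketched for the simplest case of two blocks, where GCCA reduces to classical canonical correlation analysis. The code below is only an illustration under stated assumptions: `trimmed_scatter` is a crude, hypothetical stand-in for a robust scatter matrix (a few concentration-type steps started from coordinatewise medians and MADs), not the estimator proposed by the authors, and the contamination scheme is invented for the example.

```python
import numpy as np

def inv_sqrt(A):
    """Inverse symmetric square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs / np.sqrt(vals)) @ vecs.T

def canonical_correlations(V, p):
    """Canonical correlations implied by a scatter matrix V whose first p
    coordinates form block 1 and whose remaining coordinates form block 2."""
    V11, V12, V22 = V[:p, :p], V[:p, p:], V[p:, p:]
    M = inv_sqrt(V11) @ V12 @ inv_sqrt(V22)
    return np.linalg.svd(M, compute_uv=False)

def trimmed_scatter(Z, keep=0.75, n_iter=10):
    """Hypothetical robust stand-in: repeatedly recompute the mean and
    covariance from the fraction `keep` of rows with the smallest
    Mahalanobis distance, starting from medians and squared MADs."""
    h = int(np.ceil(keep * Z.shape[0]))
    center = np.median(Z, axis=0)
    mad = np.median(np.abs(Z - center), axis=0)
    scatter = np.diag(mad ** 2)
    for _ in range(n_iter):
        R = Z - center
        d2 = np.einsum("ij,jk,ik->i", R, np.linalg.inv(scatter), R)
        idx = np.argsort(d2)[:h]           # rows closest to the current center
        center = Z[idx].mean(axis=0)
        scatter = np.cov(Z[idx], rowvar=False)
    return scatter

# Two blocks of two variables; the true canonical correlations are (0.8, 0).
rng = np.random.default_rng(1)
Sigma = np.eye(4)
Sigma[0, 2] = Sigma[2, 0] = 0.8
Z = rng.multivariate_normal(np.zeros(4), Sigma, size=500)

Z_bad = Z.copy()
Z_bad[:50, 0] = 10.0                        # 10% of rows get an outlying value

rho_classical = canonical_correlations(np.cov(Z_bad, rowvar=False), p=2)
rho_robust = canonical_correlations(trimmed_scatter(Z_bad), p=2)
print("sample covariance plug-in:", rho_classical)
print("trimmed scatter plug-in:  ", rho_robust)
```

Canonical correlations are unchanged when the scatter matrix is multiplied by a positive constant, so the plug-in only needs a scatter matrix proportional to the covariance matrix. This is why no consistency factor is applied to the trimmed estimate in this sketch.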

References

  • Carroll [1968] J. D. Carroll. Generalization of canonical correlation analysis to three or more sets of variables. In American Psychological Association, pages 227–228, 1968.
  • Girka et al. [2023] F. Girka, E. Camenen, C. Peltier, A. Gloaguen, V. Guillemot, L. Le Brusquet, and A. Tenenhaus. Multiblock data analysis with the RGCCA package. Journal of Statistical Software, pages 1–36, 2023.
  • Horst [1961] P. Horst. Relations among m sets of measures. Psychometrika, 26:129–149, 1961.
  • Kettenring [1971] J. R. Kettenring. Canonical analysis of several sets of variables. Biometrika, 58:433–451, 1971.
  • Tenenhaus and Tenenhaus [2011] A. Tenenhaus and M. Tenenhaus. Regularized generalized canonical correlation analysis. Psychometrika, 76(2):257–284, 2011.