Robust data cleaning for tensor-valued observations, with application to GC×GC-MS data
P. Filzmoser, L. Micheler, N. Lim, and E. Rosenberg
Institute of Statistics and Mathematical Methods in Economics, TU Wien, Austria,
AC2T research GmbH, Austria,
Institute of Chemical Technologies and Analytics, TU Wien, Austria
Many modern measurement devices produce matrix-valued data (e.g. images) or even tensor-valued data. An example of the latter is data from two-dimensional gas chromatography (GC) coupled with mass spectrometry (MS), so-called GC×GC-MS data, where for each mass number an image is observed with intensity values along the first and second retention times. Some of these hundreds of images contain artifacts caused by the measurement process. For reliable data analysis the data must be corrected first, which requires identifying the images that contain artifacts.
This problem boils down to identifying outlying matrices in tensor data. For that purpose we assume that the slices of the tensor follow a matrix normal distribution, and we estimate the parameters by the Matrix Minimum Covariance Determinant (MMCD) estimator [1].
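As an illustration of how outlying slices can be flagged once the parameters of the matrix normal model are available, the following minimal sketch computes squared matrix Mahalanobis distances and compares them with a chi-square quantile. It does not implement the MMCD estimator itself: the mean `M` and the row and column covariances `Sigma_row` and `Sigma_col` are assumed to be given (for example from a robust fit), and the function names and the 0.99 quantile are hypothetical choices made for this example.

```python
import numpy as np
from scipy.stats import chi2


def matrix_mahalanobis_sq(X, M, Sigma_row, Sigma_col):
    """Squared matrix Mahalanobis distance of each slice.

    X has shape (n, p, q); M is (p, q); Sigma_row is (p, p); Sigma_col is (q, q).
    For a slice X_i with R = X_i - M the distance is
    tr(Sigma_col^{-1} R^T Sigma_row^{-1} R).
    """
    R = X - M
    row_inv = np.linalg.inv(Sigma_row)
    col_inv = np.linalg.inv(Sigma_col)
    # Sum over R[i,l] * row_inv[i,k] * R[k,j] * col_inv[j,l] for every slice n.
    return np.einsum("nil,ik,nkj,jl->n", R, row_inv, R, col_inv, optimize=True)


def flag_outlying_slices(X, M, Sigma_row, Sigma_col, quantile=0.99):
    """Boolean mask of slices whose distance exceeds a chi-square cutoff.

    Assumption: under the matrix normal model with known parameters the
    squared distance follows a chi-square distribution with p*q degrees
    of freedom, so its quantile serves as a cutoff.
    """
    _, p, q = X.shape
    cutoff = chi2.ppf(quantile, df=p * q)
    return matrix_mahalanobis_sq(X, M, Sigma_row, Sigma_col) > cutoff
```

In practice the quality of the flags depends on the parameter estimates: plugging in non-robust estimates lets the outlying slices distort the distances, which is the motivation for a robust estimator such as the MMCD.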
After removing or correcting the outlying slices for each sample, we can continue working with the cleaned tensor-valued observations. In this application the observations are measurements of fuel samples originating from different fuel types. In order to characterize chemical differences among the types we use tensor-PCA [2], where the loadings indicate the mass numbers that make it possible to distinguish the fuel types. Such an automated procedure for characterizing chemical differences in GC×GC-MS data is novel.
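To make the role of the loadings concrete, the sketch below shows one common formulation of PCA for tensor-valued observations, in which a covariance matrix is computed separately for each mode and then eigendecomposed. This is an assumption made for illustration and not necessarily the exact variant used in [2]; the function name `modewise_pca_loadings` and the axis layout (samples first, followed by the tensor modes) are hypothetical.

```python
import numpy as np


def modewise_pca_loadings(X, mode, n_components):
    """Leading eigenvalues and loadings of the covariance of one tensor mode.

    X has shape (n, d_1, ..., d_k) with samples along axis 0.
    mode is the axis of X (1..k) whose loadings are wanted, e.g. the axis
    indexing mass numbers.
    """
    Xc = X - X.mean(axis=0)  # center across samples
    # Unfold: bring the chosen mode next to the sample axis, flatten the rest.
    U = np.moveaxis(Xc, mode, 1).reshape(X.shape[0], X.shape[mode], -1)
    # Mode covariance, averaged over samples and over the flattened modes.
    S = np.einsum("nij,nkj->ik", U, U) / (U.shape[0] * U.shape[2])
    evals, evecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_components]
    return evals[order], evecs[:, order]
```

Entries of a loading vector with large absolute value point to the coordinates of that mode (here, mass numbers) that contribute most to the corresponding component.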
Keywords: Robustness, Tensor, PCA.
References
- [1] M. Mayrhofer, U. Radojičić, and P. Filzmoser (2025). Robust covariance estimation and explainable outlier detection for matrix-valued data. Technometrics, 67(3), 516–530.
- [2] J. Virta, S. Taskinen, and K. Nordhausen (2016). Applying fully tensorial ICA to fMRI data. IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 1–6.