Robust matrix completion for discrete rating-scale data
-
Toulouse Business School, Toulouse, France [a.archimbaud@tbs-education.fr]
-
Erasmus University Rotterdam, Rotterdam, The Netherlands [alfons@ese.eur.nl]
-
Maastricht University, Maastricht, The Netherlands [i.wilms@maastrichtuniversity.nl]
Keywords: Latent low-rank regularization – Missing data – Outliers – Recommender systems – Surveys.
1 Introduction
In recent years, there has been significant interest in matrix completion. The goal of matrix completion aims to predict the unknown entries of a partially observed matrix using its known elements, typically enforcing a low-rank constraint, while minimizing a specific criterion such as the mean squared error. Although common applications feature discrete rating-scale data, such as user-product rating matrices in recommender systems or surveys in the social and behavioral sciences, methods for matrix completion are almost always designed for and studied in the context of continuous data. Furthermore, only a small subset of the literature considers matrix completion in the presence of corrupted observations despite their common occurrence in practice. Examples include attacks on recommender systems (i.e., malicious users deliberately manipulating ratings to influence the recommender system to their advantage), or careless respondents in surveys (i.e., respondents providing answers irrespective of what the survey asks of them due to a lack of attention).
To fill these gaps, we introduce a novel matrix completion algorithm that is tailored towards the discrete nature of rating-scale data and robust to the presence of corrupted observations. Furthermore, we investigate the performance of our proposed method and several competing approaches through simulation studies using discrete rating-scale (rather than continuous) data and under various missing data mechanisms and types of corrupted observations. Our working paper version of this work is available in Archimbaud et al. [2024] and our implementation in the add-on package rdmc [Alfons and Archimbaud, 2024] for the statistical computing environment R [R Core Team, 2024]. The main part of the code is thereby written in C++ to improve computational efficiency.
2 A robust procedure for discrete matrix completion
2.1 Notations
Suppose that we observe an incomplete rating matrix of rows (representing individuals providing the ratings) and columns (representing the rated items) with elements , where denotes the number of discrete rating categories and for . Let denote the index set of observed entries in and define the projection to be the -dimensional matrix whose elements are given by
2.2 Problem definition
The general problem consists on finding a -dimensional matrix such that:
for all .
To solve this problem, the sparsity or low-rank matrix assumption is essential, otherwise it is not possible to find a solution. The rank minimization problem, without knowing the rank information, can be written such as:
Solving this general problem with this rank constraint is NP-hard since the rank function is non differentiable. Usual approaches use surrogate functions for the rank as for example the convex relaxation via the nuclear norm.
In order to address both of the main drawbacks of the popular matrix completion methods, we want to add a discrete constraint as in Huang et al. [2013] and generalize the norm to a (pseudo-)norm based on a robust loss function (see also Tang and Guan [2020]):
subject to |
with .
However, this problem might not have any solution because both the rank constraint and the discrete constraint are not necessarily fulfilled simultaneously. To resolve this issue, an ancillary continuous matrix can be introduced such that the discreteness constraint operates on and the nuclear norm constraint on , while ensuring that and remain as close as possible. Putting all of this together, we obtain the following optimization problem in augmented Lagrangian form:
subject to |
where denotes the Frobenius inner product, is a multiplier adjusting for the discrepancy between and , and is an additional regularization parameter.
3 Results and conclusion
We conduct an extensive simulation study in a comprehensive framework to investigate the performance of our proposed method compared with median imputation and the widely-used procedure Soft-Impute [Mazumder et al., 2010, Hastie et al., 2015]. We also include variants of those methods that discretize the obtained predictions to the given rating scale. We focus our analysis under various types of corrupted observations and different type of missingness mechanisms (missing completely at random - MCAR - or not at random - MNAR). To perform this evaluation, we propose to go beyond the standard metrics such as the root mean squared error which are not well suited for evaluation on discrete data, nor data containing corrupted observations.
Results demonstrate that the proposed method protects well against corrupted observations—such as attacks with fake user profiles on recommender systems or careless respondents in surveys—while paying only a small price in terms of performance in the absence of corrupted observations. Two empirical applications further highlight the practical relevance of the method. In addition, we contribute to the literature by providing an analysis that is fully reproducible and based only on open-source software, addressing the recent call for this by LeBlanc et al. [2024] and following the best practices mentioned in Ekstrand et al. [2011].
References
- Alfons and Archimbaud [2024] Andreas Alfons and Aurore Archimbaud. rdmc: Robust Discrete Matrix Completion, 2024. URL https://github.com/aalfons/rdmc. R package version 0.1.0.
- Archimbaud et al. [2024] Aurore Archimbaud, Andreas Alfons, and Ines Wilms. Robust matrix completion for discrete rating-scale data. arXiv preprint arXiv:2412.20802, 2024.
- Ekstrand et al. [2011] Michael D Ekstrand, Michael Ludwig, Joseph A Konstan, and John T Riedl. Rethinking the recommender research ecosystem: reproducibility, openness, and lenskit. In Proceedings of the fifth ACM conference on Recommender systems, pages 133–140, 2011.
- Hastie et al. [2015] Trevor Hastie, Rahul Mazumder, Jason D Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. Journal of Machine Learning Research, 16(104):3367–3402, 2015.
- Huang et al. [2013] Jin Huang, Feiping Nie, and Heng Huang. Robust discrete matrix completion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 27, pages 424–430, 2013. doi: 10.1609/aaai.v27i1.8675.
- LeBlanc et al. [2024] Patrick M. LeBlanc, David Banks, Linhui Fu, Mingyan Li, Zhengyu Tang, and Qiuyi Wu. Recommender systems: A review. Journal of the American Statistical Association, 119(545):773–785, 2024. doi: 10.1080/01621459.2023.2279695.
- Mazumder et al. [2010] Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11(80):2287–2322, 2010.
- R Core Team [2024] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2024. URL https://www.R-project.org/.
- Tang and Guan [2020] Li Tang and Weili Guan. Robust matrix completion with complex noise. Multimedia Tools and Applications, 79(3–4):2703–2717, 2020. doi: https://doi.org/10.1007/s11042-019-08430-2.