Minimum Covariance Determinant Estimator and Outlier Detection for Interval-valued Data

C.P. Loureiro¹, M.R. Oliveira¹, P. Brito² and L. Oliveira³

  • ¹ CEMAT & Department of Mathematics, Instituto Superior Técnico, Lisbon, Portugal
    [catarinapadrela@tecnico.ulisboa.pt, rosario.oliveira@tecnico.ulisboa.pt]

  • ² Faculdade de Economia, Universidade do Porto & LIAAD-INESC TEC, Porto, Portugal
    [mpbrito@fep.up.pt]

  • ³ CAMGSD & Department of Mathematics, Instituto Superior Técnico, Lisbon, Portugal
    [lina.oliveira@tecnico.ulisboa.pt]

Keywords: Symbolic Data Analysis – Minimum Covariance Determinant Estimator – Mallows’ Distance – Outlier Detection

The increasing demand for analysing large datasets has led to the development of Symbolic Data Analysis, which focuses on modelling complex data structures. Symbolic data preserves the inherent variability of the original data while reducing the dataset size. Interval-valued data, in which an interval of real numbers is recorded for each variable in each unit, is one of the most common symbolic data types. Nevertheless, this framework presents theoretical and methodological challenges that require innovative solutions.

The location and scale of interval-valued random vectors can be obtained using the barycentre approach based on Mallows’ distance [Irpino and Verde, 2015; Oliveira et al., 2024]. However, as in conventional data analysis, these (classical) estimates are highly sensitive to the anomalous data points that occur in real-life datasets, which calls for robust methods.
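As an illustration of the barycentre idea, the sketch below computes the squared Mallows' distance between two intervals and the barycentre of a set of intervals for a single variable. It assumes a uniform distribution within each interval, which is an assumption made here for illustration and not something the abstract states; the function names `mallows_sq` and `barycentre` are hypothetical. Under that assumption the quantile function of an interval with centre c and half-range r is c + r(2t - 1), so the squared distance reduces to (c1 - c2)^2 + (r1 - r2)^2 / 3, and the barycentre has the mean centre and the mean half-range.

```python
from statistics import fmean


def mallows_sq(a, b):
    """Squared Mallows (L2 Wasserstein) distance between intervals a and b.

    Assumes (for illustration only) a uniform distribution inside each interval.
    Intervals are (lower, upper) pairs.
    """
    ca, ra = (a[0] + a[1]) / 2, (a[1] - a[0]) / 2  # centre, half-range of a
    cb, rb = (b[0] + b[1]) / 2, (b[1] - b[0]) / 2  # centre, half-range of b
    return (ca - cb) ** 2 + (ra - rb) ** 2 / 3


def barycentre(intervals):
    """Interval minimising the sum of squared Mallows distances to `intervals`
    (uniform assumption): mean centre plus or minus mean half-range."""
    c = fmean((lo + hi) / 2 for lo, hi in intervals)
    r = fmean((hi - lo) / 2 for lo, hi in intervals)
    return (c - r, c + r)


sample = [(0.0, 2.0), (1.0, 5.0)]
print(mallows_sq(*sample))
print(barycentre(sample))
```

Because both the centre and the half-range of the barycentre are plain averages, a single extreme interval can shift either one arbitrarily far, which is the sensitivity the robust estimator is meant to address.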

To address this issue, we develop robust estimators for location and scale, extending the Minimum Covariance Determinant (MCD) estimator [Rousseeuw and van Driessen, 1999] to interval-valued data. We start by formulating the optimization problem and showing that the objective function is concave [Boyd and Vandenberghe, 2004]. This enables us to employ the Majorization-Minorization algorithm [Lange, 2016] with a Taylor expansion to derive the MCD estimator for interval-valued data. The resulting MCD algorithm yields a robust distance, which is used to detect anomalous observations, either by setting appropriate cut-off values or by leveraging the farness concept [Raymaekers et al., 2022].
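For readers unfamiliar with the MCD, the sketch below shows the classical concentration step (C-step) of Rousseeuw and van Driessen [1999] on conventional point data, followed by cut-off based flagging. It is not the interval-valued Majorization-Minorization algorithm described above, whose details the abstract does not give. The function name `raw_mcd`, the random h-subset initialisation, the absence of a consistency correction, and the simulated data are all simplifying assumptions made here.

```python
import numpy as np


def raw_mcd(X, h, n_starts=10, max_iter=100, seed=0):
    """Raw MCD via C-steps on point data (simplified sketch, no reweighting).

    Returns the location, scatter and squared robust distances of the
    h-subset with the smallest covariance determinant found.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    best = None
    for _ in range(n_starts):
        # Assumption: start from a random h-subset (FastMCD uses smaller ones).
        subset = np.sort(rng.choice(n, size=h, replace=False))
        for _ in range(max_iter):
            mu = X[subset].mean(axis=0)
            cov = np.cov(X[subset], rowvar=False)
            diff = X - mu
            # Squared Mahalanobis distance of every observation.
            d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
            # C-step: keep the h observations closest to the current fit.
            new_subset = np.sort(np.argsort(d2)[:h])
            if np.array_equal(new_subset, subset):
                break
            subset = new_subset
        det = np.linalg.det(cov)
        if best is None or det < best[0]:
            best = (det, mu, cov, d2)
    return best[1], best[2], best[3]


# Simulated example: 80 clean points and 20 clustered outliers (20% contamination).
data_rng = np.random.default_rng(1)
clean = data_rng.normal(size=(80, 2))
outliers = 10 + 0.1 * data_rng.normal(size=(20, 2))
X = np.vstack([clean, outliers])

mu, cov, d2 = raw_mcd(X, h=75)
# 0.999 chi-square quantile; this closed form holds only for p = 2 variables.
cutoff = -2 * np.log(1 - 0.999)
flagged = d2 > cutoff
print(mu, int(flagged.sum()))
```

Each C-step cannot increase the determinant of the subset covariance, so the iteration converges in finitely many steps; several starts are used because a single start may stop at a local optimum.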

Finally, we assess the performance of the interval-valued MCD estimator and the outlier detection method using both synthetic and real datasets. In the simulation study, we compare the proposed method with the classical estimators across varying contamination levels. Our findings indicate that the interval-valued MCD estimator effectively estimates the symbolic covariance matrix even with 20% contamination, whereas the classical estimator fails. Additionally, our method achieves high accuracy in identifying anomalous observations, making it a valuable tool for robust symbolic data analysis in real-world applications.

Acknowledgments

We thank FCT - Fundação para a Ciência e a Tecnologia for the grant UI/BD/153720/2021, and the projects UIDB/04621/2020, UIDB/04459/2020, UIDP/04459/2020, and LA/P/0063/2020.

References

  • Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004. doi: 10.1017/CBO9780511804441.
  • Irpino and Verde [2015] A. Irpino and R. Verde. Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, 9(2):143–175, 2015. doi: 10.1007/s11634-014-0176-4.
  • Lange [2016] K. Lange. MM Optimization Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, 2016. doi: 10.1137/1.9781611974409.
  • Oliveira et al. [2024] M. R. Oliveira, D. Pinheiro, and L. Oliveira. Location and association measures for interval data based on Mallows’ distance. 2024. doi: 10.48550/arXiv.2407.05105.
  • Raymaekers et al. [2022] J. Raymaekers, P. J. Rousseeuw, and M. Hubert. Class Maps for Visualizing Classification Results. Technometrics, 64(2):151–165, 2022. doi: 10.1080/00401706.2021.1927849.
  • Rousseeuw and van Driessen [1999] P. J. Rousseeuw and K. van Driessen. A Fast Algorithm for the Minimum Covariance Determinant Estimator. Technometrics, 41(3):212–223, 1999. doi: 10.2307/1270566.