Outlier Detection in Histogram-valued Data
Ana Martins, Sónia Dias, Paula Brito, and Peter Filzmoser
Institute of Electronics and Informatics Engineering of Aveiro, Aveiro, Portugal
Instituto Politécnico de Viana do Castelo & LIAAD-INESC TEC, Portugal
Fac. Economia, Universidade do Porto & LIAAD-INESC TEC, Porto, Portugal
Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Vienna, Austria
We introduce a novel method for multivariate outlier detection in histogram-valued data. The proposed method is based on Donoho’s outlyingness measure and is inspired by the approach of [1] for functional data, where observations are projected into a one-dimensional space. In our case, this projection takes advantage of the linear combination proposed in [2], which is based on the representation of empirical distributions by their quantile functions. Assuming a Uniform distribution within each sub-interval of the observed histograms, these quantile functions are piecewise linear. The proposed outlyingness measure may rely either on the L1-Wasserstein or on the Mallows (L2-Wasserstein) distance to compare distributions.
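As an illustration of the quantile-function representation described above, the following minimal Python sketch evaluates the piecewise linear quantile function of a histogram and computes the L1-Wasserstein and Mallows (L2-Wasserstein) distances between two histograms. It is not the authors' implementation and does not compute the outlyingness measure itself. The function names, the (edges, probabilities) encoding of a histogram, and the requirement that all bin probabilities be strictly positive are assumptions made for this example.

```python
import numpy as np


def _cum_probs(probs):
    """Cumulative probabilities 0 = w_0 < w_1 < ... < w_K = 1 of the bins."""
    cum = np.concatenate(([0.0], np.cumsum(np.asarray(probs, dtype=float))))
    cum[-1] = 1.0  # guard against floating-point drift in the last value
    return cum


def histogram_quantile(edges, probs, t):
    """Quantile function of a histogram at level(s) t in [0, 1].

    Assumes a Uniform distribution within each sub-interval, so the
    quantile function is piecewise linear between the bin edges.
    Assumes strictly positive bin probabilities (illustrative restriction).
    """
    return np.interp(t, _cum_probs(probs), np.asarray(edges, dtype=float))


def wasserstein(h1, h2, p=2):
    """L1-Wasserstein (p=1) or Mallows / L2-Wasserstein (p=2) distance.

    Each histogram is a pair (edges, probs). The difference of the two
    quantile functions is linear on every segment between consecutive
    merged breakpoints, so both integrals are computed exactly.
    """
    (e1, p1), (e2, p2) = h1, h2
    t = np.union1d(_cum_probs(p1), _cum_probs(p2))  # merged breakpoints
    d = histogram_quantile(e1, p1, t) - histogram_quantile(e2, p2, t)
    w = np.diff(t)
    da, db = d[:-1], d[1:]
    if p == 2:
        # Integral of a squared linear function over each segment.
        return float(np.sqrt(np.sum(w * (da**2 + da * db + db**2) / 3.0)))
    if p == 1:
        denom = np.abs(da) + np.abs(db)
        safe = np.where(denom > 0, denom, 1.0)  # avoid division by zero
        no_cross = w * denom / 2.0                   # trapezoid area
        cross = w * (da**2 + db**2) / (2.0 * safe)   # two triangles
        return float(np.sum(np.where(da * db >= 0, no_cross, cross)))
    raise ValueError("p must be 1 or 2")


# Two single-bin histograms: Uniform on [0, 1] and Uniform on [1, 2].
h_a = ([0.0, 1.0], [1.0])
h_b = ([1.0, 2.0], [1.0])
print(wasserstein(h_a, h_b, p=1), wasserstein(h_a, h_b, p=2))
```

In this usage example the second histogram is the first one shifted by one unit, so both distances equal the size of the shift.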
An extensive simulation study, considering data with different distributions and different outlier proportions and severity, shows that the proposed approach is efficient in detecting atypical observations, even when they are close to the regular ones. The method is further applied to two real datasets, one on flight data and one on Austrian meteorological stations, where it identifies atypical cases.
Keywords: Distributional data, Outlyingness measure, Symbolic Data Analysis
References
- [1] Hubert, M., Rousseeuw, P. J., and Segaert, P. (2015). Multivariate functional outlier detection. Statistical Methods & Applications, 24(2):177–202.
- [2] Dias, S. and Brito, P. (2015). Linear regression model with histogram-valued variables. Statistical Analysis and Data Mining: The ASA Data Science Journal, 8(2):75–113.