Outlier Detection in Histogram-valued Data

Ana Martins (a), Sónia Dias (b), Paula Brito (c), and Peter Filzmoser (d)

(a) Institute of Electronics and Informatics Engineering of Aveiro, Aveiro, Portugal
(b) Instituto Politécnico de Viana do Castelo & LIAAD-INESC TEC, Portugal
(c) Fac. Economia, Universidade do Porto & LIAAD-INESC TEC, Porto, Portugal
(d) Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Vienna, Austria

We introduce a novel method for multivariate outlier detection in histogram-valued data. The proposed method is based on Donoho's outlyingness measure and is inspired by the approach of [1] for functional data, where observations are projected onto a one-dimensional space. In our case, this projection uses the linear combination proposed in [2], which is based on representing empirical distributions by their quantile functions. Assuming a Uniform distribution within each sub-interval of the observed histograms, these quantile functions are piecewise linear. The proposed outlyingness measure may rely on either the L1-Wasserstein or the Mallows (L2-Wasserstein) distance to compare distributions.
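As a rough illustration of the ingredients named above, the following Python sketch builds the piecewise linear quantile function of a histogram under the within-bin Uniform assumption and compares two histograms with the L1-Wasserstein and Mallows (L2-Wasserstein) distances. It is not the authors' implementation: the function names `histogram_quantile` and `wasserstein`, the `(bounds, probs)` representation of a histogram, and the midpoint-rule approximation of the integral are all assumptions made for this example, and every bin is assumed to have strictly positive weight.

```python
import numpy as np

def histogram_quantile(bounds, probs, t):
    """Quantile function of a histogram evaluated at levels t in [0, 1].

    bounds: the k+1 increasing sub-interval limits; probs: the k bin weights
    (assumed strictly positive). With a Uniform distribution inside each
    sub-interval, the quantile function is linear between the cumulative weights.
    """
    bounds = np.asarray(bounds, dtype=float)
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()                      # normalise the weights
    cum = np.concatenate(([0.0], np.cumsum(probs)))  # cumulative weights
    cum[-1] = 1.0                                    # guard against rounding
    return np.interp(t, cum, bounds)

def wasserstein(h1, h2, p=1, n_grid=10_000):
    """Approximate Lp-Wasserstein distance between two histograms.

    p=1 gives the L1-Wasserstein distance, p=2 the Mallows distance.
    The integral over (0, 1) of |Q1(t) - Q2(t)|^p is approximated with a
    midpoint rule on n_grid points (an illustrative choice, not an exact formula).
    """
    t = (np.arange(n_grid) + 0.5) / n_grid
    diff = np.abs(histogram_quantile(*h1, t) - histogram_quantile(*h2, t))
    return float(np.mean(diff ** p) ** (1.0 / p))

# Two hypothetical histograms, each given as (bounds, probs)
h_a = ([0.0, 1.0, 2.0], [0.5, 0.5])
h_b = ([2.0, 3.0, 4.0], [0.5, 0.5])   # h_a shifted by 2

d1 = wasserstein(h_a, h_b, p=1)
d2 = wasserstein(h_a, h_b, p=2)
```

Because `h_b` is `h_a` shifted by 2, the two quantile functions differ by a constant, so both distances equal the shift up to floating-point error. For piecewise linear quantile functions the integrals can also be computed in closed form, which would avoid the grid approximation used here.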

An extensive simulation study, considering data with different distributions and different outlier proportions and severity, shows that the proposed approach is effective at detecting atypical observations, even when they are close to the regular ones. The method is further applied to two real datasets, one on flight data and one on Austrian meteorological stations, where it identifies atypical cases.

Keywords: Distributional data, Outlyingness measure, Symbolic Data Analysis

References

  • [1] Hubert, M., Rousseeuw, P. J., and Segaert, P. (2015). Multivariate functional outlier detection. Statistical Methods & Applications, 24(2):177–202.
  • [2] Dias, S. and Brito, P. (2015). Linear regression model with histogram-valued variables. Statistical Analysis and Data Mining: The ASA Data Science Journal, 8(2):75–113.