A novel framework for quantifying nominal outlyingness
Department of Mathematics, Imperial College London, London, United Kingdom
efthymios.costa17@imperial.ac.uk, i.papatsouma@imperial.ac.uk
Keywords: Outlier detection – Nominal data – Association rule mining
Abstract
Outlier detection is an important data mining tool that becomes particularly challenging when dealing with nominal (that is, unordered categorical) data. Several approaches to outlier identification for nominal variables have been proposed. Many of these methods adapt concepts from the continuous setting to account for the fixed number of categorical levels. For instance, proximity-based methods use distances between nominal responses [see for example Bay and Schwabacher, 2003, Li et al., 2007], but these distances often lack a meaningful interpretation. Model-based approaches use log-linear models to robustly infer the expected proportions of counts in a contingency table [for instance Kuhnt et al., 2014, Calvino et al., 2021]. However, these methods can become impractical when the number of nominal variables is very large and a high-dimensional contingency table must be analysed. The sparsity of the resulting contingency tables also increases with the dimensionality and needs to be accounted for.
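To illustrate the sparsity issue, the following minimal sketch counts the fraction of contingency-table cells that contain at least one observation as the number of nominal variables grows. It assumes independent, uniformly distributed levels, which is an illustrative choice and not a setting taken from this work; the function name `occupied_fraction` and all parameter values are hypothetical.

```python
import random


def occupied_fraction(n_obs, n_vars, n_levels, seed=0):
    """Fraction of contingency-table cells hit by at least one observation.

    Assumption: every variable has `n_levels` levels drawn independently
    and uniformly at random (purely illustrative data).
    """
    rng = random.Random(seed)
    # Each observation is a tuple of levels, i.e. one cell of the table.
    cells = {
        tuple(rng.randrange(n_levels) for _ in range(n_vars))
        for _ in range(n_obs)
    }
    # The full table has n_levels ** n_vars cells.
    return len(cells) / n_levels ** n_vars


# 500 observations, 3 levels per variable, increasing number of variables.
fractions = [occupied_fraction(500, p, 3) for p in (2, 4, 8, 12)]
for p, frac in zip((2, 4, 8, 12), fractions):
    print(f"{p:2d} variables: {frac:.6f} of cells occupied")
```

With 12 variables the table has 3^12 = 531441 cells, so 500 observations can occupy at most about 0.1% of them, whatever the data-generating process.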
We alleviate these commonly encountered issues by introducing a flexible framework that can be used to indicate the extent to which a sequence of nominal levels is outlying. We start by defining a notion of nominal outlyingness and then formulate a score of nominal outlyingness based on it. The proposed framework makes use of ideas from the association rule mining literature [Agrawal et al., 1993], extending the work of Koufakou and Georgiopoulos [2010]. Methods for determining the values of the hyperparameters involved are presented, and the concepts of variable contributions and nominal outlyingness depth are introduced to enhance the interpretability of the results. We apply the proposed framework to synthetic data and to data sets from the fields of physics and healthcare. Our proposal demonstrates comparable performance to state-of-the-art frequent pattern mining algorithms, even outperforming them in certain cases. The ideas presented can serve as a tool for assessing the degree to which an observation differs from the rest of the data, under certain assumptions regarding the generating process of sequences of nominal levels.
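As a rough illustration of how association rule mining ideas can feed an outlyingness score, the sketch below computes the support (number of occurrences) of every itemset up to a maximum length and penalises observations that contain infrequent itemsets. This is a simplified infrequent-itemset score in the spirit of the frequent pattern mining literature, not the score proposed in this work; the function names, the weighting 1 / (support × length), and the values of `min_support` and `max_len` are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def itemset_supports(rows, max_len):
    """Count how many rows contain each itemset of length 1..max_len.

    An item is a (variable index, level) pair, so equal level names in
    different variables are never confused.
    """
    counts = Counter()
    for row in rows:
        items = list(enumerate(row))
        for k in range(1, max_len + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    return counts


def infrequent_itemset_scores(rows, min_support, max_len=2):
    """Score each row by the infrequent itemsets it contains.

    Assumption: an itemset with support below `min_support` adds
    1 / (support * length) to the score, so rarer and shorter itemsets
    weigh more. Rows containing only frequent itemsets score 0.
    """
    counts = itemset_supports(rows, max_len)
    scores = []
    for row in rows:
        items = list(enumerate(row))
        score = 0.0
        for k in range(1, max_len + 1):
            for itemset in combinations(items, k):
                support = counts[itemset]
                if support < min_support:
                    score += 1.0 / (support * k)
        scores.append(score)
    return scores


# Toy data: two common level combinations and one rare combination.
rows = [("a", "x")] * 4 + [("b", "y")] * 4 + [("a", "y")]
scores = infrequent_itemset_scores(rows, min_support=2, max_len=2)
print(scores)
```

In this toy data set every single level is frequent, but the pair ("a", "y") occurs only once, so only the last observation receives a non-zero score. This shows why itemsets of length greater than one matter: an observation can be outlying through an unusual combination of individually common levels.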
Code availability
The code to replicate the experiments presented can be found at https://github.com/EfthymiosCosta/SONO.
References
- Agrawal et al. [1993] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, 1993.
- Bay and Schwabacher [2003] Stephen D Bay and Mark Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38, 2003.
- Calvino et al. [2021] Aida Calvino, Nirian Martin, and Leandro Pardo. Robustness of minimum density power divergence estimators and Wald-type test statistics in loglinear models with multinomial sampling. Journal of Computational and Applied Mathematics, 386:113214, 2021.
- Koufakou and Georgiopoulos [2010] Anna Koufakou and Michael Georgiopoulos. A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Mining and Knowledge Discovery, 20(2):259–289, 2010.
- Kuhnt et al. [2014] Sonja Kuhnt, Fabio Rapallo, and André Rehage. Outlier detection in contingency tables based on minimal patterns. Statistics and Computing, 24:481–491, 2014.
- Li et al. [2007] Shuxin Li, Robert Lee, and Sheau-Dong Lang. Mining distance-based outliers from categorical data. In Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), pages 225–230. IEEE, 2007.