Robust Anomaly Detection via Order Statistical Methods

J. Woody${}^{1}$, K. Boateng${}^{1}$, and Q. Lu${}^{2}$

${}^{1}$ Department of Mathematics and Statistics, Mississippi State University, Mississippi State, Mississippi, USA [jwoody@math.msstate.edu, kab1167@msstate.edu]

${}^{2}$ Department of Statistical Sciences and Operations Research, Virginia Commonwealth University, Richmond, Virginia, USA [qlu2@vcu.edu]

Keywords: Anomaly Detection – Forensic Statistics – Robust Statistics

1 Motivation

This talk is motivated by recent results in forensic finance and forensic statistics regarding the Paycheck Protection Program (PPP), which was part of the United States’ response to the COVID-19 pandemic. Recent studies [Woody et al., 2024, Griffin et al., 2023] show that clustering in digital transaction amounts is linked to misreporting and fraud in various governmental programs. This study develops a rigorous methodology for investigating clustering patterns in digital transactions in complex data environments. The methodology is robust to departures from any assumed statistical distribution for the transaction amounts.

2 Methods

Let $\{X_{i}\}_{i=1}^{n}$ denote a sample of $n$ transaction amounts housed by a governmental institution. We assume that the transactions follow a gamma distribution with density given by

$$f(x;\alpha,\beta)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}e^{-\beta x}\quad\text{for }x>0 \tag{1}$$

and $\alpha,\beta>0$. Let $F(x;\alpha,\beta)$ denote the gamma cumulative distribution function (CDF) with parameters $\alpha$ and $\beta$. Then define

$$U_{i}=\hat{F}(X_{i}),\quad\text{for }i=1,2,\ldots,n. \tag{2}$$

Here $\hat{F}(x)=F(x;\hat{\alpha},\hat{\beta})$, where $\hat{\alpha}$ and $\hat{\beta}$ are the maximum likelihood estimates of $\alpha$ and $\beta$. The distributional assumption will never match reality in all cases, however. The CDF transform in (2) is applied to the data to obtain “uniform-like” random variables that are nearly independent and nearly identically distributed. The phrase uniform-like acknowledges that the gamma density does not fully capture the distribution of the transactions; however, as we will show, the ensuing methodology is robust to faulty model assumptions in the gamma fit.
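As a minimal sketch of the transform in (2), the snippet below fits the two-parameter gamma density (1) by maximum likelihood and evaluates the fitted CDF at each observation. The function name `gamma_pit` and the synthetic data are assumptions made for illustration, not part of the original work. SciPy parameterizes the gamma by shape and scale, so the rate is recovered as $\beta = 1/\text{scale}$.

```python
import numpy as np
from scipy import stats

def gamma_pit(x):
    """Fit a gamma by maximum likelihood; return U_i = F(X_i; alpha_hat, beta_hat)."""
    x = np.asarray(x, dtype=float)
    # floc=0 pins the location at zero, matching the two-parameter density (1).
    shape_hat, _, scale_hat = stats.gamma.fit(x, floc=0)
    rate_hat = 1.0 / scale_hat  # SciPy's scale is 1 / beta
    u = stats.gamma.cdf(x, a=shape_hat, scale=scale_hat)
    return u, shape_hat, rate_hat

# Synthetic amounts (assumption: alpha = 2, beta = 0.5); not real transaction data.
rng = np.random.default_rng(0)
amounts = rng.gamma(shape=2.0, scale=1 / 0.5, size=2000)
u, alpha_hat, beta_hat = gamma_pit(amounts)
print(f"alpha_hat={alpha_hat:.3f}, beta_hat={beta_hat:.3f}")
```

When the gamma model holds, the resulting `u` values should look approximately uniform on $(0,1)$; when it does not, they are only uniform-like in the sense described above.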

The novel mechanism to “filter” uniform-like random variables into clustered and non-clustered transaction data uses a familiar result from order statistics. Let $\{W_{i}\}_{i=1}^{n}$ be an i.i.d. sequence of uniformly distributed random variables on $(0,1)$. Let $W_{(i)}$ be the $i$th order statistic, whereby

$$0<W_{(1)}<W_{(2)}<\cdots<W_{(n)}<1,$$

and we set $W_{(n+1)}=1$. The inequalities are strict with probability 1. A classic result dating back to Malmquist [1950] states that, for $i=1,2,\ldots,n$, the random variables

$$V_{i}=\left(\frac{W_{(i)}}{W_{(i+1)}}\right)^{i} \tag{3}$$

are i.i.d., with each $V_{i}$ uniformly distributed on $(0,1)$. The sequence $\{V_{i}\}$ drives the methodology: we replace the uniform order statistics $W_{(i)}$ in (3) with the order statistics $U_{(i)}$ of the uniform-like $U_{i}$s from (2).
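The transform in (3) can be sketched in a few lines. The function name `malmquist_transform` is hypothetical, and the input is assumed to be a one-dimensional array of values in $(0,1)$ without ties.

```python
import numpy as np

def malmquist_transform(u):
    """Return V_i = (U_(i) / U_(i+1))**i for i = 1..n, with U_(n+1) = 1."""
    u_sorted = np.sort(np.asarray(u, dtype=float))  # U_(1) <= ... <= U_(n)
    upper = np.append(u_sorted[1:], 1.0)            # U_(2), ..., U_(n), U_(n+1) = 1
    i = np.arange(1, u_sorted.size + 1)             # exponents 1, ..., n
    return (u_sorted / upper) ** i

# Sanity check on genuinely uniform input: the V_i should again look uniform.
rng = np.random.default_rng(0)
v_uniform = malmquist_transform(rng.uniform(size=5000))
print(f"mean of V_i: {v_uniform.mean():.3f}")
```

For a hand check, the input `[0.5, 0.25]` has order statistics $0.25 < 0.5$, so $V_1 = (0.25/0.5)^1 = 0.5$ and $V_2 = (0.5/1)^2 = 0.25$.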

The idea is as follows: whenever alien clustering or atypical concentrations are present, successive $U_{(i)}$s will be much closer together than expected, so the ratios $U_{(i)}/U_{(i+1)}$, and hence the associated $V_{i}$s, will be close to unity. The task is then to formalize a statistical test of whether too many $V_{i}$s are too close to unity. Interestingly, the $V_{i}$s are surprisingly robust to departures of the $U_{i}$s from a “uniform-like” distribution.
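The effect described above can be illustrated with a small simulation. This is a sketch under stated assumptions: it works directly on the uniform scale (skipping the gamma fit), the contamination scheme of 100 values packed near 0.4 is invented for illustration, and the cutoff 0.99 is an arbitrary stand-in for "close to unity" rather than a threshold from the original work.

```python
import numpy as np

def malmquist_transform(u):
    """V_i = (U_(i) / U_(i+1))**i with U_(n+1) = 1, as in (3)."""
    u_sorted = np.sort(np.asarray(u, dtype=float))
    upper = np.append(u_sorted[1:], 1.0)
    i = np.arange(1, u_sorted.size + 1)
    return (u_sorted / upper) ** i

rng = np.random.default_rng(1)
clean = rng.uniform(size=2000)
# Hypothetical contamination: 100 values inside a window of width 1e-6 near 0.4.
cluster = 0.4 + 1e-6 * rng.uniform(size=100)
contaminated = np.concatenate([clean, cluster])

cutoff = 0.99  # illustrative threshold only
frac_clean = np.mean(malmquist_transform(clean) > cutoff)
frac_contaminated = np.mean(malmquist_transform(contaminated) > cutoff)
print(f"share of V_i above {cutoff}: "
      f"clean={frac_clean:.3f}, contaminated={frac_contaminated:.3f}")
```

In the clean sample roughly 1% of the $V_{i}$s are expected above 0.99, while the tightly packed values push nearly all of their own ratios to within a hair of 1, inflating that share.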

A segmented density is fit to the $V_{i}$s:

$$f(v;\theta_{1},\theta_{2})=\begin{cases}\dfrac{\theta_{2}}{\theta_{1}},&v\in(0,\theta_{1}),\\[2ex] \dfrac{1-\theta_{2}}{1-\theta_{1}},&v\in[\theta_{1},1),\end{cases} \tag{4}$$

where $v\in(0,1)$ and the parameters satisfy $\theta_{1}\in(0,1)$ and $\theta_{2}\in(0,1)$. Here $\theta_{1}$ is the breakpoint and $\theta_{2}$ is the probability mass that (4) assigns to $(0,\theta_{1})$. Note that when $\theta_{1}=\theta_{2}$, (4) reduces to the density of a uniformly distributed random variable on $(0,1)$. Novel statistical theory and practical results are presented.
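The abstract does not say how (4) is fit, so the following is only one plausible sketch: a profile-likelihood grid search, which is an assumption and not necessarily the authors' procedure. For a fixed breakpoint $\theta_{1}$ with $n_{1}$ observations below it, the log-likelihood $n_{1}\log(\theta_{2}/\theta_{1})+(n-n_{1})\log\{(1-\theta_{2})/(1-\theta_{1})\}$ is maximized at $\hat{\theta}_{2}=n_{1}/n$. The names `segmented_pdf` and `fit_segmented`, the default grid, and the synthetic data are all hypothetical.

```python
import numpy as np

def segmented_pdf(v, theta1, theta2):
    """Two-piece density (4): theta2/theta1 on (0, theta1), else (1-theta2)/(1-theta1)."""
    v = np.asarray(v, dtype=float)
    return np.where(v < theta1, theta2 / theta1, (1 - theta2) / (1 - theta1))

def fit_segmented(v, grid=None):
    """Grid search over theta1; for each theta1 the MLE of theta2 is n1 / n."""
    v = np.asarray(v, dtype=float)
    n = v.size
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)  # assumed candidate breakpoints
    best = (-np.inf, None, None)
    for theta1 in grid:
        n1 = np.count_nonzero(v < theta1)
        if n1 == 0 or n1 == n:
            continue  # theta2 would sit on the boundary of (0, 1)
        theta2 = n1 / n
        loglik = (n1 * np.log(theta2 / theta1)
                  + (n - n1) * np.log((1 - theta2) / (1 - theta1)))
        if loglik > best[0]:
            best = (loglik, float(theta1), theta2)
    return best  # (log-likelihood, theta1_hat, theta2_hat)

# Synthetic V_i: 90% uniform on (0, 1) plus 10% extra mass on (0.95, 1).
# This corresponds to theta1 = 0.95 and theta2 = 0.9 * 0.95 = 0.855 in (4).
rng = np.random.default_rng(2)
n = 5000
is_cluster = rng.uniform(size=n) < 0.10
v = np.where(is_cluster, rng.uniform(0.95, 1.0, size=n), rng.uniform(size=n))
loglik_hat, theta1_hat, theta2_hat = fit_segmented(v)
print(f"theta1_hat={theta1_hat:.2f}, theta2_hat={theta2_hat:.3f}")
```

Because the uniform density has log-likelihood zero, the maximized value `loglik_hat` is itself the log-likelihood ratio against the uniform case $\theta_{1}=\theta_{2}$. Since $\theta_{1}$ is searched over a grid, the reference distribution of this statistic is not the standard chi-square one, and calibrating it is outside the scope of this sketch.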

References

J. M. Griffin, S. Kruger, and P. Mahajan (2023) Did fintech lenders facilitate PPP fraud? The Journal of Finance 78 (3), pp. 1777–1827.
S. Malmquist (1950) On a property of order statistics from a rectangular distribution. Scandinavian Actuarial Journal 1950 (3-4), pp. 214–222.
J. Woody, Z. Zhao, R. Lund, and T. Wu (2024) A forensic statistical analysis of fraud in the federal food stamp program. The Annals of Applied Statistics 18 (3), pp. 2486–2510.