Robust Anomaly Detection via Order Statistical Methods

J. Woody¹, K. Boateng¹ and Q. Lu²

¹ Department of Mathematics and Statistics, Mississippi State University, Mississippi State, Mississippi, USA [jwoody@math.msstate.edu, kab1167@msstate.edu]

² Department of Statistical Sciences and Operations Research, Virginia Commonwealth University, Richmond, Virginia, USA [qlu2@vcu.edu]

Keywords: Anomaly Detection – Forensic Statistics – Robust Statistics

1 Motivation

This talk is motivated by recent results in forensic finance and forensic statistics regarding the Paycheck Protection Program (PPP), which was part of the United States’ response to the COVID-19 pandemic. Recent studies [Woody et al., 2024, Griffin et al., 2023] show that clustering in digital transaction amounts is linked to misreporting and fraud in various governmental programs. This study develops a rigorous methodology to investigate clustering patterns observed in digital transactions in complex data environments. The methodology is robust to departures from any assumed statistical distribution for digital transaction amounts.

2 Methods

Let $\{X_i\}$ for $i = 1, 2, \ldots, n$ denote a sample of $n$ transaction amounts housed by a governmental institution. We assume that the transactions follow a Gamma distribution with density given by

$$f(x;\alpha,\beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x} \quad \text{for } x > 0 \tag{1}$$

and $\alpha, \beta > 0$. The term $F(x;\alpha,\beta)$ will represent the gamma cumulative distribution function (CDF) having parameters $\alpha$ and $\beta$. Then define

$$U_i = \hat{F}(X_i), \quad \text{for } i = 1, 2, \ldots, n, \tag{2}$$

where $\hat{F}(x) = F(x;\hat{\alpha},\hat{\beta})$ and $\hat{\alpha}$ and $\hat{\beta}$ are obtained via maximum likelihood estimation (MLE). However, the distributional assumptions will never accommodate reality in all cases. The CDF transform in (2) maps the data to “uniform-like” random variables that are nearly independent and nearly identically distributed. The phrase uniform-like acknowledges that the gamma density does not capture the distribution of the transactions; however, as we will show, the gamma fits are robust to faulty model assumptions in the ensuing methodology.
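The abstract does not say how the maximum likelihood estimates are computed. As a reminder only, and assuming the rate parameterization of the gamma density in (1), the standard likelihood equations are given below, where $\bar{X}$ is the sample mean and $\psi$ is the digamma function (both symbols are introduced here for illustration). The first equation has no closed-form solution and is typically solved numerically.

```latex
\ell(\alpha,\beta)
  = n\alpha\log\beta - n\log\Gamma(\alpha)
    + (\alpha-1)\sum_{i=1}^{n}\log X_i - \beta\sum_{i=1}^{n}X_i ,
\qquad
\log\hat{\alpha} - \psi(\hat{\alpha})
  = \log\bar{X} - \frac{1}{n}\sum_{i=1}^{n}\log X_i ,
\qquad
\hat{\beta} = \frac{\hat{\alpha}}{\bar{X}} .
```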

The novel mechanism to “filter” uniform-like random variables into clustered and non-clustered transaction data utilizes a familiar result from order statistics. Let $\{W_i\}_{i=1}^{n}$ be an i.i.d. sequence of uniformly distributed random variables on $(0,1)$. Let $W_{(i)}$ be the $i$th order statistic, whereby

$$0 < W_{(1)} < W_{(2)} < \cdots < W_{(n)} < 1,$$

and we set $W_{(n+1)} = 1$. The inequalities are strict with probability 1. A classic result dating back to Malmquist [1950] states that, for $i = 1, 2, \ldots, n$,

$$V_i = \left(\frac{W_{(i)}}{W_{(i+1)}}\right)^{i} \tag{3}$$

are i.i.d., with each $V_i$ uniformly distributed on $(0,1)$. The sequence $\{V_i\}$ drives the methodology: we replace the uniform $W_i$'s in (3) with the uniform-like $U_i$'s from (2).
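The transform in (3) can be checked by simulation. The following is a minimal sketch, not the authors' implementation: the function name `malmquist_transform`, the seed, and the sample size are all assumptions made for illustration. It applies (3) to simulated uniforms and computes two summaries that should both sit near 0.5 if the transformed values behave like uniforms on (0,1).

```python
import random

def malmquist_transform(u):
    """Map values in (0, 1) to V_i = (U_(i) / U_(i+1))**i, with U_(n+1) = 1."""
    ordered = sorted(u) + [1.0]   # append the convention U_(n+1) = 1
    n = len(u)
    # i runs over 1..n as in the text; Python index i - 1 holds U_(i).
    return [(ordered[i - 1] / ordered[i]) ** i for i in range(1, n + 1)]

rng = random.Random(0)                      # fixed seed, chosen arbitrarily
w = [rng.random() for _ in range(5000)]     # simulated i.i.d. uniforms
v = malmquist_transform(w)

mean_v = sum(v) / len(v)                                # near 0.5 under uniformity
frac_below_half = sum(x < 0.5 for x in v) / len(v)      # near 0.5 under uniformity
print(mean_v, frac_below_half)
```

In practice the input would be the uniform-like values from (2) rather than simulated uniforms.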

The idea is as follows: whenever alien clustering or atypical concentrations are present, successive $U_{(i)}$'s will be much closer together than expected. Consequently, the associated $V_i$'s will be close to unity. The task is then to formalize an approach to statistically test whether too many $V_i$'s are too close to unity. The $V_i$'s are surprisingly robust to departures of the $U_i$'s from a “uniform-like” distribution.
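The effect described above can be illustrated with a small simulation. This is a sketch under stated assumptions: the cluster location (0.4), its width, the contamination share (10%), and the 0.95 cutoff are all hypothetical choices, and the share of values above the cutoff is only a descriptive summary, not the formal test the talk develops.

```python
import random

def malmquist_transform(u):
    """V_i = (U_(i) / U_(i+1))**i with the convention U_(n+1) = 1."""
    ordered = sorted(u) + [1.0]
    return [(ordered[i - 1] / ordered[i]) ** i for i in range(1, len(u) + 1)]

def frac_near_one(v, cutoff=0.95):
    """Share of transformed values above a (hypothetical) cutoff."""
    return sum(x > cutoff for x in v) / len(v)

rng = random.Random(1)
n = 1000
clean = [rng.random() for _ in range(n)]
# Replace 100 of the draws with a tight, hypothetical cluster just above 0.4.
contaminated = clean[:900] + [0.4 + 1e-4 * rng.random() for _ in range(100)]

frac_clean = frac_near_one(malmquist_transform(clean))
frac_contaminated = frac_near_one(malmquist_transform(contaminated))
print(frac_clean, frac_contaminated)
```

Without clustering, about 5% of the transformed values are expected above 0.95. The tightly spaced cluster points produce ratios very close to one, so the contaminated sample should show a visibly larger share.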

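A segmented density is fit to the $V_i$'s:

$$f(v;\theta_1,\theta_2) = \begin{cases} \dfrac{\theta_2}{\theta_1}, & v \in (0,\theta_1), \\[4pt] \dfrac{1-\theta_2}{1-\theta_1}, & v \in [\theta_1,1), \end{cases} \tag{4}$$

where $v \in (0,1)$ and the parameters satisfy $\theta_1 \in (0,1)$ and $\theta_2 \in (0,1)$. Note that when $\theta_1 = \theta_2$, (4) reduces to the density of a uniformly distributed random variable on $(0,1)$. Novel statistical theory and practical results are presented.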
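One simple way to fit the segmented density in (4) is a grid-search profile likelihood. The abstract does not state the estimation procedure, so this is an assumption, not the authors' method. For a fixed $\theta_1$, the likelihood is maximized at $\theta_2 = k/n$, where $k$ is the number of observations below $\theta_1$. The sketch then scans $\theta_1$ over a grid. The function name `fit_segmented`, the grid, and the simulated mixture are hypothetical.

```python
import math
import random

def fit_segmented(v, grid_size=99):
    """Grid-search profile MLE for the two-piece density on (0, 1).

    Returns (max log-likelihood, theta1_hat, theta2_hat).
    """
    n = len(v)
    best = (-math.inf, None, None)
    for j in range(1, grid_size + 1):
        theta1 = j / (grid_size + 1)
        k = sum(x < theta1 for x in v)      # observations falling in (0, theta1)
        if k == 0 or k == n:
            continue                        # theta2 would sit on the boundary
        theta2 = k / n                      # maximizer of theta2 for this theta1
        loglik = (k * math.log(theta2 / theta1)
                  + (n - k) * math.log((1 - theta2) / (1 - theta1)))
        if loglik > best[0]:
            best = (loglik, theta1, theta2)
    return best

rng = random.Random(2)
# Hypothetical V-like sample: 80% uniform on (0, 1), 20% piled up in [0.9, 1).
v = [rng.random() if rng.random() < 0.8 else 0.9 + 0.1 * rng.random()
     for _ in range(5000)]
loglik, theta1_hat, theta2_hat = fit_segmented(v)
print(theta1_hat, theta2_hat, loglik)
```

The simulated mixture corresponds to $\theta_1 = 0.9$ and $\theta_2 = 0.72$ in (4), so the estimates should land near those values. Because the uniform density has log-likelihood zero, the returned maximum is also the log-likelihood ratio against uniformity. Its null distribution is not addressed in this sketch.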

References

  • J. M. Griffin, S. Kruger, and P. Mahajan (2023) Did FinTech lenders facilitate PPP fraud? The Journal of Finance 78 (3), pp. 1777–1827.
  • S. Malmquist (1950) On a property of order statistics from a rectangular distribution. Scandinavian Actuarial Journal 1950 (3-4), pp. 214–222.
  • J. Woody, Z. Zhao, R. Lund, and T. Wu (2024) A forensic statistical analysis of fraud in the federal food stamp program. The Annals of Applied Statistics 18 (3), pp. 2486–2510.