Asymptotic properties of Impulse Indicator Saturation under outlier contamination

Otso Hao¹

¹ Department of Economics and Nuffield College, University of Oxford, OX1 1NF Oxford, UK. [otso.hao@economics.ox.ac.uk]

Keywords: Outlier detection; robust estimation and inference; linear regression; time series.

1 Extended abstract

Impulse indicator saturation (IIS) (Santos, Hendry, and Johansen 2008) is an outlier-robust algorithm for estimating linear regression models. A simple variant of the algorithm begins by splitting the sample into two halves: one assumed to be free of outliers, and the other potentially containing them. The clean half is used to compute initial least squares estimates and scaled residuals. Observations whose scaled residuals exceed a predefined cut-off are identified as ‘outliers’, while the remaining observations are classified as ‘good’. The IIS estimator is then the least squares estimator on the selected set of ‘good’ observations. The properties of the algorithm are controlled by the cut-off tuning parameter, which must be chosen by the user. The IIS algorithm selects the number of outliers and their locations simultaneously. This contrasts with other popular robust regression methods, such as the Least Trimmed Squares estimator (Rousseeuw 1984), for which the number of outliers needs to be estimated separately. IIS has also been embedded within a larger algorithm that performs joint outlier and variable selection (Hendry and Doornik 2014).
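The split-half variant described above can be sketched in a few lines. The sketch below is illustrative only: the choice of the first half as the clean half, the residual scaling by the clean-half residual standard deviation, and the cut-off value `c` are assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def split_half_iis(X, y, c):
    """Illustrative sketch of split-half impulse indicator saturation.

    The first half of the sample is assumed outlier-free; observations in
    the second half whose scaled residuals exceed the cut-off `c` are
    flagged as outliers and dropped before the final least squares fit.
    """
    n = len(y)
    half = n // 2
    # Step 1: initial OLS fit and residual scale from the 'clean' first half.
    beta0, *_ = np.linalg.lstsq(X[:half], y[:half], rcond=None)
    resid = y - X @ beta0
    sigma0 = np.sqrt(np.mean(resid[:half] ** 2))
    # Step 2: flag observations with large scaled residuals as 'outliers'.
    outlier = np.abs(resid) / sigma0 > c
    outlier[:half] = False  # the clean half is retained by assumption
    # Step 3: final OLS on the retained set of 'good' observations.
    keep = ~outlier
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return beta, outlier

# Usage: one planted outlier in the contaminated half.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
y[150] += 20.0  # plant a large outlier in the second half
beta, outlier = split_half_iis(X, y, c=3.0)
```

Because the cut-off `c` governs both how many observations are flagged and where, a single tuning parameter controls the entire selection, in line with the contrast drawn with Least Trimmed Squares above.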

In the existing literature, the theoretical properties of IIS have only been studied under a null hypothesis of no outlier contamination. Under such a null, Johansen and Nielsen (2016) show asymptotic normality of IIS estimators. They also derive a limiting distribution for its False Outlier Discovery Rate (FODR): the proportion of ‘good’ observations falsely classified as outliers. The aim of this paper is to understand the properties of IIS in data generating processes that include outliers.

I begin with an asymptotic representation result for IIS under outlier contamination. In the paper, this result is stated for a general class of Huber-skip estimators (Huber 1964), which includes IIS as a special case. I show that, under the stated conditions, an IIS estimator has the same asymptotic distribution as a least squares estimator computed with all the outliers perfectly removed. Asymptotic inference with IIS can therefore proceed along the lines of standard least squares theory and does not depend on nuisance parameters. The representation result allows for cross-sectional data as well as stationary and non-stationary time series. For this result, I consider an increasing sequence of IIS cut-offs. A key assumption is that the magnitude of the outliers, measured by their distance from the ‘true’ linear fit, grows at a rate similar to the cut-off. Beyond this requirement, I place little structure on the outliers, allowing for asymmetric contamination as well as growing degrees of ‘bad leverage’.

To guide the choice of the IIS cut-off tuning parameter, I derive a distribution theory for the FODR of IIS in data generating processes with outliers. For a growing sequence of cut-offs, I obtain Gaussian and Poisson approximations to the FODR (the ‘gauge’). Furthermore, I suggest that specification tests on the set of observations retained by IIS could be used to guide the choice of cut-off. For example, I show that a standard cumulant-based normality test has its usual chi-squared limiting distribution under the null hypothesis.

Simulations are provided to assess finite-sample performance, and an empirical illustration using macroeconomic time series data is included. My simulations suggest that IIS performs best when the outliers are clearly separated from the tails of the ‘good’ observations. When there is little separation, or there are ‘near-outliers’, larger sample sizes are needed for the asymptotic approximation to be accurate.

The focus of this paper is on the split-half variant of the IIS algorithm. The basic version of the split-half algorithm, described above, takes the set of clean observations as known. I go beyond this and analyse an algorithm in which the clean half is not known a priori. These results build towards the analysis of IIS algorithms that search for outliers over multiple splits.
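A Jarque–Bera-type statistic is one example of a standard cumulant-based normality test that could be applied to the residuals of the retained observations; whether this matches the paper's exact test statistic is an assumption here. Under Gaussian errors, the statistic below is asymptotically chi-squared with two degrees of freedom.

```python
import numpy as np

def cumulant_normality_stat(resid):
    """Jarque-Bera-type normality test based on sample skewness and kurtosis.

    Under the null of Gaussian errors, the statistic is asymptotically
    chi-squared with 2 degrees of freedom.
    """
    n = len(resid)
    e = resid - resid.mean()
    s2 = np.mean(e ** 2)
    skew = np.mean(e ** 3) / s2 ** 1.5
    kurt = np.mean(e ** 4) / s2 ** 2
    return n * (skew ** 2 / 6 + (kurt - 3.0) ** 2 / 24)

# Usage: on Gaussian data the statistic should typically fall well below
# the chi-squared(2) critical values (5.99 at the 5% level).
rng = np.random.default_rng(0)
jb = cumulant_normality_stat(rng.normal(size=1000))
```

In the cut-off selection context, a large value of the statistic on the retained set would suggest that the cut-off is too loose (outliers remain) and could prompt a tighter choice.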

References

  • Hendry and Doornik [2014] D.F. Hendry and J.A. Doornik. Empirical Model Discovery and Theory Evaluation. The MIT Press, Cambridge, Massachusetts, 2014.
  • Huber [1964] Peter J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 35(1):73–101, 1964.
  • Johansen and Nielsen [2016] Søren Johansen and Bent Nielsen. Asymptotic theory of outlier detection algorithms for linear time series regression models. Scandinavian Journal of Statistics, 43(2):321–348, 2016.
  • Rousseeuw [1984] Peter J. Rousseeuw. Least median of squares regression. Journal of the American Statistical Association, 79(388):871–880, 1984.
  • Santos et al. [2008] Carlos Santos, David F. Hendry, and Søren Johansen. Automatic selection of indicators in a fully saturated regression. Computational Statistics, 23:317–335, 2008.