Simultaneous Feature Selection and Outlier Detection Using Mixed-Integer Programming Under Varying Data Structures

M. Cremona (1), L. Insolia (2), A. Kenney* (3)

(1) Department of Operations and Decision Systems, Université Laval, Quebec, Canada [marzia.cremona@fsa.ulaval.ca]

(2) Geneva School of Economics and Management, Université de Genève, Geneva, Switzerland [luca.insolia@unige.ch]

(3) Department of Statistics, University of California Irvine, USA [anamaria.kenney@uci.eu]

Keywords: mixed-integer programming, robust regression, feature selection, functional data analysis

Abstract

Contemporary sciences are increasingly complex but data rich. Obtaining knowledge and insights requires appropriate tools that focus on critical methodological goals, e.g., predictive performance, computational scalability, reproducibility/stability and, importantly, interpretability and practical relevance. Redundant features in a model directly harm those goals, producing unstable estimates and overfit predictions and hindering interpretation. Feature selection methods counter this by favoring sparse, interpretable models, but they tend to rely on simplifying assumptions that are often violated in practice. For instance, while large, publicly available databases provide unprecedented research opportunities, they are likely to contain noisy features, data contamination in the form of outliers (e.g., data entry errors or subsets with distribution shifts), and complex structured responses/features (e.g., longitudinal measurements).

Mixed-integer programming (MIP) approaches to exact best subset selection have been established as a competitive alternative to popular feature selection methods. The MIP framework for best subset selection allows direct control of the L0 norm of the vector of coefficient estimates (i.e., the size of the active set of features). We demonstrate that this flexible framework is similarly effective for outlier detection and can perform it simultaneously with feature selection. As a result, we can produce more interpretable solutions and stronger theoretical guarantees under weaker/fewer assumptions. Furthermore, we present multiple extensions that accommodate various data structures. For instance, a careful reformulation as a mixed-integer conic program allows for categorical responses. Moreover, in scenarios where features or responses have dependent structures (e.g., longitudinal measurements), we utilize a novel combination of tools from Functional Data Analysis (FDA) and the unique integer constraints to account for these dependencies and enforce key group structures in selection. Overall, our proposed framework can be tailored to the problem setting, allowing practitioners to control, possibly simultaneously, the feature sparsity level, the outlier treatment, and the level of smoothness used in their analysis.
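For readers unfamiliar with this class of models, one common way to write simultaneous feature selection and outlier detection as a MIP is sketched below. It is an illustrative formulation under stated assumptions, not necessarily the exact one used in this work: it assumes a linear model with response y (n observations) and design matrix X (p features), a mean-shift vector φ that absorbs outlying observations, binary indicators z and w, big-M bounds M_β and M_φ, and user-chosen budgets k_p (features) and k_n (outliers). All of these symbols are introduced here for illustration only.

```latex
% Illustrative sketch only: the symbols phi, z, w, M_beta, M_phi, k_p, k_n
% are assumptions introduced for this example.
\[
\begin{aligned}
\min_{\beta,\,\phi,\,z,\,w}\quad & \lVert y - X\beta - \phi \rVert_2^2 \\
\text{s.t.}\quad
 & -M_\beta\, z_j \le \beta_j \le M_\beta\, z_j, && j = 1,\dots,p, \\
 & -M_\phi\, w_i \le \phi_i \le M_\phi\, w_i,   && i = 1,\dots,n, \\
 & \textstyle\sum_{j=1}^{p} z_j \le k_p, \qquad \sum_{i=1}^{n} w_i \le k_n, \\
 & z \in \{0,1\}^p, \quad w \in \{0,1\}^n .
\end{aligned}
\]
```

In this sketch, z_j = 0 forces β_j = 0, so the first budget bounds the L0 norm of the coefficient vector, while w_i = 0 forces φ_i = 0, so at most k_n observations can have their residuals absorbed and be flagged as outliers. Both sparsity levels are controlled directly and jointly through the integer constraints.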