Robust model selection for logistic regression with an application to sport analytics

M. Castellani${}^{1}$, G.A. Díaz Rubio${}^{2}$, S. Giannerini${}^{3}$ and G. Goracci${}^{4}$

${}^{1}$ Department of Statistical Sciences, University of Bologna, Bologna, Italy [m.castellani@unibo.it]

${}^{2}$ Department of Statistical Sciences, University of Bologna, Bologna, Italy [geryandre.diazrubio2@unibo.it]

${}^{3}$ Department of Statistics and Economics, University of Udine, Udine, Italy [simone.giannerini@uniud.it]

${}^{4}$ Department of Economics and Management, Free University of Bozen, Bolzano, Italy [greta.goracci@unibz.it]

Keywords: Robust Information Criteria – Model Selection – Soccer

1 Background

When home wins in Italian soccer (calcio) are modeled with logistic regression, the interplay of minute-by-minute actions can produce high-leverage points that bias classical inference. Building upon Cantoni and Ronchetti [2001], and in the spirit of the nonparametric approach for generalized additive models of Wong et al. [2014], we propose a weighted logistic regression framework augmented by robust model selection criteria. We apply the generalized information criterion (GIC) of Konishi and Kitagawa [1996], modify its penalty weight, and compare it with both classical and robust methods based on Huber- and Mallows-type M-estimators. The goal is to study how an appropriate penalty term in information criteria supports robust inference on the coach, team, and referee factors that drive the probability of a home win in the Italian Serie A once in-game actions are weighted.

2 Methods

2.1 Dataset and weighting scheme

We compiled a dataset of 3040 Italian Serie A matches (2011–2019) from ESPN play-by-play commentary, encompassing more than 345 000 in-game actions. Each match is represented by interval-specific counts (minutes 15, 30, 45, 60, 75, and 89) reflecting coach, team, and referee actions. These counts are weighted by the intensity (or frequency) of occurrences and then interacted with the match status (loss/tie/win at each interval). The binary outcome equals one if the home team eventually wins the match and zero otherwise.
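The construction described above can be sketched as follows. This is a minimal illustration only, not the authors' pipeline: the names `CUTOFFS`, `status_at`, and `match_features`, the feature-naming scheme, the use of non-cumulative windows, and in particular the weighting rule (an action's share of its match total) are all assumptions made for the example, since the text does not specify them.

```python
from collections import Counter

CUTOFFS = (15, 30, 45, 60, 75, 89)  # interval end points named in the text


def status_at(minute, goals):
    """Home-team status ('win', 'tie', 'loss') from goals scored up to `minute`.

    `goals` is a list of (minute, side) pairs with side in {'home', 'away'}.
    """
    home = sum(1 for m, s in goals if m <= minute and s == "home")
    away = sum(1 for m, s in goals if m <= minute and s == "away")
    if home > away:
        return "win"
    if home < away:
        return "loss"
    return "tie"


def match_features(events, goals):
    """Build weighted, status-interacted counts for one match.

    `events` is a list of (minute, action) pairs, e.g. (12, 'corner').
    Assumption: the weight of an action in a window is its share of that
    action's total occurrences in the match (one reading of 'intensity').
    Events after the last cutoff are ignored in this sketch.
    """
    totals = Counter(action for _, action in events)
    features = {}
    lower = 0
    for cutoff in CUTOFFS:
        # Count actions falling in the window (lower, cutoff].
        in_window = Counter(a for m, a in events if lower < m <= cutoff)
        status = status_at(cutoff, goals)
        for action, count in in_window.items():
            weight = count / totals[action]
            # Interaction term: weighted count crossed with match status.
            features[f"{action}_m{cutoff}_x_{status}"] = count * weight
        lower = cutoff
    return features
```

Each returned key corresponds to one candidate regressor (action, interval, status), so a match with two early corners while the score is level contributes to a single "corner, minute 15, tie" term.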

2.2 Robust logistic regression and quasi-deviance

We adopt the quasi-likelihood approach of Cantoni and Ronchetti [2001], whereby the logistic regression coefficients are estimated via M-estimators that bound the influence of large residuals. Let $Q(\hat{\boldsymbol{\beta}})$ be the robust quasi-deviance evaluated at $\hat{\boldsymbol{\beta}}$, and $\operatorname{tr}\bigl(\mathbf{P}^{-1}\mathbf{Q}\bigr)$ be the robust penalty term as described in Konishi and Kitagawa [1996] and further studied by Wong et al. [2014]. Then, the generalized information criterion (GIC) with parameter $\alpha$ is given by:

$$\text{GIC}_{\alpha} = -2\,Q(\hat{\boldsymbol{\beta}}) + n^{\alpha}\,\operatorname{tr}\bigl(\mathbf{P}^{-1}\mathbf{Q}\bigr), \tag{1}$$

where $n$ is the sample size and $\alpha\in[0,1]$ is a tuning parameter assessed via out-of-sample classification metrics. This penalty weight is motivated by the asymptotic-efficiency considerations for model selection in Shibata [1989], and was recently included in the Misspecification-Resistant Information Criterion (MRIC) of Hsu et al. [2019] for misspecified time series models. We compare $\text{GIC}_{\alpha}$ with classical AIC/BIC, robust AIC/BIC (Wong et al. [2014]), and the robust quasi-deviance test (Cantoni and Ronchetti [2001]), using each within a backward stepwise procedure as in Heritier et al. [2009].
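To make the role of $n^{\alpha}$ concrete, the sketch below evaluates Eq. (1) and plugs it into a greedy backward elimination. It is illustrative only: `gic_alpha`, `backward_select`, `toy_criterion`, and the `GAIN` table are hypothetical names, the numbers in `GAIN` are made up and are not results, and the sketch assumes that the criterion is minimized and that a larger $Q(\hat{\boldsymbol{\beta}})$ indicates a better fit. In a real analysis the criterion callable would refit the robust model for each subset and return its quasi-deviance and trace term.

```python
def gic_alpha(quasi_dev, trace_term, n, alpha):
    """Eq. (1): -2 * Q(beta_hat) + n**alpha * tr(P^{-1} Q)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return -2.0 * quasi_dev + (n ** alpha) * trace_term


def backward_select(terms, criterion):
    """Greedy backward elimination that minimizes `criterion(subset)`.

    `criterion` is a hypothetical callable taking a tuple of term names and
    returning the GIC_alpha of the model refitted on those terms.
    """
    current = tuple(terms)
    best = criterion(current)
    while len(current) > 1:
        # Score every model obtained by dropping exactly one term.
        scored = [
            (criterion(tuple(t for t in current if t != drop)), drop)
            for drop in current
        ]
        score, drop = min(scored)
        if score >= best:
            break  # no single deletion improves the criterion
        current = tuple(t for t in current if t != drop)
        best = score
    return current, best


# Toy stand-in for model fitting: made-up contributions to Q, not real results.
GAIN = {"corners": 40.0, "offsides": 3.0, "penalties": 30.0}
N = 3040  # number of matches reported in the text


def toy_criterion(subset, alpha=0.5):
    quasi_dev = sum(GAIN[t] for t in subset)  # pretend Q(beta_hat)
    trace_term = float(len(subset))           # pretend tr(P^{-1} Q)
    return gic_alpha(quasi_dev, trace_term, N, alpha)


selected, value = backward_select(GAIN, toy_criterion)
```

With these made-up numbers, $\alpha=0.5$ drops the weakly informative `offsides` term, whereas $\alpha=0$ (a penalty that does not grow with $n$) retains all three terms, which is the parsimony trade-off that the tuning of $\alpha$ is meant to control.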

3 Preliminary findings

We see improved classification by weighting in-game actions. The penalty choice in robust criteria influences which interactions persist for both unweighted and weighted variables. RAIC, RBIC, and $\text{GIC}_{\alpha}$ effectively manage high-leverage matches, producing subtle shifts in coefficients (e.g. offsides, corners, penalties) when adjusting the penalty. By bounding residuals, robust estimators reduce outlier effects around referee events, with fewer influential points and better diagnostics. Matches with extreme refereeing often show high residuals, but stepwise RAIC/RBIC selection refines residual diagnostics at different minutes. The $n^{\alpha}$ component balances predictive accuracy and parsimony, though a more extensive set of checks is advisable. Combining Cantoni–Ronchetti estimators with $\text{GIC}_{\alpha}$ is promising, and further refinements will be investigated.
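The residual bounding referred to above can be illustrated with Huber's $\psi$ function applied to Pearson residuals of a binary outcome. This is a simplified sketch: the function names are hypothetical, the tuning constant $c=1.345$ is a common default rather than a value taken from this study, and the Fisher-consistency correction and Mallows-type leverage weights of the full Cantoni–Ronchetti estimating equations are omitted.

```python
import math


def pearson_residual(y, mu):
    """Pearson residual for a Bernoulli outcome y with fitted probability mu."""
    return (y - mu) / math.sqrt(mu * (1.0 - mu))


def huber_psi(r, c=1.345):
    """Huber's psi: identity on [-c, c], clipped to +/-c outside.

    The default c = 1.345 is an assumption, not a value from the study.
    """
    return max(-c, min(c, r))


# A home win (y = 1) that the model considered very unlikely (mu = 0.02)
# has a large Pearson residual, which psi caps at c.
surprising = pearson_residual(1, 0.02)
bounded = huber_psi(surprising)

# A moderately unexpected home win (mu = 0.6) is left unchanged.
typical = pearson_residual(1, 0.6)
unchanged = huber_psi(typical)
```

The capped residual enters the estimating equations in place of the raw one, so a single surprising match cannot dominate the fitted coefficients.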

References

E. Cantoni and E. Ronchetti (2001) Robust inference for generalized linear models. Journal of the American Statistical Association 96 (455), pp. 1022–1030.
S. Heritier, E. Cantoni, S. Copt, and M.-P. Victoria-Feser (2009) Robust methods in biostatistics. Wiley Series in Probability and Statistics.
H. L. Hsu, C. K. Ing, and H. Tong (2019) On model selection from a finite family of possibly misspecified time series models. The Annals of Statistics 47 (2), pp. 1061–1087.
S. Konishi and G. Kitagawa (1996) Generalized information criteria in model selection. Biometrika 83 (4), pp. 875–890.
R. Shibata (1989) Statistical aspects of model selection. Springer.
R. K. W. Wong, F. Yao, and T. C. M. Lee (2014) Robust estimation for generalized additive models. Journal of Computational and Graphical Statistics 23 (1), pp. 270–289.