Optimal Self-Distillation in Ridge Regression: Sharp Asymptotics and One-Shot Tuning

D. Hienᵃ, P. Patilᵃ and A. Rinaldoᵃ

ᵃThe University of Texas at Austin

Self-distillation (SD) is the process of retraining a student model on a mixture of ground-truth labels and a teacher model’s predictions, using the same architecture and training data. While SD has been empirically shown to improve generalization in regression tasks, a rigorous theoretical understanding of its mechanics and properties remains elusive. We study SD for ridge regression in the general unconstrained setting, where the mixing weight is allowed to lie outside the unit interval. Without any distributional assumptions, we prove that the squared prediction risk (including the out-of-distribution risk) of the optimally mixed student model strictly improves upon that of the teacher model for every value of regularization at which the teacher’s risk is non-stationary. We express the optimal mixing weight in terms of the teacher’s risk derivative, thereby characterizing the somewhat surprising scenarios in which a negative weight is optimal. To quantify the magnitude of these improvements, we derive exact risk asymptotics in the proportional regime under general anisotropic covariances and deterministic signals. Building on this theory, we propose a novel one-shot tuning procedure to consistently estimate the optimal mixing weight without retraining, sample splitting, or grid search. Experiments on real-world regression tasks and pre-trained neural network features validate our theoretical predictions and demonstrate the effectiveness of the proposed tuning methodology.

Keywords: Self-Distillation, Ridge Regression, Tuning.
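
As a concrete illustration of the setup described in the abstract, the following is a minimal sketch of one round of self-distillation for ridge regression. The parameterization is an assumption, since the abstract does not fix conventions: the mixing weight `xi` multiplies the teacher's predictions and `1 - xi` multiplies the labels, the student reuses the teacher's regularization level `lam`, and the Gram matrix is normalized by `n`. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge coefficients solving (X^T X / n + lam * I) beta = X^T y / n."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)


def self_distilled_ridge(X, y, lam, xi):
    """One round of self-distillation under the assumed parameterization.

    The student is refit on the same X with mixed targets
    (1 - xi) * y + xi * (teacher predictions). The weight xi is
    unconstrained, so it may be negative or exceed 1.
    """
    beta_teacher = ridge_fit(X, y, lam)
    y_mixed = (1.0 - xi) * y + xi * (X @ beta_teacher)
    beta_student = ridge_fit(X, y_mixed, lam)
    return beta_teacher, beta_student


# Hypothetical synthetic data, used only to exercise the functions.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# A negative mixing weight is allowed in the unconstrained setting.
beta_teacher, beta_student = self_distilled_ridge(X, y, lam=0.1, xi=-0.5)
```

Under this parameterization, `xi = 0` recovers the teacher, and the student's coefficients are affine in `xi` because ridge regression is linear in the targets. A single extra fit at `xi = 1` therefore determines the student for every mixing weight.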