Statistical inference for smoothed quantile regression with streaming data
Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, Canada [jinhan3@ualberta.ca; bei1@ualberta.ca; lkong@ualberta.ca]
Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, China [yanxiaodong@xjtu.edu.cn]
Keywords: Streaming data – Quantile regression – Smoothing – Confidence interval – High-dimensional data
1 Abstract and Motivation
In this paper, we tackle the problem of conducting valid statistical inference for quantile regression with streaming data. The main difficulties are that the quantile regression loss function is non-smooth and it is often infeasible to store the entire dataset in memory, rendering traditional methodologies ineffective. We introduce a fully online updating method for statistical inference in smoothed quantile regression with streaming data to overcome these issues. Our main contributions are twofold. First, for low-dimensional data, we present an incremental updating algorithm to obtain the smoothed quantile regression estimator with the streaming data set. The proposed estimator allows us to construct asymptotically exact statistical inference procedures. Second, within the realm of high-dimensional data, we develop an online debiased lasso procedure to accommodate the special sparse structure of streaming data. The proposed online debiased approach is updated with only the current data and summary statistics of historical data and corrects an approximation error term from online updating with streaming data. Furthermore, theoretical results such as estimation consistency and asymptotic normality are established to justify its validity in both settings. Our findings are supported by simulation studies and illustrated through real data applications.
2 Contributions
We introduce an online framework designed for estimation and statistical inference in quantile regression models, tailored specifically for streaming data and utilizing a convolution-type smoothing technique. Our key contributions are as follows:
• In the low-dimensional setting, we introduce an online renewable estimator that allows both estimation and inference to be updated in real time using the current data batch together with summary statistics of historical data. Under mild regularity conditions, the proposed online renewable estimator is asymptotically equivalent to the offline oracle estimator computed from the full dataset, so the online procedure adapts dynamically to new information without sacrificing the accuracy and reliability typically associated with offline analysis.
• In high-dimensional data analysis, regularization is crucial, and ℓ1-regularization is a natural choice. However, the lasso estimator lacks a tractable limiting distribution [Zhao and Yu, 2006, Bühlmann and Van De Geer, 2011], which limits its use in inference. To address this, we propose an online debiased lasso estimator tailored for streaming data, with online lasso estimation as a by-product. Our method is simple and efficient, requiring only the current data batch and summary statistics of historical data, so the full dataset never needs to be re-accessed. Under mild regularity conditions, we establish oracle inequalities for the online lasso estimator and asymptotic normality for the online debiased estimator, enabling valid statistical inference.
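The debiasing principle behind the second contribution can be illustrated in a simplified setting: a low-dimensional linear model with the exact inverse Gram matrix standing in for the nodewise-lasso surrogate that a genuine high-dimensional (or smoothed-QR) construction would require. The function name and notation below are ours, not the paper's:

```python
import numpy as np

def debias_coordinate(X, y, beta_sparse, j, m_j):
    """One-step debiasing of the j-th coordinate of a sparse (e.g. lasso)
    estimate: add m_j' X'(y - X beta) / n, where m_j approximates the
    j-th row of the inverse Gram matrix (X'X / n)^{-1}."""
    n = X.shape[0]
    score = X.T @ (y - X @ beta_sparse) / n   # residual score vector
    return beta_sparse[j] + m_j @ score
```

With m_j equal to the exact j-th row of (X'X/n)^{-1}, the correction removes the shrinkage bias entirely and the debiased coordinate coincides with the OLS coordinate; it is this bias removal that makes Wald-type confidence intervals possible.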
Our work differs from Han et al. [2021] and Luo et al. [2023] in key ways. First, we focus on quantile regression (QR), which models the dependence of response quantiles on predictors, whereas their methods address linear and generalized linear models. Second, unlike high-dimensional generalized linear models that assume a twice-differentiable loss function with local strong convexity (e.g., squared or logistic loss), QR’s check loss is non-differentiable at some points and lacks strong convexity, posing both theoretical and optimization challenges. Lastly, we derive explicit upper bounds for online lasso estimators with sub-Gaussian covariates, providing a clearer distinction between our oracle inequalities and traditional results [Van de Geer, 2008, Huang et al., 2013].
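The contrast between the non-smooth check loss and its convolution-smoothed counterpart can be made concrete. For a Gaussian kernel with bandwidth h, the smoothed loss has the closed form h·φ(u/h) + u·(τ − Φ(−u/h)), whose derivative τ − Φ(−u/h) is continuous everywhere and recovers the check-loss subgradient as h → 0. This is a standard convolution-smoothing identity, sketched in our own notation rather than taken from the paper:

```python
import numpy as np
from scipy.stats import norm

def check_loss(u, tau):
    """Standard quantile check loss: non-differentiable at u = 0."""
    return u * (tau - (u < 0))

def smoothed_loss(u, tau, h):
    """Convolution of the check loss with a Gaussian kernel of bandwidth h."""
    return h * norm.pdf(u / h) + u * (tau - norm.cdf(-u / h))

def smoothed_grad(u, tau, h):
    """Derivative of smoothed_loss in u: continuous, unlike the check loss."""
    return tau - norm.cdf(-u / h)
```

As h shrinks, smoothed_loss converges to check_loss pointwise, while for fixed h it is infinitely differentiable; this is what restores the Hessian-based machinery used in the online updating steps.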
3 Methodology
Starting from the oracle smoothed QR estimator, we propose an approximation procedure to obtain the online lasso estimator, in which the aggregated information matrix of the historical data plays the role of a summary statistic.
The overall online lasso estimation procedure is summarized in Figure 1. When a new data batch arrives, we update the online smoothed lasso estimator via a proximal gradient descent algorithm based on the historical summary statistics. Meanwhile, the previous cumulative dataset, which produced the previous online lasso estimator, can be regarded as a training set, while the newly arrived batch serves as a test set; the tuning parameter is then chosen by minimizing the quantile prediction error.
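A single batch update under this scheme can be sketched as proximal gradient descent on the smoothed quantile loss with an ℓ1 penalty, with the tuning parameter scored on the incoming batch. This is a minimal sketch under our own notation (the function names and the constant step size are assumptions; the paper's surrogate gradient additionally folds in the aggregated information matrix of historical batches):

```python
import numpy as np
from scipy.stats import norm

def smoothed_qr_grad(X, y, beta, tau, h):
    """Gradient of the Gaussian-kernel convolution-smoothed quantile loss."""
    r = y - X @ beta
    return -X.T @ (tau - norm.cdf(-r / h)) / len(y)

def soft_threshold(z, t):
    """Proximal operator of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_grad_lasso(X, y, beta0, tau, h, lam, step=0.1, n_iter=500):
    """Proximal gradient descent for the l1-penalized smoothed QR loss."""
    beta = beta0.copy()
    for _ in range(n_iter):
        grad = smoothed_qr_grad(X, y, beta, tau, h)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

def quantile_pred_error(X, y, beta, tau):
    """Average check loss on a held-out batch; used to choose lambda."""
    u = y - X @ beta
    return np.mean(u * (tau - (u < 0)))
```

In the streaming loop, candidate values of lam would be compared by quantile_pred_error on the newly arrived batch before that batch is absorbed into the summary statistics.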
References
- Bühlmann, P. and Van de Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.
- Han, R., Luo, L., Lin, Y., and Huang, J. (2021). Online debiased lasso for streaming data. arXiv preprint arXiv:2106.05925.
- Huang, J., Sun, T., Ying, Z., Yu, Y., and Zhang, C.-H. (2013). Oracle inequalities for the lasso in the Cox model. Annals of Statistics 41(3), 1142–1165.
- Luo, L., Han, R., Lin, Y., and Huang, J. (2023). Statistical inference in high-dimensional generalized linear models with streaming data. Electronic Journal of Statistics 17(2), 3443–3471.
- Van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. Annals of Statistics 36(2), 614–645.
- Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research 7, 2541–2563.