Efficient Inference with Predicted Data under Label Shift
Abstract
Recent advances in prediction-powered inference leverage AI models to enhance the quality of statistical inference in settings with both labeled and unlabeled data. Most of the existing literature assumes that labeled and unlabeled data follow the same distribution; however, various forms of distribution shift are common in practical applications. In this paper, we focus on the label shift structure and develop an AI-driven prediction-powered procedure for efficient inference on a general parameter characterizing the unlabeled data population. This approach broadens the applicability of prediction-powered inference in real-world scenarios. A key component in achieving efficient inference is modeling the outcome density ratio between the labeled and unlabeled data. We develop a progressive estimation process for this purpose, evolving through three stages: an initial heuristic guess, a consistent estimation, and ultimately, an efficient estimation. This self-driven evolutionary process is not standard in the statistical literature and is of independent interest. We rigorously establish the asymptotic properties of the proposed estimators and demonstrate their superior performance compared to existing methods. Through simulation studies and multiple real-world applications, we highlight both the theoretical contributions and the practical utility of our methods.
-
Department of Statistics, University of Seoul, Seoul, Korea [seongho@uos.ac.kr]
-
Department of Statistics, Penn State University, State College, USA [yanyuanma@gmail.com]
-
Department of Statistics/Biostatistics, University of Wisconsin, Madison, USA [jiwei.zhao@wisc.edu]
Keywords: Efficient – Prediction-powered inference – Robustness – Semiparametrics – Semi-supervised