A Supervised Distance Metric for K-Nearest Neighbors with Mixed-Type Data

C. Cavicchia^a, M. van de Velden^a, A. Iodice D’Enza^b and A. Markos^c
^a Erasmus University Rotterdam, ^b University of Naples Federico II, ^c Democritus University of Thrace

K-Nearest Neighbors (KNN) is a widely used nonparametric method for classification and regression. It predicts the response of a new observation from the responses of the K closest training observations in the feature space. Its effectiveness depends critically on how distances between observations are defined [1]. This is especially important in heterogeneous datasets containing both numerical and categorical variables. Standard distances are designed for numerical data and are therefore not well suited to mixed-type settings. A classical alternative is to combine variable-specific contributions into a single measure. However, such distances typically do not incorporate information from the response variable, and they may also favor one variable type over another by construction [2]. In this paper, we introduce a supervised distance for heterogeneous data that combines variable-type-specific dissimilarities with response-based weighting. The proposed distance treats numerical and categorical variables differently while assigning greater importance to predictors that are more strongly associated with the response. As a result, observations with similar responses are encouraged to be closer, whereas observations with different responses tend to be farther apart. By incorporating response information directly into the distance construction, the proposed approach improves the discriminative power of KNN for mixed-data settings. The resulting framework is simple, interpretable, and suitable for applications in which heterogeneous predictors are common.

Keywords: K-Nearest Neighbors, mixed-type data, distance metric
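
The abstract does not state the distance formula, so the following is only a minimal sketch of the general idea (variable-type-specific dissimilarities combined with response-based weights), not the authors' method. It assumes a categorical response, a range-scaled absolute difference for numerical predictors, simple matching for categorical predictors, and weights given by eta squared (numerical) and squared Cramér's V (categorical). All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def eta_squared(x, y):
    """Share of the variance of numeric x explained by class labels y (0 to 1)."""
    n = len(x)
    mean = sum(x) / n
    ss_total = sum((v - mean) ** 2 for v in x)
    if ss_total == 0:
        return 0.0
    groups = defaultdict(list)
    for v, label in zip(x, y):
        groups[label].append(v)
    ss_between = sum(len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values())
    return ss_between / ss_total


def cramers_v_squared(x, y):
    """Squared Cramér's V between categorical x and class labels y (0 to 1)."""
    n = len(x)
    rows, cols, cells = Counter(x), Counter(y), Counter(zip(x, y))
    k = min(len(rows), len(cols)) - 1
    if k == 0:
        return 0.0
    chi2 = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n
            chi2 += (cells[(r, c)] - expected) ** 2 / expected
    return chi2 / (n * k)


def fit_distance(X, y, is_categorical):
    """Learn response-based weights from (X, y); return (distance function, weights)."""
    columns = list(zip(*X))
    weights, ranges = [], []
    for col, cat in zip(columns, is_categorical):
        if cat:
            weights.append(cramers_v_squared(col, y))  # assumed association measure
            ranges.append(None)
        else:
            weights.append(eta_squared(col, y))  # assumed association measure
            ranges.append((max(col) - min(col)) or 1.0)  # guard constant columns
    total = sum(weights) or 1.0
    weights = [w / total for w in weights]  # normalise so the weights sum to 1

    def distance(a, b):
        d = 0.0
        for j, cat in enumerate(is_categorical):
            if cat:
                d += weights[j] * (a[j] != b[j])  # simple matching: 0 or 1
            else:
                d += weights[j] * abs(a[j] - b[j]) / ranges[j]  # range-scaled
        return d

    return distance, weights


def knn_predict(X, y, query, distance, k=3):
    """Majority vote among the k training rows closest to query."""
    nearest = sorted(range(len(X)), key=lambda i: distance(X[i], query))[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]


# Hypothetical toy data: one numerical and two categorical predictors.
X = [(1.0, "red", "u"), (1.2, "red", "v"), (0.9, "blue", "u"),
     (3.0, "blue", "v"), (3.2, "red", "u"), (2.9, "blue", "v")]
y = ["A", "A", "A", "B", "B", "B"]
distance, weights = fit_distance(X, y, is_categorical=[False, True, True])
prediction = knn_predict(X, y, (3.1, "red", "u"), distance, k=3)
```

In this toy example the numerical predictor separates the two classes almost perfectly, so it receives most of the weight and dominates the neighbor search, while the weakly associated categorical predictors contribute little. Other association measures or per-type dissimilarities could be substituted without changing the overall structure.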

References

[1] T. Cover and P. Hart (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), 21–27.
[2] M. van de Velden, A. Iodice D’Enza, A. Markos, and C. Cavicchia (2024). Unbiased mixed variables distance. arXiv preprint arXiv:2411.00429.