Classification of highly collinear infrared spectral data
S. Lubbe, N. J. le Roux, R. J. Cornelissen and H. H. Nieuwoudt
MuViSU (Centre for Multi-dimensional Data Visualisation), Department of Statistics and Actuarial Science, Stellenbosch University, Stellenbosch, South Africa; NITheCS (National Institute of Theoretical and Computational Sciences), Stellenbosch University, Stellenbosch, South Africa; Namaqua Wines, Vredendal, South Africa; South African Grape and Wine Research Institute, Department of Viticulture and Oenology, Stellenbosch University, Stellenbosch, South Africa
This study addresses a classification problem that requires variable selection from a large set of highly correlated predictors. In spectral data, measurements at neighbouring wavenumbers are typically strongly correlated, resulting in severe multicollinearity. We propose a variable selection approach that first clusters highly correlated variables and then selects a single representative from each cluster as a candidate for the classification model. The method also incorporates a constraint that prevents non-contiguous variables from being grouped within the same cluster. The approach is demonstrated using mid-infrared (MIR) spectral data from grape samples collected to assess grapevine bunch rot, an issue of economic importance to wineries. The objective is to classify new samples either as rot-affected or not (Yes/No), or into multiple categories reflecting the severity (%) of rot. From this reduced set of candidate variables, those important for classification are selected, and biplots are used to visualise the separation between classes.
Keywords: biplots, classification, multicollinearity, clustering, bunch rot
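
As a rough illustration of the two-step idea described in the abstract (cluster highly correlated variables subject to a contiguity constraint, then keep one representative per cluster), a minimal Python sketch follows. The abstract does not specify the clustering algorithm, so everything here is an assumption made for illustration: the greedy left-to-right scan, the correlation threshold of 0.95, the choice of representative as the member with the highest mean absolute correlation to its cluster, and the function names `contiguous_correlation_clusters` and `pick_representatives`. It is not the authors' implementation.

```python
import numpy as np


def contiguous_correlation_clusters(X, threshold=0.95):
    """Group adjacent columns of X into contiguous runs.

    Assumed rule (illustrative only): a column joins the current run if its
    absolute correlation with the first column of that run is at least
    `threshold`; otherwise it starts a new run. Because the scan only ever
    appends the next column, every cluster is contiguous by construction.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n_vars = corr.shape[0]
    clusters = [[0]]
    for j in range(1, n_vars):
        start = clusters[-1][0]
        if corr[start, j] >= threshold:
            clusters[-1].append(j)
        else:
            clusters.append([j])
    return clusters, corr


def pick_representatives(clusters, corr):
    """Pick one column index per cluster.

    Assumed rule (illustrative only): the member with the highest mean
    absolute correlation to the members of its own cluster.
    """
    reps = []
    for members in clusters:
        sub = corr[np.ix_(members, members)]
        reps.append(members[int(np.argmax(sub.mean(axis=1)))])
    return reps


# Synthetic stand-in for spectra: two blocks of five neighbouring
# "wavenumbers", each block driven by its own latent signal plus small noise.
rng = np.random.default_rng(0)
n_samples = 200
z1 = rng.normal(size=(n_samples, 1))
z2 = rng.normal(size=(n_samples, 1))
X_demo = np.hstack([
    z1 + 0.05 * rng.normal(size=(n_samples, 5)),
    z2 + 0.05 * rng.normal(size=(n_samples, 5)),
])

demo_clusters, demo_corr = contiguous_correlation_clusters(X_demo, threshold=0.95)
demo_reps = pick_representatives(demo_clusters, demo_corr)
print("clusters:", demo_clusters)
print("representatives:", demo_reps)
```

The representative columns (`X_demo[:, demo_reps]` in this sketch) would then form the reduced candidate set passed on to a classifier; any further selection among them, and the biplot visualisation, are separate steps not shown here.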