TY - JOUR
T1 - Unbiased prediction errors for partial least squares regression models
T2 - Choosing a representative error estimator for process monitoring
AU - Skou, Peter B
AU - Tonolini, Margherita
AU - Eskildsen, Carl Emil
AU - Berg, Frans Van Den
AU - Rasmussen, Morten Arendt
PY - 2023
Y1 - 2023
N2 - Partial least squares (PLS) regression is widely used to predict chemical analytes from spectroscopic data, thus reducing the need for expensive and time-consuming wet chemical reference analysis in industrial process monitoring. However, predictions via PLS by definition carry sample-specific errors, and estimation of these errors is essential for correct interpretation of results. To increase trust in PLS regression-based predictions, reliable prediction error estimates must be reported. This can be achieved by determining realistic sample-specific prediction errors using an unbiased mean squared prediction error estimate. This work provides a guide for estimating sample-specific prediction errors, showing the importance of choosing an appropriate error estimator prior to deploying PLS models for industrial applications. We reviewed recent and established methods for estimating the sample-specific prediction error and tested them through simulation studies. The methods were subsequently applied for estimating prediction errors in two real-life datasets from the food ingredients industry, where near-infrared spectroscopy was used to quantify i) urea in process water and ii) individual protein concentrations in ultrafiltration retentates from a protein fractionation process. Both the simulations and real data examples showed that the mean squared error of calibration is always a downward biased estimator. Although leave-one-out cross-validation performed surprisingly well in the data analysed in this work, this paper demonstrated that the appropriate choice of error estimator requires the user to make an informed, data-centered decision.
AB - Partial least squares (PLS) regression is widely used to predict chemical analytes from spectroscopic data, thus reducing the need for expensive and time-consuming wet chemical reference analysis in industrial process monitoring. However, predictions via PLS by definition carry sample-specific errors, and estimation of these errors is essential for correct interpretation of results. To increase trust in PLS regression-based predictions, reliable prediction error estimates must be reported. This can be achieved by determining realistic sample-specific prediction errors using an unbiased mean squared prediction error estimate. This work provides a guide for estimating sample-specific prediction errors, showing the importance of choosing an appropriate error estimator prior to deploying PLS models for industrial applications. We reviewed recent and established methods for estimating the sample-specific prediction error and tested them through simulation studies. The methods were subsequently applied for estimating prediction errors in two real-life datasets from the food ingredients industry, where near-infrared spectroscopy was used to quantify i) urea in process water and ii) individual protein concentrations in ultrafiltration retentates from a protein fractionation process. Both the simulations and real data examples showed that the mean squared error of calibration is always a downward biased estimator. Although leave-one-out cross-validation performed surprisingly well in the data analysed in this work, this paper demonstrated that the appropriate choice of error estimator requires the user to make an informed, data-centered decision.
U2 - 10.1177/09670335231173139
DO - 10.1177/09670335231173139
M3 - Journal article
VL - 31
SP - 186
EP - 195
JO - Journal of Near Infrared Spectroscopy
JF - Journal of Near Infrared Spectroscopy
SN - 0967-0335
IS - 4
ER -