MIPLS2: Exploiting PLS2 to impute missing values in a two-block system with multiple response variables

Ashkan Tashk*, Klavs Martin Sørensen, Søren Balling Engelsen, Kim Steenstrup Pedersen, Carl Emil Eskildsen

*Corresponding author for this work

Publication: Contribution to journal › Journal article › Research › peer-review

Abstract

Background: Missing values in datasets with multiple response variables present a significant challenge in many scientific fields. The most common conventional imputation methods simply substitute the variable mean or median. More advanced methods leverage covariance with other variables when imputing missing values. However, these methods use information from only a single data block (e.g., the Y-block), leaving the information in a possibly related data block (e.g., the X-block) unexploited. Incorporating information from both data blocks could significantly enhance the accuracy and reliability of imputations.

Results: We present Multiple Imputations with PLS2 (MIPLS2), which exploits the covariance between two data blocks to handle missing values in a multivariate Y-block (response variables) paired with a complete multivariate X-block (explanatory variables). The method follows a two-phase iterative procedure: Phase I uses PLS2 regression to predict and impute missing values, while Phase II refines the imputations until convergence. MIPLS2 was tested on a dataset of 890 milk samples, with fatty acid quantities measured by gas chromatography as response variables. It was compared with conventional imputation methods (MICE, MiceForest, missMDA, MIPCA) in a setting where the X-block was combined with the Y-block containing missing values. Performance was evaluated under two scenarios: values missing at random (Case Yrand) and systematically missing values (Case Ysyst). MIPLS2 consistently outperformed the alternative methods, achieving lower imputation errors. Monte Carlo simulations confirmed its superior performance across missing data rates from 10 % to 50 % for both missing data patterns.

Significance: MIPLS2 effectively addresses complex missing data scenarios. Regression results using PLS1 and PLS2 with imputed response variables demonstrate that MIPLS2 has a greater positive impact on prediction outputs than competing imputation methods. This holds for both random missing values (Case Yrand) and systematic missing values (Case Ysyst). These findings underscore MIPLS2 as a robust solution for imputing missing values in systems where the values of response variables are labile or systematically absent.

Original language: English
Article number: 344134
Journal: Analytica Chimica Acta
Volume: 1364
Number of pages: 12
ISSN: 0003-2670
DOI
Status: Published - 2025

Bibliographical note

Publisher Copyright:
© 2025 The Authors
