Mind the Gap: Investigating the Impact of Data Leakage on Machine Learning Predictive Models
Abstract
Data leakage is one of the most critical and underestimated threats to the validity of machine learning (ML) model evaluation, often leading to substantially inflated performance estimates. While the concept has been acknowledged in the literature, this is the first study to provide large-scale empirical evidence of its effects across diverse modeling scenarios. We systematically investigated the impact of data leakage introduced at four common pipeline stages (normalization, imputation, feature selection, and hyperparameter tuning) using 30 datasets and six supervised learning algorithms. Our results show that leakage significantly inflated performance in nearly half of the datasets (p < 0.05), with feature selection causing the strongest distortions. Support Vector Classifiers were particularly affected, showing large average gains and highly significant differences. Normalization and imputation led to minimal or statistically insignificant changes. These findings offer robust empirical evidence of how different leakage scenarios can distort model evaluation, underscoring the importance of rigorous pipeline design and validation practices in ML research and applications.
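The feature-selection leakage the abstract identifies as the strongest distortion can be illustrated with a minimal, self-contained sketch (plain Python on synthetic noise data; the dataset sizes, the correlation-based selector, and the majority-vote classifier below are illustrative assumptions, not the paper's protocol). Because the labels are pure noise, any test accuracy above chance for the "leaky" variant is inflation caused by selecting features with the test labels in view:

```python
import random

random.seed(0)

n, d, k = 200, 2000, 10          # samples, noise features, features kept
X = [[random.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]
y = [random.choice((0, 1)) for _ in range(n)]                 # pure-noise labels
train, test = list(range(0, 100)), list(range(100, 200))      # fixed split

def top_k_by_correlation(rows):
    """Score each feature by |mean value among y=1 minus y=0| on `rows`."""
    scores = []
    for j in range(d):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        diff = sum(pos) / len(pos) - sum(neg) / len(neg)
        scores.append((abs(diff), j, 1.0 if diff >= 0 else -1.0))
    scores.sort(reverse=True)
    return [(j, sign) for _, j, sign in scores[:k]]

def vote_accuracy(selected, rows):
    """Majority vote of the sign-aligned selected features, scored on `rows`."""
    correct = 0
    for i in rows:
        score = sum(sign * X[i][j] for j, sign in selected)
        correct += (1 if score >= 0 else 0) == y[i]
    return correct / len(rows)

# Leaky: features chosen using ALL rows, i.e., test labels influence selection.
leaky_acc = vote_accuracy(top_k_by_correlation(train + test), test)
# Honest: features chosen on the training rows only, then scored on the test rows.
honest_acc = vote_accuracy(top_k_by_correlation(train), test)
# Labels are noise, so the honest variant cannot beat chance in expectation,
# while the leaky variant typically scores well above 0.5 on these same rows.
print(f"leaky={leaky_acc:.2f} honest={honest_acc:.2f}")
```

The same failure mode arises whenever any fitted preprocessing step (scalers, imputers, selectors, tuned hyperparameters) sees the evaluation data; the standard remedy is to fit every such step inside the cross-validation loop, on training folds only.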
Published
29/09/2025
How to Cite
BECKER, Augusto Exenberger; RECAMONDE-MENDOZA, Mariana. Mind the Gap: Investigating the Impact of Data Leakage on Machine Learning Predictive Models. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 166-180. ISSN 2643-6264.
