Mind the Gap: Investigating the Impact of Data Leakage on Machine Learning Predictive Models
Abstract
Data leakage is one of the most critical and underestimated threats to the validity of machine learning (ML) model evaluation, often leading to substantially inflated performance estimates. While the concept has been acknowledged in the literature, this is the first study to provide large-scale empirical evidence of its effects across diverse modeling scenarios. We systematically investigated the impact of data leakage introduced at four common pipeline stages (normalization, imputation, feature selection, and hyperparameter tuning) using 30 datasets and six supervised learning algorithms. Our results show that leakage significantly inflated performance in nearly half of the datasets (p < 0.05), with feature selection causing the strongest distortions. Support Vector Classifiers were particularly affected, showing large average gains and highly significant differences. Normalization and imputation led to minimal or statistically insignificant changes. These findings offer robust empirical evidence of how different leakage scenarios can distort model evaluation, underscoring the importance of rigorous pipeline design and validation practices in ML research and applications.
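The feature-selection leakage the abstract identifies as the strongest distortion can be illustrated with a minimal, self-contained sketch (plain Python on synthetic noise data; the dataset sizes, the correlation-based selector, and the majority-vote classifier below are illustrative assumptions, not the paper's protocol). Because the labels are pure noise, any test accuracy above chance for the "leaky" variant is inflation caused by selecting features with the test labels in view:

```python
import random

random.seed(0)

n, d, k = 200, 2000, 10          # samples, noise features, features kept
X = [[random.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]
y = [random.choice((0, 1)) for _ in range(n)]                 # pure-noise labels
train, test = list(range(0, 100)), list(range(100, 200))      # fixed split

def top_k_by_correlation(rows):
    """Score each feature by |mean value among y=1 minus y=0| on `rows`."""
    scores = []
    for j in range(d):
        pos = [X[i][j] for i in rows if y[i] == 1]
        neg = [X[i][j] for i in rows if y[i] == 0]
        diff = sum(pos) / len(pos) - sum(neg) / len(neg)
        scores.append((abs(diff), j, 1.0 if diff >= 0 else -1.0))
    scores.sort(reverse=True)
    return [(j, sign) for _, j, sign in scores[:k]]

def vote_accuracy(selected, rows):
    """Majority vote of the sign-aligned selected features, scored on `rows`."""
    correct = 0
    for i in rows:
        score = sum(sign * X[i][j] for j, sign in selected)
        correct += (1 if score >= 0 else 0) == y[i]
    return correct / len(rows)

# Leaky: features chosen using ALL rows, i.e., test labels influence selection.
leaky_acc = vote_accuracy(top_k_by_correlation(train + test), test)
# Honest: features chosen on the training rows only, then scored on the test rows.
honest_acc = vote_accuracy(top_k_by_correlation(train), test)
# Labels are noise, so the honest variant cannot beat chance in expectation,
# while the leaky variant typically scores well above 0.5 on these same rows.
print(f"leaky={leaky_acc:.2f} honest={honest_acc:.2f}")
```

The same failure mode arises whenever any fitted preprocessing step (scalers, imputers, selectors, tuned hyperparameters) sees the evaluation data; the standard remedy is to fit every such step inside the cross-validation loop, on training folds only.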
Published
29/09/2025
How to Cite
BECKER, Augusto Exenberger; RECAMONDE-MENDOZA, Mariana. Mind the Gap: Investigating the Impact of Data Leakage on Machine Learning Predictive Models. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 166-180. ISSN 2643-6264.
