Intervening in problematic data regions to improve machine learning models
Abstract
Debugging machine learning models is essential to improving their robustness and performance. This work explores a data-driven debugging approach based on problematic data regions: subsets of the data on which the model performs poorly relative to other subsets. These regions often reflect problems such as class imbalance, bias, or unfairness. We propose to improve model performance by focusing primarily on the data: a specialized algorithm identifies problematic regions in the dataset, and targeted interventions are then applied to them. A common cause of poor performance is class imbalance within a problematic region. In such cases, we apply data augmentation in a controlled manner, avoiding the excessive introduction of synthetic data. Our experiments demonstrate that model performance can be improved by intervening exclusively in problematic data regions rather than in the entire dataset.
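The two steps described above (finding problematic regions, then augmenting them in a controlled way) can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the region-finding method, so the sketch assumes the simplest stand-in, single-feature slices whose error rate exceeds the overall error rate by a margin, and it uses plain random oversampling with a cap on the number of added rows. The function names, the `min_size`, `margin`, and `max_new_fraction` parameters, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def find_problematic_regions(rows, preds, features, min_size=5, margin=0.1):
    """Return (feature, value, error_rate) for single-feature slices whose
    error rate exceeds the overall error rate by at least `margin`.
    Assumption: a stand-in for the paper's region-finding algorithm."""
    errors = [int(r["label"] != p) for r, p in zip(rows, preds)]
    overall = sum(errors) / len(rows)
    stats = defaultdict(lambda: [0, 0])  # (feature, value) -> [errors, count]
    for r, e in zip(rows, errors):
        for f in features:
            s = stats[(f, r[f])]
            s[0] += e
            s[1] += 1
    regions = [
        (f, v, err / n)
        for (f, v), (err, n) in stats.items()
        if n >= min_size and err / n >= overall + margin
    ]
    return sorted(regions, key=lambda t: -t[2])  # worst region first


def augment_region(rows, feature, value, max_new_fraction=0.5, seed=0):
    """Duplicate randomly chosen minority-class rows inside one region.
    At most max_new_fraction * (region size) rows are added, so the amount
    of synthetic data stays bounded. Assumption: random oversampling is
    used here only as the simplest possible augmentation."""
    rng = random.Random(seed)
    region = [r for r in rows if r[feature] == value]
    counts = Counter(r["label"] for r in region)
    if len(counts) < 2:
        return list(rows)  # a single class in the region: nothing to balance
    minority, n_min = min(counts.items(), key=lambda kv: kv[1])
    n_maj = max(counts.values())
    budget = int(max_new_fraction * len(region))
    n_new = min(n_maj - n_min, budget)
    pool = [r for r in region if r["label"] == minority]
    return list(rows) + [dict(rng.choice(pool)) for _ in range(n_new)]


# Toy data: the hypothetical model always predicts 0 in city "A".
rows = (
    [{"city": "A", "label": 0} for _ in range(6)]
    + [{"city": "A", "label": 1} for _ in range(4)]
    + [{"city": "B", "label": 0} for _ in range(5)]
    + [{"city": "B", "label": 1} for _ in range(5)]
)
preds = [0] * 10 + [0] * 5 + [1] * 5

regions = find_problematic_regions(rows, preds, ["city"])
feature, value, _ = regions[0]
augmented = augment_region(rows, feature, value, max_new_fraction=0.1)
```

In this toy setup only the region `city = "A"` is flagged, and the cap of 10% of the region size limits the augmentation to a single added row even though two would be needed to balance the classes. In practice the model would be retrained on the augmented data and re-evaluated on held-out data.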
Keywords:
Machine learning, model debugging, data augmentation, problematic regions
References
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.
Chung, Y., Kraska, T., Polyzotis, N., Tae, K. H., and Whang, S. E. (2020). Automated data slicing for model validation: A big data - AI integration approach. IEEE Transactions on Knowledge and Data Engineering, 32(12):2284–2296.
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences. Lawrence Erlbaum Associates, Hillsdale, NJ, 2nd edition.
El Gebaly, K., Agrawal, P., Golab, L., Korn, F., and Srivastava, D. (2014). Interpretable and informative explanations of outcomes. Proceedings of the VLDB Endowment, 8(1):61–72.
Foster, D. P. and Stine, R. A. (2008). α-investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(2):429–444.
Kamiran, F. and Calders, T. (2012). Data preprocessing techniques for classification without discrimination. Knowledge and Information Systems, 33(1):1–33.
Kerrigan, D. and Bertini, E. (2023). SliceLens: Guided exploration of machine learning datasets. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pages 1–7.
Kohavi, R. (1996). Census Income. UCI Machine Learning Repository. DOI: 10.24432/C5GP7S.
Lin, Y., Gupta, S., and Jagadish, H. (2024). Mitigating subgroup unfairness in machine learning classifiers: A data-driven approach. In 2024 IEEE 40th International Conference on Data Engineering (ICDE), pages 2151–2163. IEEE.
Liu, H. and Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, USA.
Moro, S., Rita, P., and Cortez, P. (2012). Bank Marketing. UCI Machine Learning Repository. DOI: 10.24432/C5K306.
Pastor, E., De Alfaro, L., and Baralis, E. (2021). Looking for trouble: Analyzing classifier behavior via pattern divergence. In Proceedings of the 2021 International Conference on Management of Data, pages 1400–1412.
Polyzotis, N., Roy, S., Whang, S. E., and Zinkevich, M. (2017). Data management challenges in production machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1723–1726.
Ribeiro, V., Pena, E. H. M., Saldanha, R., Akbarinia, R., Valduriez, P., Khan, F., Stoyanovich, J., and Porto, F. (2023). Subset modelling: A domain partitioning strategy for data-efficient machine-learning. In Anais do XXXVIII Simpósio Brasileiro de Bancos de Dados, pages 318–323, Porto Alegre, RS, Brasil. SBC.
Sagadeeva, S. and Boehm, M. (2021). SliceLine: Fast, linear-algebra-based slice finding for ML model debugging. In Proceedings of the 2021 International Conference on Management of Data, pages 2290–2299.
Sakar, C. and Kastro, Y. (2018). Online Shoppers Purchasing Intention Dataset. UCI Machine Learning Repository. DOI: 10.24432/C5F88Q.
Sharma, A., Jain, A., Gupta, P., and Chowdary, V. (2021). Machine learning applications for precision agriculture: A comprehensive review. IEEE Access, 9:4843–4873.
Shehab, M., Abualigah, L., Shambour, Q., Abu-Hashem, M. A., Shambour, M. K. Y., Alsalibi, A. I., and Gandomi, A. H. (2022). Machine learning in medical applications: A review of state-of-the-art methods. Computers in Biology and Medicine, 145:105458.
Yang, L. and Shami, A. (2020). On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing, 415:295–316.
Zhang, X., Ono, J. P., Song, H., Gou, L., Ma, K.-L., and Ren, L. (2023). SliceTeller: A data slice-driven approach for machine learning model validation. IEEE Transactions on Visualization and Computer Graphics, 29(1):842–852.
Published
29/09/2025
How to Cite
WILLIAN, Gregully; PORTO, Fabio; PENA, Eduardo H. M. Intervening in problematic data regions to improve machine learning models. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 385-398. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247255.
