Combating Class Imbalance for Infant Mortality Risk Modeling: Resampling Strategies in Brazil's Unified Health System
Abstract
This study addresses infant mortality (IM), a global public health challenge, by developing predictive machine learning models on nationally representative data from Brazil’s DATASUS system. Using birth records (SINASC, 2018–2022) and death records (SIM, 2018–2023), the research preprocesses the data, applies probabilistic entity matching, and enriches the features via Principal Component Analysis (PCA). To counteract the severe class imbalance between infant deaths and non-deaths, six resampling techniques are evaluated, each with and without PCA: undersampling (Random Undersampler, Edited Nearest Neighbours), oversampling (Random Oversampler, ADASYN), and hybrid methods (SMOTETomek, SMOTEENN). The classifier is XGBoost with default hyperparameters. Random Undersampler (RU) achieves the highest recall (0.8031), which is critical for identifying true IM cases, while Random Oversampler (ROS) yields the best precision (0.6451), minimizing false positives. Edited Nearest Neighbours (ENN) with PCA achieves the highest F1-score (0.4488), the best balance between precision and recall. The study concludes that the choice of sampling strategy should follow clinical priorities: RU to maximize detection, ROS for reliable positive predictions, and ENN for balanced performance. Integrating PCA had a negligible impact on the results.
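As a rough illustration of the two simplest strategies named in the abstract, the sketch below implements random undersampling and random oversampling using only the Python standard library, plus the precision, recall, and F1 metrics used to compare them. It is not the authors' code: the function names, the toy data, and the 990-to-10 class ratio are assumptions made for illustration, and the abstract does not say which library the study used.

```python
import random
from collections import Counter


def _group_by_class(X, y):
    """Map each label to the list of feature rows that carry it."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    return by_class


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = min(len(rows) for rows in by_class.values())
    pairs = []
    for label, rows in by_class.items():
        pairs.extend((row, label) for row in rng.sample(rows, target))
    rng.shuffle(pairs)
    return [row for row, _ in pairs], [label for _, label in pairs]


def random_oversample(X, y, seed=0):
    """Randomly duplicate rows until every class is as large as the biggest one."""
    rng = random.Random(seed)
    by_class = _group_by_class(X, y)
    target = max(len(rows) for rows in by_class.values())
    pairs = []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        pairs.extend((row, label) for row in rows + extra)
    rng.shuffle(pairs)
    return [row for row, _ in pairs], [label for _, label in pairs]


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Synthetic toy training split (not DATASUS data): 990 negatives, 10 positives.
X_train = [[i] for i in range(1000)]
y_train = [1 if i % 100 == 0 else 0 for i in range(1000)]

X_ru, y_ru = random_undersample(X_train, y_train)
X_ros, y_ros = random_oversample(X_train, y_train)
print("original:    ", Counter(y_train))
print("undersampled:", Counter(y_ru))
print("oversampled: ", Counter(y_ros))
```

In a sketch like this, resampling would normally be applied to the training split only, so that the test set keeps the original class ratio and the reported precision and recall reflect real-world prevalence. Whether the study followed that protocol is not stated in the abstract.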
Keywords:
Data Imbalance, Infant Mortality, Machine Learning, Resampling Techniques
Published
2025-09-29
How to Cite
MORSOLETO, Ricardo; SILVA, Vinícius A.; CALIARI, Juliano de S.; MIRANDA, Simone Mara F.; FERREIRA, Hiran Nonato M. Combating Class Imbalance for Infant Mortality Risk Modeling: Resampling Strategies in Brazil's Unified Health System. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 13., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 57-64. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2025.247778.
