Prediction of Infant Mortality in Brazil using Machine Learning and Entity Matching on Brazilian Unified Health System's Data
Abstract
This study applies Machine Learning (ML) to predict infant mortality (IM) in Brazil by integrating two key DataSUS databases, SINASC (live births) and SIM (mortality), through probabilistic Record Linkage (Entity Matching). Four supervised ML models were tested: Decision Tree (DT), Logistic Regression (LR), Naive Bayes (NB), and Extreme Gradient Boosting (XGBoost, XGB). The analysis used demographic, obstetric, prenatal, and newborn variables. Despite achieving high overall accuracy (>90%), all models showed low precision (<0.5) and low F1-scores (at most 0.44, for XGB) in identifying actual death cases. This poor performance on the minority class (deaths, only 0.81% of records) highlights the challenges posed by severe class imbalance, which persisted even after applying the SMOTE oversampling technique. XGBoost yielded the best, though still insufficient, results among the models. The study also found higher mortality ratios for Black infants, males, and those born in the North and Northeast regions. While reinforcing ML's relevance for public health analysis, the results underscore the difficulty of reliably predicting rare events such as IM with the current approach. The authors conclude that improved data balancing, alternative Entity Matching techniques, and deep learning models are necessary next steps toward a robust predictive tool to support IM reduction policies in Brazil.
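The gap between accuracy and precision or F1 reported above follows from the class proportions. The sketch below is illustrative only and is not the authors' evaluation code. It assumes a hypothetical cohort of 100,000 records with the 0.81% death share from the abstract. The confusion-matrix counts for the second classifier are invented for illustration and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive (death) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


N_RECORDS = 100_000   # hypothetical cohort size, chosen only for round numbers
DEATH_RATE = 0.0081   # minority-class share stated in the abstract
n_deaths = round(N_RECORDS * DEATH_RATE)

# Baseline: a trivial classifier that predicts "survives" for every record.
baseline = classification_metrics(tp=0, fp=0, fn=n_deaths, tn=N_RECORDS - n_deaths)

# Hypothetical classifier: invented counts (NOT from the paper) that find about
# half of the deaths while raising more false alarms than true detections.
hypothetical = classification_metrics(tp=400, fp=700, fn=410, tn=98_490)

for name, scores in (("baseline", baseline), ("hypothetical", hypothetical)):
    summary = ", ".join(f"{metric}={value:.4f}" for metric, value in scores.items())
    print(f"{name}: {summary}")
```

Under these assumptions the trivial baseline already reaches 99.19% accuracy without detecting a single death, so accuracy above 90% says little about the minority class. Precision, recall, and F1 on the death class are the more informative figures.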
Keywords:
Infant Mortality Prediction, Machine Learning, SMOTE, XGBoost
References
Ali, M. M., Paul, B. K., Ahmed, K., Bui, F. M., Quinn, J. M., and Moni, M. A. Heart disease prediction using supervised machine learning algorithms: Performance analysis and comparison. Computers in Biology and Medicine vol. 136, pp. 104672, 2021.
Barlaug, N. and Gulla, J. A. Neural networks for entity matching: A survey. ACM Transactions on Knowledge Discovery from Data (TKDD) 15 (3): 1–37, 2021.
Barros, G. A., Oliveira, L. M., Silva, R. F., and Costa, F. R. Graph neural networks for databases: A survey. ACM Computing Surveys 57 (1): 1–38, 2025.
Batista, A. F. M., Diniz, C. S. G., Bonilha, E. A., Kawachi, I., and Filho, A. D. P. C. Neonatal mortality prediction with routinely collected data: a machine learning approach. BMC Pediatrics, 2021.
Bedi, S., Liu, Y., Orr-Ewing, L., Dash, D., Koyejo, S., Callahan, A., Fries, J. A., Wornow, M., Swaminathan, A., Lehmann, L. S., Hong, H. J., Kashyap, M., Chaurasia, A. R., Shah, N. R., Singh, K., Tazbaz, T., Milstein, A., Pfeffer, M. A., and Shah, N. H. A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs). medRxiv preprint 2024.04.15.24305869, 2024.
Bizzego, A., Gabrieli, G., Bornstein, M. H., Deater-Deckard, K., Lansford, J. E., Bradley, R. H., Costa, M., and Esposito, G. Predictors of contemporary under-5 child mortality in low- and middle-income countries: A machine learning approach. International Journal of Environmental Research and Public Health 18 (3): 1315, 2021.
Bugelli, A., Silva, R. B. D., Dowbor, L., and Sicotte, C. The determinants of infant mortality in Brazil, 2010–2020: A scoping review. International Journal of Environmental Research and Public Health, 2021.
Chivardi, C., Zamudio Sosa, A., Cavalcanti, D. M., Ordoñez, J. A., Diaz, J. F., Zuluaga, D., Almeida, C., Serván-Mori, E., Hessel, P., Moncayo, A. L., et al. Understanding the social determinants of child mortality in Latin America over the last two decades: a machine learning approach. Scientific Reports 13 (1): 20839, 2023.
Conway-Jones, R., James, A., Goldacre, M. J., and Seminog, O. O. Risk of self-harm in patients with eating disorders: English population-based national record-linkage study, 1999–2021. International Journal of Eating Disorders 57 (1): 162–172, 2024.
da Frota, L. M., Hasegawa, M., and Jacinto, P. Infant mortality in Brazil: A survival analysis using machine learning models, 2024.
Ministério da Saúde. Manual de Vigilância do Óbito Infantil e Fetal e do Comitê de Prevenção do Óbito Infantil Fetal. Ministério da Saúde, 2009.
De Bruin, J. Python Record Linkage Toolkit: A toolkit for record linkage and duplicate detection in Python, 2019.
Dhokotera, T. G., Muchengeti, M., Davidović, M., Rohner, E., Olago, V., Egger, M., and Bohlius, J. Gynaecologic and breast cancers in women living with HIV in South Africa: A record linkage study. International Journal of Cancer 154 (2): 284–296, 2024.
Flores-Quispe, M. d. P., Duro, S. M. S., Blumenberg, C., Facchini, L., Zibel, A. B., and Tomasi, E. Quality of newborn healthcare in the first week of life in Brazil’s primary care network: a cross-sectional multilevel analysis of the national programme for improving primary care access and quality – PMAQ. BMJ Open, 2022.
Iqbal, F., Satti, M. I., Irshad, A., and Shah, M. A. Predictive analytics in smart healthcare for child mortality prediction using a machine learning approach. Open Life Sciences 18 (1): 20220609, 2023.
Jesus, E. M. d., Calais-Ferreira, L., and Barreto, M. E. Matched-pair analysis using machine learning to predict 1-year mortality in newborn twins. Brazilian Symposium on Computing Applied to Health (SBCAS 2020), 2020.
Jorge, M. H. P. d. M., Laurenti, R., and Gotlieb, S. L. D. Análise da qualidade das estatísticas vitais brasileiras: a experiência de implantação do SIM e do SINASC. Ciência & Saúde Coletiva vol. 12, pp. 643–654, 2007.
Li, X., Zhang, W., Sun, Q., Wang, H., and Liu, J. Next-generation database interfaces: A survey of LLM-based text-to-SQL. Information Systems vol. 115, pp. 102235, 2024.
Mfateneza, E., Rutayisire, P. C., Biracyaza, E., Musafiri, S., and Mpabuka, W. G. Application of machine learning methods for predicting infant mortality in Rwanda: analysis of Rwanda Demographic Health Survey 2014–15 dataset. BMC Pregnancy and Childbirth, 2022.
World Health Organization. Infant mortality, 2020.
Paul, S. G., Saha, A., Hasan, M. Z., Noori, S. R. H., and Moustafa, A. A Systematic Review of Graph Neural Network in Healthcare-Based Applications: Recent Advances, Trends, and Future Directions. IEEE Access vol. 12, pp. 15145–15170, 2024.
Ranbaduge, T., Christen, P., and Schnell, R. Large scale record linkage in the presence of missing data. arXiv preprint arXiv:2104.09677, 2021.
Reidpath, D. D. and Allotey, P. Infant mortality rate as an indicator of population health. Journal of Epidemiology and Community Health, 2003.
Szwarcwald, C. L., Leal, M. d. C., Esteves-Pereira, A. P., Almeida, W. d. S. d., Frias, P. G. d., Damacena, G. N., Souza Júnior, P. R. B. d., Rocha, N. M., and Mullachery, P. M. H. Evaluation of data from the Brazilian Information System on Live Births (SINASC). Cadernos de Saude Publica vol. 35, pp. e00214918, 2019.
Published
29/09/2025
How to Cite
MORSOLETO, Ricardo; SILVA, Vinícius A.; CALIARI, Juliano de S.; MIRANDA, Simone Mara F.; FERREIRA, Hiran Nonato M. Prediction of Infant Mortality in Brazil using Machine Learning and Entity Matching on Brazilian Unified Health System's Data. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 13., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 113-120. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2025.247777.
