Optimizing Record Linkage Parameters with Genetic Algorithms for Health Data Integration
Abstract
Record linkage is widely used to integrate administrative health databases, but its performance depends on appropriate parameterization and decision thresholds. We propose optimizing CIDACS-RL parameters with genetic algorithms to explore the parameter space efficiently. In experiments linking Brazilian live birth (SINASC) and mortality (SIM) records against a labeled dataset, the optimized parameters reduced false positives by about 90% while maintaining high recall. Precision increased from 0.70 to 0.96 and accuracy from 0.89 to 0.99, with consistent improvements across all southern Brazilian states analyzed. These results suggest that parameter optimization can improve linkage quality and the reliability of large-scale health data integration.
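As a rough illustration of the idea, the sketch below runs a small genetic algorithm over two linkage parameters and scores each candidate against labeled pairs. Everything in it is an assumption made for illustration: the abstract does not list the CIDACS-RL parameters or the fitness function used, so the example uses a hypothetical scorer with one field weight (`w_name`) and one decision threshold, synthetic similarity data, and F1 as the fitness.

```python
import random


def make_pairs(rng, n=400):
    """Synthetic labeled candidate pairs: (name_sim, dob_sim, is_match)."""
    pairs = []
    for i in range(n):
        is_match = i % 4 == 0  # one in four candidate pairs is a true match
        if is_match:
            pairs.append((rng.uniform(0.6, 1.0), rng.uniform(0.7, 1.0), True))
        else:
            pairs.append((rng.uniform(0.0, 0.8), rng.uniform(0.0, 0.9), False))
    return pairs


def f1(params, pairs):
    """Fitness (assumed): F1 of a weighted-similarity scorer with a threshold."""
    w_name, threshold = params
    tp = fp = fn = 0
    for name_sim, dob_sim, is_match in pairs:
        score = w_name * name_sim + (1.0 - w_name) * dob_sim
        linked = score >= threshold
        if linked and is_match:
            tp += 1
        elif linked:
            fp += 1
        elif is_match:
            fn += 1
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def clamp(x):
    """Keep a gene inside the valid [0, 1] range."""
    return min(1.0, max(0.0, x))


def evolve(pairs, rng, baseline, pop_size=30, generations=40, mutation_sd=0.05):
    """Elitist GA: uniform crossover plus Gaussian mutation on (w_name, threshold)."""
    # Seed the population with the baseline so the result is never worse than it.
    population = [baseline] + [(rng.random(), rng.random()) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: f1(p, pairs), reverse=True)
        elite = ranked[: pop_size // 5]  # the top fifth survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            mother, father = rng.sample(elite, 2)
            child = tuple(
                clamp(rng.choice(genes) + rng.gauss(0.0, mutation_sd))
                for genes in zip(mother, father)
            )
            children.append(child)
        population = elite + children
    return max(population, key=lambda p: f1(p, pairs))


rng = random.Random(0)  # fixed seed so the run is repeatable
pairs = make_pairs(rng)
baseline = (0.5, 0.5)  # hypothetical default parameters
best = evolve(pairs, rng, baseline)
print("baseline F1:", round(f1(baseline, pairs), 3))
print("optimized F1:", round(f1(best, pairs), 3), "params:", best)
```

Because the elite is carried over unchanged, the best fitness never decreases between generations. In practice the optimized parameters would still need to be checked on labeled pairs held out from the search, since the fitness here is measured on the same data used to evolve them.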
Published
01/06/2026
How to Cite
PITA, Pablo L. et al. Optimizing Record Linkage Parameters with Genetic Algorithms for Health Data Integration. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 26., 2026, Ouro Preto/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 681-692. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2026.21436.
