Scalable privacy-preserving record linkage: Evaluating MultiBit tree indexing in Atyimo

  • Victor Orrico, Universidade Federal da Bahia (UFBA)
  • Fernanda Eustáquio, Fundação Oswaldo Cruz (Fiocruz)
  • Bethânia Almeida, Fundação Oswaldo Cruz (Fiocruz)
  • Mirlei Silva, Universidade Federal da Bahia (UFBA)
  • Robespierre Pita, Universidade Federal da Bahia (UFBA) / Fundação Oswaldo Cruz (Fiocruz), https://orcid.org/0000-0002-0616-620X

Abstract


Privacy-preserving record linkage (PPRL) indexing techniques typically organize Bloom filters (BFs) into data structures that reduce unnecessary comparisons. However, widely used solutions such as Multibit Trees (MTB) often scale poorly with large datasets or high-dimensional BFs and then require parallel or distributed computation. This study explores the integration of the MTB algorithm into Atyimo, a publicly available Brazilian PPRL tool for merging large-scale administrative databases. In our experiments, we used both simulated and real-world data to evaluate how effectively Atyimo with MTB links routinely collected health records in Brazil. The results show that our Spark DataFrame-based solution builds robust index structures that preserve linkage accuracy and significantly reduce execution time compared to the baseline.
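
To illustrate how indexing Bloom filters can reduce unnecessary comparisons, the following is a minimal Python sketch and not the Atyimo implementation. It assumes bigram-based Bloom filters, Tanimoto similarity, and a filter on the number of set bits, which is one of the ideas multibit trees build on; a full multibit tree would also split each bucket by individual bit positions. All function names, the filter size, and the hash scheme are hypothetical choices made for this example.

```python
import hashlib


def bigrams(value):
    """Split a padded, lower-cased string into its set of character bigrams."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_filter(value, num_bits=64, num_hashes=2):
    """Encode a string as a Bloom filter stored in a Python int (illustrative sizes)."""
    bits = 0
    for gram in bigrams(value):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            position = int.from_bytes(digest[:8], "big") % num_bits
            bits |= 1 << position
    return bits


def popcount(bits):
    """Number of bits set to 1."""
    return bin(bits).count("1")


def tanimoto(a, b):
    """Tanimoto similarity: |A and B| / |A or B|."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def tanimoto_upper_bound(count_a, count_b):
    """Best similarity two filters can reach given only their set-bit counts."""
    largest = max(count_a, count_b)
    return min(count_a, count_b) / largest if largest else 1.0


def candidate_pairs(query_filters, indexed_filters, threshold):
    """Return (query index, indexed index, similarity) for pairs at or above threshold.

    Indexed filters are bucketed by set-bit count, and whole buckets are
    skipped when their upper bound cannot reach the threshold.
    """
    buckets = {}
    for idx, bf in enumerate(indexed_filters):
        buckets.setdefault(popcount(bf), []).append(idx)

    results = []
    for q_idx, query in enumerate(query_filters):
        q_count = popcount(query)
        for count, members in buckets.items():
            if tanimoto_upper_bound(q_count, count) < threshold:
                continue  # no filter in this bucket can match
            for idx in members:
                similarity = tanimoto(query, indexed_filters[idx])
                if similarity >= threshold:
                    results.append((q_idx, idx, similarity))
    return results
```

Because the bound never underestimates the true similarity, skipping a bucket cannot drop a pair that would have passed the threshold; the pruning only saves comparisons.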

Keywords: record linkage, Atyimo, indexing, multibit tree, Spark DataFrames


Published
29/09/2025
ORRICO, Victor; EUSTÁQUIO, Fernanda; ALMEIDA, Bethânia; SILVA, Mirlei; PITA, Robespierre. Scalable privacy-preserving record linkage: Evaluating MultiBit tree indexing in Atyimo. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 602-615. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247288.