Towards Auditable and Intelligent Privacy-Preserving Record Linkage

  • Thiago Nóbrega, Universidade Federal de Campina Grande
  • Carlos Eduardo S. Pires, Universidade Federal de Campina Grande
  • Dimas Cassimiro Nascimento, Universidade Federal de Campina Grande


Privacy-Preserving Record Linkage (PPRL) integrates private or sensitive data held by different parties. It aims to identify records (e.g., persons or objects) that represent the same real-world entity across private data sources held by different custodians. Due to recent laws and regulations (e.g., the General Data Protection Regulation), PPRL approaches are increasingly in demand in real-world application areas such as health care, credit analysis, public policy evaluation, and national security. As a result, the PPRL process must address both efficacy (linkage quality) and privacy. For instance, it may need to run over data sources such as a database containing personal information from governmental income distribution and assistance programs, linking entities accurately while protecting the privacy of the information. In this context, our work presents contributions that improve the privacy and quality capabilities of PPRL. Specifically, we improve linkage quality and simplify the process by employing Machine Learning techniques to decide whether two records represent the same entity, and we enable the auditability of the computations performed during PPRL.

Keywords: PPRL, Entity matching, Data privacy, Data security, Machine learning


NÓBREGA, Thiago; PIRES, Carlos Eduardo S.; NASCIMENTO, Dimas Cassimiro. Towards Auditable and Intelligent Privacy-Preserving Record Linkage. In: CONCURSO DE TESES E DISSERTAÇÕES (CTDBD) - SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 38., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 270-284. DOI: