DedupeGov: An Environment for Integrating Large Volumes of Data from Natural and Legal Entities in the Government Scope

  • Vitor Mangaravite Federal University of Minas Gerais (UFMG)
  • Marcos Carvalho Federal University of Minas Gerais (UFMG)
  • Luiz Cantelli Federal University of Minas Gerais (UFMG)
  • Lucas M. Ponce Federal University of Minas Gerais (UFMG)
  • Bruno Campoi Federal University of Minas Gerais (UFMG)
  • Gabriel Nunes Federal University of Minas Gerais (UFMG)
  • Alberto H. F. Laender Federal University of Minas Gerais (UFMG)
  • Marcos André Gonçalves Federal University of Minas Gerais (UFMG)

Abstract


Record Deduplication (RD) aims to identify instances that represent the same real-world entity in data repositories. In the government environment, the RD process facilitates the identification of irregularities and reduces the consumption of computing resources in data integration tasks. In this context, we propose a scalable, effective, and efficient platform for integrating large data repositories (on the order of millions of records) that unifies duplicate entities from multiple different sources. Our experimental results show a 21.8% reduction in the size of the original repository, with 99% accuracy and 95% recall in identifying duplicate records. The proposed architecture also proved efficient and scalable for large volumes of data, deduplicating a repository of more than 392 million records in about one hour, and it is easy to generalize to different types of entities.
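
To make the task concrete, the following is a minimal, illustrative sketch of record deduplication on a toy dataset. It is not the DedupeGov architecture, which the abstract does not detail. It assumes a generic pipeline of blocking, pairwise string similarity, and clustering of matched pairs with union-find. The field names (name, birth_year), the blocking key, and the 0.85 threshold are hypothetical choices made only for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def block_key(record):
    """Hypothetical blocking key: first letter of the name plus birth year."""
    return (normalize(record["name"])[:1], record["birth_year"])


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def deduplicate(records, threshold=0.85):
    """Return clusters of record indices judged to be the same entity."""
    parent = list(range(len(records)))

    # Blocking: only records sharing a key are compared, which avoids
    # the quadratic cost of comparing every pair in the repository.
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[block_key(record)].append(idx)

    # Pairwise comparison inside each block; matches are merged.
    for members in blocks.values():
        for a, b in combinations(members, 2):
            score = SequenceMatcher(
                None,
                normalize(records[a]["name"]),
                normalize(records[b]["name"]),
            ).ratio()
            if score >= threshold:
                parent[find(parent, a)] = find(parent, b)

    # Collect the connected components as clusters of duplicates.
    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(parent, idx)].append(idx)
    return sorted(clusters.values())


# Toy records (invented for illustration only).
records = [
    {"name": "Maria da Silva", "birth_year": 1980},
    {"name": "maria  da silva", "birth_year": 1980},
    {"name": "Maria da Sliva", "birth_year": 1980},
    {"name": "Marcos Pereira", "birth_year": 1980},
    {"name": "Maria da Silva", "birth_year": 1975},
]
clusters = deduplicate(records)
print(clusters)
```

On this toy input, the three spellings of the first person fall into one cluster, while the record with a different birth year stays separate because blocking never compares it with the others. A production system would need more robust blocking keys and similarity functions than this sketch uses.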

Keywords: Record Linkage, Data management, Data systems

Published
2022-09-19
MANGARAVITE, Vitor; CARVALHO, Marcos; CANTELLI, Luiz; PONCE, Lucas M.; CAMPOI, Bruno; NUNES, Gabriel; LAENDER, Alberto H. F.; GONÇALVES, Marcos André. DedupeGov: An Environment for Integrating Large Volumes of Data from Natural and Legal Entities in the Government Scope. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 37., 2022, Búzios. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 90-102. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2022.224655.