An Evaluation of Efficiency and Effectiveness of the Combination of Techniques for Data Deduplication

  • Levy de Souza Silva, Federal University of Minas Gerais
  • Dimas Cassimiro Nascimento Filho, Federal Rural University of Pernambuco
  • Mirella M. Moro, Federal University of Minas Gerais, https://orcid.org/0000-0002-0545-2001

Abstract


Data deduplication is the task of identifying and eliminating duplicate records in a single database. It is a complex process that involves several steps, including choosing a blocking key, a similarity function, and an indexing method. Several approaches exist for each of these steps. In this context, the objective of this work is to find the best combination of such techniques, aiming to improve both the efficiency and the effectiveness of the deduplication process as a whole. To this end, we present an experimental evaluation using real and artificial datasets. The results point to distinct combinations, each performing better in specific situations.
Keywords: Data Deduplication
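
To make the three steps named in the abstract concrete, the following is a minimal sketch of how a blocking key, an indexing method, and a similarity function fit together. It is an illustration under stated assumptions, not the configuration evaluated in the paper: the `name` field, the three-letter prefix key, the 0.85 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity function are all hypothetical choices made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the lowercased name.
    return record["name"].strip().lower()[:3]


def similarity(a, b):
    # Stand-in similarity function: difflib's ratio, a value in [0, 1].
    # Any other string similarity measure could be swapped in here.
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def deduplicate(records, threshold=0.85):
    # Indexing step (standard blocking): group record indices by blocking key.
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)

    # Comparison step: only pairs that share a block are compared,
    # which avoids the quadratic all-pairs comparison.
    matches = []
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            if similarity(records[i], records[j]) >= threshold:
                matches.append((i, j))
    return matches


# Toy records, invented for illustration only.
records = [
    {"name": "Maria Silva"},
    {"name": "Maria Sylva"},
    {"name": "Joao Souza"},
    {"name": "Mario Santos"},
]
pairs = deduplicate(records)
print(pairs)
```

Each of the three functions can be replaced independently, which is what makes the number of possible combinations large: the blocking key affects how many candidate pairs are generated (efficiency), while the similarity function and threshold affect which pairs are declared duplicates (effectiveness).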

Published
2017-10-02
SILVA, Levy de Souza; NASCIMENTO FILHO, Dimas Cassimiro; MORO, Mirella M. An Evaluation of Efficiency and Effectiveness of the Combination of Techniques for Data Deduplication. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 32., 2017, Uberlândia/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2017. p. 160-171. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2017.170764.