An Experimental Analysis of the Impact of Attribute Selection on Entity Resolution Processes
Abstract
Entity Resolution (ER) is the task of identifying duplicate records in datasets through a multi-step process. Attribute selection is common to these steps, yet no experimental work has evaluated its impact on the complete ER process. Such an evaluation is important because ER effectiveness varies according to the selected attributes. We address this gap by performing experiments over real and synthetic datasets from different domains. The results show that attribute selection affects ER effectiveness by up to 92%.
