Industrial Paper: Large-scale Record Linkage of Web-based Place Entities

Vinícius M. R. Cousseau (In-Loco / UFPE); Luciano Barbosa (UFPE)

DOI: https://doi.org/10.5753/sbbd.2019.8820

Abstract

Extracting data about entities from the Web has become commonplace in industry and academia alike. Web-based entities, however, are inherently noisy and therefore introduce several normalization issues that must be addressed to maintain a clean database. Record linkage, the detection of duplicate records that may come from multiple sources, is one of the most critical of those issues. This paper presents a practical approach to solving the record linkage problem in the places data domain at an industrial scale, describing both a model that reaches a normalized Gini coefficient of 0.92 and an architecture that supports large-scale processing.

Keywords: record linkage, entity resolution, web data, data integration
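
The abstract reports model quality as a normalized Gini coefficient. As a minimal sketch of what that metric measures, and assuming binary match / non-match labels (the abstract does not state how the authors computed it), the normalized Gini coefficient is commonly obtained as 2·AUC − 1, where AUC is the area under the ROC curve. The function names and the toy labels and scores below are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (match, non-match) pairs in which the match
    receives the higher score; ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one match and one non-match")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def normalized_gini(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Normalized Gini for binary labels, assumed here to be 2 * AUC - 1."""
    return 2.0 * roc_auc(labels, scores) - 1.0


# Hypothetical candidate pairs: 1 = same place, 0 = different places.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.95, 0.80, 0.70, 0.60, 0.30, 0.10]
print(normalized_gini(toy_labels, toy_scores))
```

Under this definition, a value of 1.0 means every true match is scored above every non-match, 0.0 corresponds to a random ranking, and negative values indicate a ranking worse than random.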
