Industrial Paper: Large-scale Record Linkage of Web-based Place Entities

  • Vinícius M. R. Cousseau In-Loco / UFPE
  • Luciano Barbosa UFPE

Abstract

Extracting data about entities from the Web has become commonplace in industry and academia alike. Web-based entities, however, are inherently noisy and introduce several normalization issues that must be addressed to maintain a clean database. Record linkage, the detection of duplicate records that may come from multiple sources, is one of the most critical of these issues. This paper presents a practical approach to record linkage in the places data domain at industrial scale: a model that reaches a normalized Gini coefficient of 0.92, and an architecture that supports large-scale processing.
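For readers unfamiliar with the metric, the following is a minimal sketch of how a normalized Gini coefficient can be computed for a binary match/non-match classifier. It assumes binary labels, in which case the normalized Gini equals 2 × AUC − 1; the function names `auc` and `normalized_gini` and the toy data are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (O(n^2), for clarity)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # matching pair ranked above non-matching pair
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


def normalized_gini(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Normalized Gini for binary labels: 2 * AUC - 1 (assumption: binary case)."""
    return 2.0 * auc(labels, scores) - 1.0


# Hypothetical toy data: 1 = the two records refer to the same place.
labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2]
print(normalized_gini(labels, scores))  # 1.0 for a perfect ranking
```

Under this definition, a value of 1.0 corresponds to a perfect ranking of matches above non-matches, 0.0 to a random ranking, and negative values to a ranking worse than random.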

Keywords: record linkage, entity resolution, web data, data integration

Published
07/10/2019
COUSSEAU, Vinícius M. R.; BARBOSA, Luciano. Industrial Paper: Large-scale Record Linkage of Web-based Place Entities. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 34., 2019, Fortaleza. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 181-186. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2019.8820.