Industrial Paper: Large-scale Record Linkage of Web-based Place Entities
Abstract
Extracting data about entities from the Web has become commonplace in industry and academia alike. Web-based entities, however, are inherently noisy and therefore raise several normalization issues that must be addressed to keep a database clean. Record linkage, the detection of duplicate records that may come from multiple sources, is among the most critical of these issues. This paper presents a practical approach to record linkage in the places data domain at industrial scale: a model that reaches a normalized Gini coefficient of 0.92, and an architecture that supports large-scale processing.
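For readers unfamiliar with the metric, the sketch below shows one common way to compute a normalized Gini coefficient for a binary match/non-match classifier, using the identity Gini = 2 * AUC - 1. The abstract does not say how the reported 0.92 was computed, so this definition is an assumption, and the function names, labels, and scores are hypothetical illustrations rather than the paper's data or code.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def normalized_gini(labels, scores):
    """Normalized Gini for binary labels, assumed here to be 2 * AUC - 1."""
    return 2.0 * auc(labels, scores) - 1.0


# Hypothetical candidate pairs: 1 = same place, 0 = different places.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.95, 0.80, 0.70, 0.60, 0.30, 0.10]
print(normalized_gini(labels, scores))
```

Under this definition a value of 1 means every true duplicate pair is scored above every non-duplicate pair, and a value of 0 corresponds to random ranking. The pairwise loop is quadratic and only meant for illustration; a sort-based AUC would be preferable at scale.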