Industrial Paper: Large-scale Record Linkage of Web-based Place Entities

  • Vinícius M. R. Cousseau In-Loco / UFPE
  • Luciano Barbosa UFPE


Extracting data about entities from the Web has become commonplace in industry and academia alike. Web-based entities, however, are inherently noisy and therefore introduce several normalization issues that must be addressed in order to maintain a clean database. Record linkage, the detection of duplicate records that may come from multiple sources, is one of the most critical of those issues. This paper presents a practical approach to the record linkage problem in the places data domain at an industrial scale, comprising both a model that reaches a normalized Gini coefficient of 0.92 and an architecture that supports large-scale processing.

Keywords: record linkage, entity resolution, web data, data integration


