Improvements to the Blocking Process for Entity Resolution Based on the Relevance of the Terms

  • Laís Soares Caldeira Federal University of Ouro Preto (UFOP)
  • Anderson Almeida Ferreira Federal University of Ouro Preto (UFOP)

Abstract


Entity Resolution is a task commonly faced in the data integration process. Because deciding which instances belong to the same entity requires a quadratic number of pairwise comparisons, a more efficient way of performing these comparisons is needed. To mitigate this problem, blocking and block-processing techniques have been applied to improve efficiency. In this work, we propose options for choosing terms based on their relevance to the dataset in the blocking and block-processing phases. We assess our proposal by comparing it against relevant works available in the literature. The results show that our proposal reduces the run time by half, increasing efficiency.
Keywords: Entity resolution, data integration, data blocking, indexing

Published
2018-08-25
CALDEIRA, Laís Soares; FERREIRA, Anderson Almeida. Improvements to the Blocking Process for Entity Resolution Based on the Relevance of the Terms. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 33., 2018, Rio de Janeiro. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 61-72. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2018.22219.