Evaluating Influential Factors over Machine Learning Algorithms in the Classification Stage of Entity Resolution

  • Milena Macedo Santos, Federal University of the Agreste of Pernambuco
  • Dimas Cassimiro Nascimento, Federal University of the Agreste of Pernambuco / Federal University of Campina Grande

Abstract


Entity resolution is the process of identifying pairs of records in databases that correspond to the same real-world entity. In this work, we evaluate several Machine Learning (ML) classification algorithms in the context of entity resolution: AdaBoost, MLP, SVM, Random Forest, and XGBoost. In evaluating these algorithms, we analyze the impact of balanced and unbalanced training sets on their efficacy in the classification stage. In our experiments, Random Forest produced the most promising results on the evaluated datasets, and XGBoost also presented competitive results.
Keywords: Entity Resolution, Machine Learning, Random Forest, XGBoost
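
To make the balanced versus unbalanced training-set comparison concrete, the following is a minimal sketch under stated assumptions. The abstract does not say how training sets were balanced or which efficacy metrics were used, so random undersampling of non-matches and precision/recall/F1 are assumptions chosen for illustration. The function names `undersample` and `precision_recall_f1` and the toy similarity scores are hypothetical.

```python
import random


def undersample(features, labels, ratio=1.0, seed=0):
    """Keep every match (label 1) and a random subset of non-matches (label 0).

    The number of non-matches kept is at most ratio * (number of matches).
    This is one possible balancing strategy, assumed here for illustration.
    """
    rng = random.Random(seed)
    match_idx = [i for i, y in enumerate(labels) if y == 1]
    non_match_idx = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(non_match_idx), int(ratio * len(match_idx)))
    kept = sorted(match_idx + rng.sample(non_match_idx, n_keep))
    return [features[i] for i in kept], [labels[i] for i in kept]


def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the match class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical similarity scores for candidate record pairs;
    # non-matches heavily outnumber matches, as is typical after blocking.
    X = [[0.95], [0.88], [0.10], [0.22], [0.31], [0.15], [0.05], [0.40]]
    y = [1, 1, 0, 0, 0, 0, 0, 0]

    X_bal, y_bal = undersample(X, y, ratio=1.0, seed=42)
    print("unbalanced size:", len(y), "balanced size:", len(y_bal))
    print("metrics:", precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```

Either version of the training set could then be passed to any of the evaluated classifiers, with the metrics computed on a held-out test set that keeps its original class distribution.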

Published
2023-09-25
SANTOS, Milena Macedo; NASCIMENTO, Dimas Cassimiro. Evaluating Influential Factors over Machine Learning Algorithms in the Classification Stage of Entity Resolution. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 38., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 63-75. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2023.232401.