An Efficient Training Strategy for Genetic Programming Applied to Record Deduplication

  • Davi Guimarães da Silva Federal Institute of Education Science and Technology of Pará
  • Moisés Gomes de Carvalho Federal University of Amazonas
  • Duivilly Brito Federal University of Amazonas

Abstract


Genetic Programming (GP) is a machine learning technique effectively used in the record deduplication problem. GP adopts a very expensive training step that requires that all records in a database be compared against each other several times. In this paper, we propose a novel approach for training step based on clustering technique combined with a sliding window. This combination aims at minimize the number of comparisons required in the training step without affecting its results. Our experiments using real datasets show that it is possible to reduce the time cost of the training step up to 72.8% compared to the GP state of the art approach without a significant impact in the quality of generated solutions.
Keywords: Record deduplication, genetic programming, grouping technique, sliding window

References

Baeza-Yates, R. A. and Ribeiro-Neto, B. A. (1999). Modern Information Retrieval. ACM Press/Addison-Wesley., New York, NY, USA. p. 39-48. (KDD 03).

Bianco;, G. D. et al. (2013). Tuning large scale deduplication with reduced effort. In Proceedings International Conference on Scientific and Statistical Database Management, ACM, new york., p. 18:1-18:12. (SSDBM).

Bianco;, G. D. et al. (2016). A practical and effective sampling selection strategy for large scale deduplication. IEEE International Conference on Data Engineering, p. 1518-1519.

Bilenko, M.; Mooney, R. J. (2003). Adaptive Duplicate Detection Using Learnable String Similarity Measures. In: ACM., New York, NY, USA. p. 39-48. (KDD 03).

Carvalho, M. G. et al. (2008a). The impact of parameter setup on a genetic programming approach to record deduplication. S.B.C.; Brazilian Symp. Databases, p.91-105.

Carvalho, M. G. et al. (2008b). Replica identification using genetic programming. In: ACM Symposium on Applied Computing., p. 1801-1806. Carvalho, M. G. et al. (2009). Evolutionary approaches to data integration related problems. Tese. Universidade Federal de Minas Gerais, p. 66-81.

Carvalho, M. G. et al. (2012). A genetic programming approach to record deduplication. IEEE Transactions on Knowledge and Data Engineering; NJ, USA., v.24, p. 399-412.

Fellegi, I. P; Sunter, A. B. (1969). A theory for record linkage. Journal of the American Statistical Association., [S.l.], v.64, n.328, p. 1183-1210.

Jain, A. K. et al. (1999). Data clustering: A review. ACM Computing Surveys. 31(3)8.

Koudas, N. et al. (2006). Record linkage: Similarity measures and algorithms. ACM International Conference on Management of Data., p. 802-803, Chicago, USA.

Ma, K. et al. (2015). Large-scale schema-free data deduplication approach with adaptive sliding window using mapreduce. The Computer Journal, 58, n. 11, p.3187-3201.

Ziv, J. et al. (1977). A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3) pp. 337-343.
Published
2018-08-25
DA SILVA, Davi Guimarães; DE CARVALHO, Moisés Gomes; BRITO, Duivilly. An Efficient Training Strategy for Genetic Programming Applied to Record Deduplication. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 33. , 2018, Rio de Janeiro. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018 . p. 265-270. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2018.22241.