Analysis of the Impact of the Dataset Generator on Data Deduplication Experiments
Abstract
Tools that generate synthetic datasets are the only option for evaluating data deduplication algorithms when real datasets are not available. However, the evaluation results may be affected by the variety of parameters such tools offer and by the levels chosen for them. Our goal is to determine which parameters and levels have the greatest impact on the results of deduplication experiments. To that end, we apply factorial experimental designs to datasets created with the most widely used tool. The results show that two parameters explain the largest share of the variation in the results.
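To illustrate the kind of analysis a factorial design supports, the following is a minimal sketch of the allocation of variation in a two-factor, two-level design. It is not the experimental setup of this work: the factor names "A" and "B", the function name, and the response values are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def allocation_of_variation(responses):
    """Fraction of total variation explained by A, B and their interaction.

    `responses` maps a pair of coded levels (a, b), each -1 or +1,
    to the measured response of that run (for example, an F1 score).
    """
    n = len(responses)
    effects = {"A": 0.0, "B": 0.0, "AB": 0.0}
    for (a, b), y in responses.items():
        # Each effect is the signed average of the responses.
        effects["A"] += a * y / n
        effects["B"] += b * y / n
        effects["AB"] += a * b * y / n

    # Sum of squares attributed to each effect.
    sum_squares = {name: n * q ** 2 for name, q in effects.items()}
    total = sum(sum_squares.values())
    if total == 0:
        # All runs gave the same response, so there is nothing to explain.
        return {name: 0.0 for name in sum_squares}
    return {name: ss / total for name, ss in sum_squares.items()}


# Hypothetical responses for the four runs of the design.
example = {(-1, -1): 15, (1, -1): 45, (-1, 1): 25, (1, 1): 75}
shares = allocation_of_variation(example)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these made-up numbers, factor A accounts for roughly three quarters of the variation. A factor with a large share deserves careful control when generating datasets, while one with a negligible share can be fixed at a convenient level.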
