Experimental Evaluation of Error Detectors in Relational Datasets

  • William G. R. Medina, University of Londrina
  • Eduardo H. M. Pena, Federal University of Technology – Paraná
  • Daniel S. Kaster, University of Londrina

Abstract


Data cleaning is crucial for preventing data inconsistencies, and error detection is one of its fundamental steps. Many methods and systems exist for detecting errors, but comparisons among them are limited and often rely on heterogeneous datasets. This study evaluates several publicly available tools across various scenarios in a controlled, homogeneous environment. The results show that machine learning-based tools outperform older methods at error detection, although this advantage is significant only when the error rate is relatively high.
Keywords: Error detection, Experimental evaluation, Data cleaning

Published
2023-09-25
MEDINA, William G. R.; PENA, Eduardo H. M.; KASTER, Daniel S. Experimental Evaluation of Error Detectors in Relational Datasets. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 38., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 342-347. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2023.233429.