Experimental Evaluation of Error Detectors in Relational Datasets
Abstract
Data cleaning is crucial to prevent inconsistencies in the data. One of its fundamental steps is error detection. There are many methods and systems to detect errors. However, comparisons between these options are limited and often rely on heterogeneous datasets. This study evaluates different publicly available tools, considering various scenarios in a controlled and homogeneous environment. The results show that machine learning-based tools outperform older methods in error detection. However, this advantage is significant only when the error rate is relatively high.
Keywords:
Error detection, Experimental evaluation, Data cleaning
References
Abedjan, Z., Chu, X., Deng, D., Fernandez, R. C., Ilyas, I. F., Ouzzani, M., Papotti, P., Stonebraker, M., and Tang, N. (2016). Detecting data errors: Where are we and what needs to be done? PVLDB, 9(12):993–1004.
Arocena, P. C., Glavic, B., Mecca, G., Miller, R. J., Papotti, P., and Santoro, D. (2015). Messing up with BART: error generation for evaluating data-cleaning algorithms. PVLDB, 9(2):36–47.
Ilyas, I. F. and Chu, X. (2019). Data Cleaning. Association for Computing Machinery, New York, NY, USA.
Mahdavi, M., Abedjan, Z., Castro Fernandez, R., Madden, S., Ouzzani, M., Stonebraker, M., and Tang, N. (2019). Raha: A configuration-free error detection system. In SIGMOD, pages 865–882.
Mariet, Z., Harding, R., Madden, S., et al. (2016). Outlier detection in heterogeneous datasets using automatic tuple expansion. Technical report, MIT CSAIL.
Neutatz, F., Mahdavi, M., and Abedjan, Z. (2019). ED2: A case for active learning in error detection. In CIKM, pages 2249–2252.
Rekatsinas, T., Chu, X., Ilyas, I. F., and Ré, C. (2017). Holoclean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201.
Published
2023-09-25
How to Cite
MEDINA, William G. R.; PENA, Eduardo H. M.; KASTER, Daniel S. Experimental Evaluation of Error Detectors in Relational Datasets. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 38., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 342-347. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2023.233429.
