Evaluation of ETL Processes for Data Analysis Using a Graph-Oriented DBMS

  • Jones Dhyemison Quito de Oliveira UFG
  • Leonardo Andrade Ribeiro UFG

Abstract


Duplicates are a perennial problem in databases. This type of inconsistency violates integrity constraints and may compromise the results of data analysis. A graph-oriented DBMS can run queries over a similarity graph and thus identify potential duplicates. This approach requires an ETL process that extracts data from relational sources, transforms it into a similarity graph, and loads the graph into a graph-oriented DBMS. This paper presents a performance comparison between two ETL processes for this purpose. The first computes similarities within the relational DBMS itself. The second computes them with a specialized algorithm. The results show that the specialized algorithm outperforms the purely relational approach by orders of magnitude.

Keywords: DBMS, ETL process, duplicates, similarity graph, algorithm
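
The transform step described in the abstract, turning records into a similarity graph, can be illustrated with a minimal sketch. It assumes Jaccard similarity over character q-grams and a naive all-pairs comparison; the abstract names neither the similarity function nor the specialized algorithm, so the functions, the threshold, and the sample records below are hypothetical and are not the authors' method.

```python
from itertools import combinations


def qgrams(text, q=2):
    """Return the set of character q-grams of a lowercased, padded string."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_edges(records, threshold=0.5, q=2):
    """Transform step (assumed form): compare every pair of records and
    keep the pairs whose similarity reaches the threshold as graph edges.
    This is the naive quadratic approach, not a specialized join algorithm."""
    grams = {rid: qgrams(value, q) for rid, value in records.items()}
    edges = []
    for left, right in combinations(sorted(grams), 2):
        score = jaccard(grams[left], grams[right])
        if score >= threshold:
            edges.append((left, right, round(score, 3)))
    return edges


# Hypothetical rows standing in for the extract step.
records = {1: "John Smith", 2: "Jon Smith", 3: "Mary Jones"}
edges = similarity_edges(records)
print(edges)
```

Each resulting tuple is one weighted edge between two record identifiers. In the load step these edges would be written to the graph-oriented DBMS, which is omitted here because the abstract does not name a specific system or API.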

Published
2019-11-22
DE OLIVEIRA, Jones Dhyemison Quito; RIBEIRO, Leonardo Andrade. Avaliação de Processos ETL para Análise de Dados usando SGBD Orientado a Grafos. In: REGIONAL SCHOOL ON INFORMATICS OF GOIÁS (ERI-GO), 7., 2019, Goiânia. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 61-74.