Data Migration Pipeline Development Approach Using Artificial Intelligence

Abstract


Large Language Models (LLMs) and artificial intelligence are increasingly used as tools to enhance productivity. Given the significant growth in data availability and the need to explore that data, data migration has become a pressing issue. This work presents an approach for generating code for data migration and transformation between relational databases, using LLMs and open-source tools that integrate the application with artificial intelligence models. The approach proved viable for generating data transformation pipelines, but it also exposed certain challenges with LLMs, so a specialist is still needed in the migration process.
Keywords: data engineering, artificial intelligence, natural language processing

Published
2024-11-27
SANTOS, José Vítor Donassolo Correa dos; KUSZERA, Evandro Miguel. Data Migration Pipeline Development Approach Using Artificial Intelligence. In: LATIN AMERICAN CONGRESS ON FREE SOFTWARE AND OPEN TECHNOLOGIES (LATINOWARE), 21., 2024, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 41-48. DOI: https://doi.org/10.5753/latinoware.2024.245349.