Design and Implementation of a DataOps-Based Asynchronous Pipeline for Large-Scale Data Ingestion

  • Leonardo Afonso Amorim UFG
  • Vinicius Alboneti Aguiar UFG
  • Sávio Salvarino Teles de Oliveira UFG
  • Arlindo Rodrigues Galvão Filho UFG
  • Ricardo Costa UFG
  • Marcos Prado Engineering Brasil

Abstract


This paper presents the design and validation of an asynchronous pipeline, built on data operations (DataOps) practices, that optimizes large-scale data ingestion into cloud data warehouses. The solution addresses ingestion bottlenecks caused by large file transfers by combining strategic file fragmentation with parallel processing on serverless services, including Google Cloud Functions and Cloud Run. Rather than introducing a new architectural model, the contribution lies in demonstrating the practical effectiveness of integrating these cloud-native techniques under real-world conditions. Experimental results show a 52.8x speed-up in ingestion time, reducing the processing of a 2 TB dataset from 11 days (264 hours) to 5 hours. The pipeline also incorporates Infrastructure as Code (IaC), Continuous Integration/Continuous Deployment (CI/CD), and automated monitoring, delivering a scalable, reproducible, and cost-efficient ingestion strategy suited to enterprise data workflows.
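
The fragmentation and parallel processing described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (fragment, load_fragment, ingest, speedup), the fragment size, and the concurrency limit are all hypothetical, and load_fragment only simulates a serverless worker with a short sleep instead of calling Cloud Functions, Cloud Run, or BigQuery.

```python
import asyncio


def fragment(data: bytes, max_bytes: int) -> list[bytes]:
    """Split newline-delimited data into fragments of roughly max_bytes,
    cutting only at line boundaries so that no record is split."""
    fragments, current, size = [], [], 0
    for line in data.splitlines(keepends=True):
        if current and size + len(line) > max_bytes:
            fragments.append(b"".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        fragments.append(b"".join(current))
    return fragments


async def load_fragment(chunk: bytes) -> int:
    """Hypothetical stand-in for a serverless worker loading one fragment.
    Assumption: a real worker would call the warehouse's load API here."""
    await asyncio.sleep(0.01)  # simulated network and load latency
    return chunk.count(b"\n")  # number of rows "loaded"


async def ingest(data: bytes, max_bytes: int, max_parallel: int) -> int:
    """Load all fragments concurrently, capped at max_parallel at a time."""
    limit = asyncio.Semaphore(max_parallel)

    async def guarded(chunk: bytes) -> int:
        async with limit:
            return await load_fragment(chunk)

    counts = await asyncio.gather(
        *(guarded(chunk) for chunk in fragment(data, max_bytes))
    )
    return sum(counts)


def speedup(before_hours: float, after_hours: float) -> float:
    """Ratio of the original ingestion time to the improved one."""
    return before_hours / after_hours


if __name__ == "__main__":
    sample = b"".join(b"row%d\n" % i for i in range(1000))
    rows = asyncio.run(ingest(sample, max_bytes=256, max_parallel=8))
    print(rows, "rows loaded")
    print("speed-up:", speedup(11 * 24, 5))
```

The speedup helper reproduces the arithmetic behind the reported figure: 11 days is 264 hours, and 264 / 5 = 52.8. The semaphore is one simple way to bound concurrency; in a serverless deployment that bound would more likely come from platform instance limits.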

Keywords: DataOps, Asynchronous Pipeline, Large-Scale Data Ingestion, Serverless Computing, BigQuery

Published
4 December 2025
AMORIM, Leonardo Afonso; AGUIAR, Vinicius Alboneti; OLIVEIRA, Sávio Salvarino Teles de; GALVÃO FILHO, Arlindo Rodrigues; COSTA, Ricardo; PRADO, Marcos. Design and Implementation of a DataOps-Based Asynchronous Pipeline for Large-Scale Data Ingestion. In: ESCOLA REGIONAL DE INFORMÁTICA DE GOIÁS (ERI-GO), 13., 2025, Luziânia/GO. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 149-156. DOI: https://doi.org/10.5753/erigo.2025.17069.