Engineering a LLM-Based Data Ingestion Process for Semi-Structured Documents: An Empirical Comparison with Human-Centered Workflows

  • Sandrelly L. Coutinho UFPE
  • Paulo J. L. Adeodato UFPE
  • Bruno C. de Souza UFPE
  • Carlos E. de S. Fontes UFPE / UNINASSAU

Abstract


Data-driven software systems increasingly rely on ingesting information from semi-structured documents, whose manual processing is costly and error-prone. Although Large Language Models (LLMs) can interpret such documents, their systematic integration into reliable ingestion processes remains underexplored. This paper proposes an LLM-driven data ingestion process from a Software Engineering perspective, distinguishing prompt-based optimization from model training. Prompts are treated as versioned engineering artifacts and refined iteratively through automated comparison with ground truth under stateless execution. A controlled experiment evaluates multiple LLM backends in terms of extraction quality, cost, and generalization.
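
The refinement loop summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `PromptVersion`, `field_accuracy`, `evaluate`, `select_best`, and `fake_backend` are hypothetical, the exact-match field score is an assumed stand-in for whatever extraction-quality metric the paper uses, and `fake_backend` replaces a real LLM call so the example runs offline. Each call receives only the rendered prompt, which is one plausible reading of "stateless execution".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Fields = Dict[str, str]
Backend = Callable[[str], Fields]  # rendered prompt in, extracted fields out


@dataclass(frozen=True)
class PromptVersion:
    """A prompt treated as a versioned artifact (hypothetical structure)."""
    version: str
    template: str  # must contain the literal placeholder "{document}"


def field_accuracy(predicted: Fields, expected: Fields) -> float:
    """Share of ground-truth fields reproduced exactly (assumed metric)."""
    if not expected:
        return 1.0
    hits = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return hits / len(expected)


def evaluate(prompt: PromptVersion, extract: Backend,
             dataset: List[Tuple[str, Fields]]) -> float:
    """Mean field accuracy of one prompt version over labeled documents."""
    scores = []
    for document, truth in dataset:
        # Stateless: the backend sees only this rendered prompt, no history.
        rendered = prompt.template.replace("{document}", document)
        scores.append(field_accuracy(extract(rendered), truth))
    return sum(scores) / len(scores) if scores else 0.0


def select_best(candidates: List[PromptVersion], extract: Backend,
                dataset: List[Tuple[str, Fields]]) -> Tuple[float, PromptVersion]:
    """Score every candidate and return the (score, prompt) with the highest score."""
    scored = [(evaluate(p, extract, dataset), p) for p in candidates]
    return max(scored, key=lambda pair: pair[0])


def fake_backend(rendered_prompt: str) -> Fields:
    """Offline stand-in for an LLM: parses 'key: value' lines from the prompt."""
    normalize = "lowercase" in rendered_prompt
    fields: Fields = {}
    for line in rendered_prompt.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            key = key.strip()
            fields[key.lower() if normalize else key] = value.strip()
    return fields
```

In this sketch, comparing two prompt versions means calling `select_best([v1, v2], fake_backend, dataset)` with a list of `(document, ground_truth)` pairs; swapping `fake_backend` for a function that calls a real model would leave the evaluation code unchanged.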

Keywords: Large Language Models, Data Ingestion, Semi-Structured Documents, Prompt Engineering, Empirical Software Engineering

Published
11/05/2026
COUTINHO, Sandrelly L.; ADEODATO, Paulo J. L.; SOUZA, Bruno C. de; FONTES, Carlos E. de S. Engineering a LLM-Based Data Ingestion Process for Semi-Structured Documents: An Empirical Comparison with Human-Centered Workflows. In: CONGRESSO IBERO-AMERICANO EM ENGENHARIA DE SOFTWARE (CIBSE), 29., 2026, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 181-195.