Engineering an LLM-Based Data Ingestion Process for Semi-Structured Documents: An Empirical Comparison with Human-Centered Workflows
Abstract
Data-driven software systems increasingly rely on ingesting information from semi-structured documents, whose manual processing is costly and error-prone. Although Large Language Models (LLMs) can interpret such documents, their systematic integration into reliable ingestion processes remains underexplored. This paper proposes an LLM-driven data ingestion process from a Software Engineering perspective, distinguishing prompt-based optimization from model training. Prompts are treated as versioned engineering artifacts and refined iteratively through automated comparison with ground truth under stateless execution. A controlled experiment evaluates multiple LLM backends in terms of extraction quality, cost, and generalization.
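As an illustration only, the idea of versioned prompts scored automatically against ground truth could be sketched as follows. This is not the paper's implementation: the names `PromptVersion`, `field_accuracy`, `evaluate`, `select_best`, and `fake_backend` are hypothetical, exact-match field accuracy is an assumed stand-in for the extraction-quality measure, and the backend is a deterministic stub in place of a real LLM call. Statelessness is modeled by making each extraction a pure function of the prompt template and the document.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
# A stateless extractor: output depends only on (prompt template, document).
Extractor = Callable[[str, str], Record]


@dataclass(frozen=True)
class PromptVersion:
    """A prompt handled as a versioned artifact (hypothetical structure)."""
    version: str
    template: str


def field_accuracy(extracted: Record, truth: Record) -> float:
    """Fraction of ground-truth fields reproduced exactly (assumed metric)."""
    if not truth:
        return 1.0
    hits = sum(1 for key, value in truth.items() if extracted.get(key) == value)
    return hits / len(truth)


def evaluate(prompt: PromptVersion, extract: Extractor,
             corpus: List[Tuple[str, Record]]) -> float:
    """Mean field accuracy of one prompt version over a labeled corpus."""
    if not corpus:
        return 0.0
    scores = [field_accuracy(extract(prompt.template, doc), truth)
              for doc, truth in corpus]
    return sum(scores) / len(scores)


def select_best(prompts: List[PromptVersion], extract: Extractor,
                corpus: List[Tuple[str, Record]]) -> PromptVersion:
    """Keep the prompt version with the highest score against ground truth."""
    return max(prompts, key=lambda p: evaluate(p, extract, corpus))


def fake_backend(template: str, document: str) -> Record:
    """Stub standing in for an LLM call: parses 'key: value' lines and
    strips whitespace from values only if the template mentions 'trim'."""
    record: Record = {}
    for line in document.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            record[key.strip()] = value.strip() if "trim" in template else value
    return record


CORPUS = [("name: Ana\ntotal: 42", {"name": "Ana", "total": "42"})]
PROMPTS = [
    PromptVersion("v1", "Extract every field as key/value pairs."),
    PromptVersion("v2", "Extract every field and trim whitespace."),
]

best = select_best(PROMPTS, fake_backend, CORPUS)
```

With this toy corpus, `best` is the `v2` prompt, because only that template makes the stub return values that match the ground truth exactly. In a real setting, `fake_backend` would be replaced by a call to each LLM backend under comparison.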
