Evaluation of Entry-Level Open-Source Large Language Models for Information Extraction from Digitized Documents
Abstract
The rise of Large Language Models (LLMs) has transformed the field of natural language processing (NLP), offering a wide range of proprietary and open-source models that vary significantly in size and complexity, commonly measured in billions of parameters. While larger models excel at complex tasks such as summarization and creative text generation, smaller models are well suited to simpler tasks such as document classification and information extraction from unstructured data. This study evaluates open-source LLMs, specifically those with 7 to 14 billion parameters, on the task of extracting information from the OCR output of digitized documents. OCR effectiveness can be degraded by factors such as skewed images and blurred photos, producing unstructured text with a variety of defects. The utility of these models is highlighted in Intelligent Process Automation (IPA), where software robots partially replace humans in validating and extracting information, improving efficiency and accuracy. The documents used in this research, provided by a state treasury department in Brazil, comprise personal verification documents. Results show that entry-level open-source models score 18% lower than a cutting-edge proprietary model with trillions of parameters, making them nonetheless viable free alternatives.