Extracting Information from Brazilian Legal Documents with Retrieval Augmented Generation

  • Isabella V. de Aquino, Universidade Federal de Santa Catarina (UFSC), http://orcid.org/0000-0002-7055-3503
  • Matheus M. dos Santos, Universidade Federal de Santa Catarina (UFSC)
  • Carina F. Dorneles, Universidade Federal de Santa Catarina (UFSC)
  • Jônata T. Carvalho, Universidade Federal de Santa Catarina (UFSC)


Extracting information from unstructured data is a challenge that has drawn increasing attention due to the exponential growth of stored digital data. Large Language Models (LLMs) have emerged as powerful tools that benefit from this abundance and have shown remarkable capabilities in Natural Language Processing tasks. Nonetheless, these models still face limitations in extraction tasks. Retrieval Augmented Generation (RAG) is a novel approach that combines classic retrieval techniques with LLMs to address some of these limitations. This paper proposes a workflow for assessing RAG experimental setups, covering the many possible combinations of parameters and LLMs, to extract structured data from Brazilian legal documents. We validated our proposal with experiments on forty legal documents and the extraction of two target variables. The best results obtained with our workflow reached an average extraction accuracy of 90%, significantly outperforming a regular expression strategy, which reached 58.75% average accuracy. Furthermore, our results show that each extracted variable potentially has its own optimal combination of parameters, highlighting the context-dependency of each extraction and, therefore, the usefulness of the proposed workflow.
Keywords: Information Extraction, Legal Documents, RAG, LLMs


