Extracting Information from Brazilian Legal Documents with Retrieval Augmented Generation

  • Isabella V. de Aquino Universidade Federal de Santa Catarina (UFSC) http://orcid.org/0000-0002-7055-3503
  • Matheus M. dos Santos Universidade Federal de Santa Catarina (UFSC)
  • Carina F. Dorneles Universidade Federal de Santa Catarina (UFSC)
  • Jônata T. Carvalho Universidade Federal de Santa Catarina (UFSC)

Abstract

Extracting information from unstructured data is a challenge that has drawn increasing attention due to the exponential growth of stored digital data in modern society. Large Language Models (LLMs) have emerged as powerful tools that benefit from this abundance and have shown remarkable capabilities in Natural Language Processing tasks. Nonetheless, these models still encounter limitations on extraction tasks. Retrieval Augmented Generation (RAG) is a novel approach that combines classic retrieval techniques with LLMs to address some of these limitations. This paper proposes a workflow for assessing RAG experimental setups, covering the many possible combinations of parameters and LLMs, to extract structured data from Brazilian legal documents. We validated our proposal with experiments on forty legal documents and the extraction of two target variables. The best results obtained with our workflow reached an average extraction accuracy of 90%, significantly outperforming a regular expression strategy, which reached an average accuracy of 58.75%. Furthermore, our results show that each extracted variable potentially has its own optimal combination of parameters, highlighting the context-dependency of each extraction and, therefore, the usefulness of the proposed workflow.
Keywords: Information Extraction, Legal Documents, RAG, LLMs
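
The idea of assessing many experimental setups and keeping the best one per target variable can be illustrated with a small sketch. This is not the authors' workflow: the parameter names `chunk_size` and `top_k`, the helper functions, the toy documents, and the `toy_extract` stand-in (which replaces a real retrieval plus LLM call with a simple string search) are all assumptions made for illustration.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple


def expand_grid(grid: Dict[str, List[int]]) -> List[Dict[str, int]]:
    """Enumerate every combination of the parameter values in `grid`."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


def toy_extract(doc: str, variable: str, setup: Dict[str, int]) -> Optional[str]:
    """Hypothetical stand-in for a RAG pipeline: it only sees a prefix of the
    document whose length depends on the setup, then reads `variable=value`."""
    window = doc[: setup["chunk_size"] * setup["top_k"]]
    marker = variable + "="
    start = window.find(marker)
    if start == -1:
        return None
    tokens = window[start + len(marker):].split()
    return tokens[0] if tokens else None


def best_setup_per_variable(
    extract: Callable[[str, str, Dict[str, int]], Optional[str]],
    grid: Dict[str, List[int]],
    documents: List[str],
    gold: List[Dict[str, str]],
    variables: List[str],
) -> Dict[str, Tuple[Dict[str, int], float]]:
    """Score every setup by exact-match accuracy and keep the best per variable.
    Ties keep the setup that was evaluated first."""
    best: Dict[str, Tuple[Dict[str, int], float]] = {}
    for setup in expand_grid(grid):
        for var in variables:
            hits = sum(
                extract(doc, var, setup) == gold[i][var]
                for i, doc in enumerate(documents)
            )
            acc = hits / len(documents)
            if var not in best or acc > best[var][1]:
                best[var] = (setup, acc)
    return best


# Toy data (invented for this sketch, not taken from the paper).
documents = [
    "date=2020 lorem ipsum dolor amount=50",
    "date=2021 lorem ipsum dolor amount=75",
]
gold = [{"date": "2020", "amount": "50"}, {"date": "2021", "amount": "75"}]
grid = {"chunk_size": [5, 20], "top_k": [1, 2]}

best = best_setup_per_variable(toy_extract, grid, documents, gold, ["date", "amount"])
for variable, (setup, acc) in best.items():
    print(variable, setup, acc)
```

In this toy run the two variables end up with different best setups, which mirrors, in a very simplified way, the abstract's observation that each extracted variable may have its own optimal combination of parameters.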

Published
14 October 2024
AQUINO, Isabella V. de; M. DOS SANTOS, Matheus; DORNELES, Carina F.; T. CARVALHO, Jônata. Extracting Information from Brazilian Legal Documents with Retrieval Augmented Generation. In: WORKSHOP ON DATA SCIENCE AGAINST CORRUPTION IN THE PUBLIC SECTOR (DS-COPS) - SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 280-287. DOI: https://doi.org/10.5753/sbbd_estendido.2024.244241.