Brazilian Consumer Protection Code: a methodology for a dataset to Question-Answer (QA) Models

Abstract


This work introduces a methodology for building a new dataset based on the Brazilian Consumer Protection Code (CDC), aimed at question-answer (QA) models. The dataset comprises legal data, including CDC articles, legal summaries, and court rulings from the Superior Court of Justice (STJ). The data were extracted automatically with Python, and language models such as Llama3-8b-8192, Gemma2-9b-it, and GPT-4o-mini were used to generate the question-answer (QA) structures. We present this methodology so that the resulting dataset can be used to train language models in the legal domain, particularly on the CDC.
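
As a rough illustration of the kind of pipeline the abstract describes, the sketch below splits statute text into articles, builds a prompt asking a language model for QA pairs, and parses the model's reply into records. It is a minimal sketch under stated assumptions, not the authors' implementation: the regular expression, the function names (`split_articles`, `build_prompt`, `parse_qa`), the prompt wording, and the JSON reply format are all hypothetical, and no specific model API is called.

```python
import json
import re

# Assumption: each article starts a line with "Art. 1º", "Art. 2°", "Art. 10.", etc.
ARTICLE_RE = re.compile(r"^Art\.\s*(\d+)[º°.]?\s*", re.MULTILINE)


def split_articles(text):
    """Split raw statute text into a list of {"number", "text"} records."""
    matches = list(ARTICLE_RE.finditer(text))
    articles = []
    for i, match in enumerate(matches):
        # An article body runs until the next article marker (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = " ".join(text[match.end():end].split())  # normalize whitespace
        articles.append({"number": int(match.group(1)), "text": body})
    return articles


def build_prompt(article):
    """Build a hypothetical prompt asking a model for QA pairs as JSON."""
    return (
        "Read the following article of the Brazilian Consumer Protection Code "
        "and write question-answer pairs about it. Reply only with a JSON list "
        'of objects with the keys "question" and "answer".\n\n'
        f"Art. {article['number']}: {article['text']}"
    )


def parse_qa(raw_response, article):
    """Turn a model's JSON reply into QA records, skipping incomplete pairs."""
    pairs = json.loads(raw_response)
    return [
        {
            "article": article["number"],
            "context": article["text"],
            "question": pair["question"].strip(),
            "answer": pair["answer"].strip(),
        }
        for pair in pairs
        if pair.get("question") and pair.get("answer")
    ]
```

In such a setup, the string returned by `build_prompt` would be sent to whichever model is in use, and the reply passed to `parse_qa`; keeping the article text as `context` gives each record the question, answer, and source passage that extractive QA datasets typically store.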

Keywords: Natural Language Processing (NLP), Corpus, LLMs, Question-Answering Systems, Machine Learning, Consumer Protection, Legal data

References

AI, G. (2023). Gemma2-9b-it model documentation. [link]. Accessed: 2024-09-12.

Jardim, P. C., Moraes, L. M. P., and Aguiar, C. D. d. A. (2023). QASports: A question answering dataset about sports. Repositório da Produção USP.

OpenAI (2023). GPT-4 technical report. [link].

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Silveira, R., Ponte, C., Almeida, V., Pinheiro, V., and Furtado, V. (2023). LegalBert-pt: A pretrained language model for the Brazilian Portuguese legal domain. In Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS), pages 268–282. Sociedade Brasileira de Computação.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. (2023). LLaMA: Open and efficient foundation language models.

Published: 2024-11-28

ATHAYDES, Aline; BULCAO, Lucas Krug; SACRAMENTO, Caio; MANE, Babacar; CLARO, Daniela Barreiro; SOUZA, Marlo; PITA, Robespierre. Brazilian Consumer Protection Code: a methodology for a dataset to Question-Answer (QA) Models. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 15., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 493-500. DOI: https://doi.org/10.5753/stil.2024.31168.