Brazilian Consumer Protection Code: a methodology for a dataset to Question-Answer (QA) Models

Abstract


This work introduces a methodology for building a new dataset based on the Brazilian Consumer Protection Code (CDC), aimed at question-answer (QA) models. The dataset comprises legal data: CDC articles, legal summaries, and court rulings from the Superior Court of Justice (STJ). The data were extracted automatically with Python, and the language models Llama3-8b-8192, Gemma2-9b-it, and GPT-4o-mini were used to generate the question-answer (QA) structures. The resulting dataset is intended for training language models in the legal domain, particularly on the CDC.
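
As a rough illustration of the generation step described above, the sketch below shows one way a legal passage could be turned into a prompt and a model reply parsed into a QA entry. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `QARecord` fields, the prompt wording, the `build_prompt` and `parse_reply` helpers, and the rule that the answer must appear verbatim in the passage are all hypothetical. The model call itself is left out; a hand-written reply stands in for it, and the passage is an English rendering of CDC Art. 2 used only as sample input.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class QARecord:
    """One SQuAD-style entry (hypothetical schema, not the paper's)."""
    source: str    # where the passage comes from, e.g. "CDC, Art. 2"
    context: str   # the legal passage itself
    question: str
    answer: str


def build_prompt(source: str, context: str) -> str:
    """Build an instruction asking a model for one QA pair as JSON."""
    return (
        "Read the legal passage below and write one question that it answers.\n"
        'Reply only with JSON of the form {"question": "...", "answer": "..."}.\n'
        f"Source: {source}\n"
        f"Passage: {context}"
    )


def parse_reply(source: str, context: str, reply: str) -> Optional[QARecord]:
    """Turn a model reply into a QARecord, or None if the reply is unusable."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    question = str(data.get("question") or "").strip()
    answer = str(data.get("answer") or "").strip()
    # Assumed filter: keep only answers found verbatim in the passage.
    if not question or not answer or answer not in context:
        return None
    return QARecord(source, context, question, answer)


# Sample input: English rendering of CDC Art. 2 (illustrative only).
context = (
    "A consumer is any natural or legal person who acquires or uses "
    "a product or service as a final recipient."
)
prompt = build_prompt("CDC, Art. 2", context)

# Hand-written stand-in for what a language model might return.
reply = json.dumps({
    "question": "Who is considered a consumer?",
    "answer": "any natural or legal person who acquires or uses "
              "a product or service as a final recipient",
})

record = parse_reply("CDC, Art. 2", context, reply)
if record is not None:
    print(json.dumps(asdict(record), ensure_ascii=False, indent=2))
```

Parsing and validation are kept separate from the model call so that replies from different models can go through the same checks, and malformed or ungrounded replies are dropped rather than stored.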

Keywords: Natural Language Processing (NLP), Corpus, LLMs, Question-Answering Systems, Machine Learning, Consumer Protection, Legal data

Published
28/11/2024
ATHAYDES, Aline; BULCAO, Lucas Krug; SACRAMENTO, Caio; MANE, Babacar; CLARO, Daniela Barreiro; SOUZA, Marlo; PITA, Robespierre. Brazilian Consumer Protection Code: a methodology for a dataset to Question-Answer (QA) Models. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 15., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 493-500. DOI: https://doi.org/10.5753/stil.2024.31168.