Brazilian Consumer Protection Code: a methodology for a dataset to Question-Answer (QA) Models

Aline Athaydes; Lucas Krug Bulcao; Caio Sacramento; Babacar Mane; Daniela Barreiro Claro; Marlo Souza; Robespierre Pita

doi:10.5753/stil.2024.31168

Aline Athaydes UFBA http://orcid.org/0009-0008-4071-7280
Lucas Krug Bulcao UFBA https://orcid.org/0009-0009-0648-7730
Caio Sacramento UFBA https://orcid.org/0009-0004-3641-9409
Babacar Mane UFBA https://orcid.org/0000-0002-9519-2847
Daniela Barreiro Claro UFBA https://orcid.org/0000-0001-8586-1042
Marlo Souza UFBA https://orcid.org/0000-0002-5373-7271
Robespierre Pita UFBA https://orcid.org/0000-0002-0616-620X

Abstract

This work presents a methodology for building a new dataset based on the Brazilian Consumer Protection Code (CDC), aimed at question-answer (QA) models. The dataset comprises legal data, including CDC articles, legal summaries, and court rulings from the Superior Court of Justice (STJ). The data were extracted automatically with Python, and the language models Llama3-8b-8192, Gemma2-9b-it, and GPT-4o-mini were used to generate the question-answer (QA) structures. The resulting dataset is intended for training language models in the legal domain, particularly on the CDC.

Keywords: Natural Language Processing (NLP), Corpus, LLMs, Question-Answering Systems, Machine Learning, Consumer Protection, Legal data
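
The QA-generation step mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the prompt wording, the JSON reply format, the `QAPair` record, and the function names `build_prompt` and `parse_qa_response` are all assumptions made for illustration, and the call to a language model is left out, so the reply below is a hand-written stand-in.

```python
import json
from dataclasses import dataclass


@dataclass
class QAPair:
    """One generated question-answer record (hypothetical schema)."""
    question: str
    answer: str
    source_id: str  # identifier of the CDC article, summary, or ruling


def build_prompt(source_id: str, text: str, n_questions: int = 3) -> str:
    """Build an instruction asking a model for QA pairs grounded in `text`."""
    return (
        f"Read the following excerpt ({source_id}) and write {n_questions} "
        "question-answer pairs that can be answered from the excerpt alone.\n"
        'Reply only with a JSON list of objects with the keys "question" '
        'and "answer".\n\n'
        f"Excerpt:\n{text}"
    )


def parse_qa_response(raw: str, source_id: str) -> list[QAPair]:
    """Parse a model reply, skipping malformed or empty items."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # the reply was not valid JSON
    if not isinstance(items, list):
        return []  # the reply was JSON but not a list of pairs
    pairs = []
    for item in items:
        if not isinstance(item, dict):
            continue
        question = str(item.get("question") or "").strip()
        answer = str(item.get("answer") or "").strip()
        if question and answer:
            pairs.append(QAPair(question, answer, source_id))
    return pairs


# Usage with placeholder text; a real run would send `prompt` to a model.
prompt = build_prompt("cdc-article-placeholder", "<excerpt text goes here>")
stand_in_reply = '[{"question": "Example question?", "answer": "Example answer."}]'
for pair in parse_qa_response(stand_in_reply, "cdc-article-placeholder"):
    print(pair)
```

Keeping the source identifier on every record lets each generated pair be traced back to the legal text it came from, which helps when pairs are reviewed or filtered afterwards.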

References

AI, G. (2023). Gemma2-9b-it model documentation. Accessed: 2024-09-12.

Jardim, P. C., Moraes, L. M. P., and Aguiar, C. D. d. A. (2023). QASports: A question answering dataset about sports. Repositório da Produção USP.

OpenAI (2023). GPT-4 technical report.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Silveira, R., Ponte, C., Almeida, V., Pinheiro, V., and Furtado, V. (2023). LegalBert-pt: A pretrained language model for the Brazilian Portuguese legal domain. In Proceedings of the Brazilian Conference on Intelligent Systems (BRACIS), pages 268–282. Sociedade Brasileira de Computação.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. (2023). LLaMA: Open and efficient foundation language models.