Brazilian Consumer Protection Code: a methodology for a dataset to Question-Answer (QA) Models
Abstract
This work introduces a methodology for building a new dataset based on the Brazilian Consumer Protection Code (CDC), aimed at question-answer (QA) models. The dataset gathers legal data, including CDC articles, legal summaries, and court rulings from the Superior Court of Justice (STJ). The data were extracted automatically with Python, and the language models Llama3-8b-8192, Gemma2-9b-it, and GPT-4o-mini were used to generate the question-answer (QA) structures. The resulting dataset is intended for training language models in the legal domain, and in the CDC domain in particular.
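As an illustration of what a generated QA structure might look like, the sketch below builds a prompt for a language model and validates its reply as an extractive QA record. It is only a sketch under stated assumptions: the abstract does not specify the record schema, the prompt, or the validation rules, so the `QARecord` fields, `build_prompt`, and `parse_qa` are hypothetical, the SQuAD-like span check is one possible design, and the model reply is hard-coded rather than obtained from any of the models named above. The sample context is an English paraphrase of the withdrawal rule in CDC Art. 49, not the official text.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class QARecord:
    """One extractive QA example (hypothetical, SQuAD-like schema)."""
    source: str        # e.g. "CDC article", "STJ summary", "STJ ruling"
    context: str       # legal text the question is about
    question: str
    answer: str
    answer_start: int  # character offset of the answer inside the context


def build_prompt(context: str) -> str:
    """Assemble a hypothetical instruction asking a model for one QA pair as JSON."""
    return (
        "Read the legal text below and write one question whose answer is "
        "an exact span of the text. Reply only with JSON of the form "
        '{"question": "...", "answer": "..."}.\n\n'
        f"Text: {context}"
    )


def parse_qa(raw: str, context: str, source: str) -> Optional[QARecord]:
    """Turn a model reply into a QARecord, or return None if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    question = data.get("question")
    answer = data.get("answer")
    if not isinstance(question, str) or not isinstance(answer, str):
        return None
    question, answer = question.strip(), answer.strip()
    if not question or not answer:
        return None
    # Assumption: answers must be literal spans of the context (extractive QA).
    start = context.find(answer)
    if start == -1:
        return None
    return QARecord(source, context, question, answer, start)


# Paraphrase of CDC Art. 49 for illustration only, not the official text.
context = (
    "The consumer may withdraw from the contract within 7 days of signing it "
    "or of receiving the product or service, whenever the purchase takes "
    "place outside the commercial establishment."
)
# Hard-coded stand-in for a model reply; no API is called here.
reply = (
    '{"question": "Within how many days may the consumer withdraw from the '
    'contract?", "answer": "within 7 days"}'
)
record = parse_qa(reply, context, "CDC article")
print(record)
```

Rejecting replies whose answer is not a literal span of the context is a cheap filter against answers the model invented. A dataset with free-form (abstractive) answers would need a different check.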