Towards a Corpus Methodology for LLMs in the Legal Domain

  • Aline Athaydes UFBA
  • Lucas B. Bulcão Mota UFBA
  • Fernando Humberto de Almeida Moraes Neto UFBA
  • Samuel Rios da Silva UFBA
  • Babacar Mane UFBA
  • Daniela Barreiro Claro UFBA
  • Marlo Souza UFBA
  • Andressa Beatriz Cardoso Lisboa UFBA

Abstract


The creation of high-quality Question-Answer (QA) datasets is critical for developing reliable legal AI systems, yet a significant gap exists between intrinsic textual metrics and real-world model performance. This paper introduces an end-to-end framework to bridge this gap. We first present a refined methodology for generating a legal QA dataset (V2) based on the Brazilian Consumer Protection Code (Código de Defesa do Consumidor, CDC), demonstrating its superiority over a baseline corpus using metrics such as MTLD and Shannon entropy. We then assess its practical impact by fine-tuning a Qwen3-8B model with LoRA. The model’s performance is evaluated on a novel, expert-validated, 76-question multiple-choice benchmark. Results show that the fine-tuned model achieves perfect accuracy on the benchmark and surpasses the base model across text generation metrics, including BLEU, METEOR, and BERTScore. Our work offers a reproducible methodology for legal dataset construction and validation, providing empirical evidence that improvements in data quality yield tangible gains in downstream legal reasoning tasks.
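As an illustration of the intrinsic metrics named in the abstract, the following is a minimal sketch of Shannon entropy and MTLD computed over a list of tokens. It is not the authors' implementation: the function names, the whitespace-level token input, the MTLD threshold of 0.72 (the value commonly used in the literature), and the handling of texts that never complete a factor are all assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the empirical token distribution."""
    total = len(tokens)
    if total == 0:
        return 0.0
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def _mtld_pass(tokens, threshold):
    """One directional MTLD pass: count how many 'factors' the text contains."""
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok)
        # A factor is complete once the running type-token ratio hits the threshold.
        if len(types) / count <= threshold:
            factors += 1.0
            types = set()
            count = 0
    if count > 0:
        # Partial credit for the unfinished segment at the end of the text.
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    # Assumption: a text with no repetition at all is reported as infinite.
    return len(tokens) / factors if factors > 0 else math.inf


def mtld(tokens, threshold=0.72):
    """Mean of the forward and backward MTLD passes."""
    if not tokens:
        return 0.0
    forward = _mtld_pass(tokens, threshold)
    backward = _mtld_pass(tokens[::-1], threshold)
    return (forward + backward) / 2.0


if __name__ == "__main__":
    sample = "o consumidor tem direito à informação clara sobre o produto".split()
    print("entropy (bits):", shannon_entropy(sample))
    print("MTLD:", mtld(sample))
```

Higher values of both measures indicate a more varied vocabulary; comparing two corpora this way only makes sense if both are tokenized identically.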

Published: 2025-09-29

ATHAYDES, Aline; MOTA, Lucas B. Bulcão; MORAES NETO, Fernando Humberto de Almeida; SILVA, Samuel Rios da; MANE, Babacar; CLARO, Daniela Barreiro; SOUZA, Marlo; LISBOA, Andressa Beatriz Cardoso. Towards a Corpus Methodology for LLMs in the Legal Domain. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 272-282. DOI: https://doi.org/10.5753/stil.2025.37831.