Enhancing Legal Question Answering in Brazilian Portuguese through Domain-Specific Embedding Models
Abstract

The increasing digitization of legal documents presents significant challenges for information retrieval. Traditional keyword-based search methods often fail to capture the semantic nuances of complex legal queries. Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for building Question Answering (Q&A) systems, but its effectiveness is highly dependent on the quality of its retrieval component. This paper addresses the problem of improving semantic search over legal texts from the Court of Accounts of the State of Goiás (TCE-GO). We introduce two specialized embedding models created by fine-tuning the state-of-the-art BGE-M3 model on domain-specific corpora of jurisprudence and legislation, respectively. Our experimental results demonstrate that these specialized models significantly outperform general-purpose multilingual and Portuguese models in retrieval tasks, as measured by MRR@10 and Recall@10. Notably, our fine-tuned models, despite their moderate size, surpass much larger models, highlighting that domain specialization is a more parameter-efficient strategy than simply scaling model size for niche domains.
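The paper's exact training recipe is not reproduced on this page, but the abstract's core idea, contrastive fine-tuning of BGE-M3 on domain pairs followed by evaluation with MRR@10 and Recall@10, can be sketched as follows. This is a minimal illustration assuming the Sentence-Transformers library, the public BAAI/bge-m3 checkpoint, and placeholder (query, passage) pairs; the TCE-GO corpora and the authors' actual hyperparameters are not shown here.

```python
# Hypothetical sketch: domain fine-tuning of BGE-M3 with in-batch negatives,
# plus the standard MRR@10 / Recall@10 definitions. Assumes the
# sentence-transformers library; the training pairs below are placeholders,
# not the paper's TCE-GO data.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("BAAI/bge-m3")  # public base checkpoint

# Placeholder (query, relevant passage) pairs mined from the target corpus.
train_pairs = [
    InputExample(texts=["query about public procurement limits",
                        "passage with the corresponding ruling text ..."]),
]
loader = DataLoader(train_pairs, shuffle=True, batch_size=16)
# MultipleNegativesRankingLoss: other passages in the batch act as negatives.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)

def mrr_at_k(ranked_ids, relevant_id, k=10):
    """Reciprocal rank of the first relevant passage in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_id, k=10):
    """1 if the relevant passage appears in the top k, else 0 (single-relevant case)."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0
```

The reported MRR@10 and Recall@10 would then be these per-query scores averaged over all evaluation queries.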
References

Araujo, A., Golo, M., Viana, B., Sanches, F., Romero, R., and Marcacini, R. (2020). From bag-of-words to pre-trained neural language models: Improving automatic classification of app reviews for requirements engineering. In Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), pages 378–389. SBC.
Chalkidis, I., Kamateri, E., Lazaridou, K., Aletras, N., Katakalou, M., and Krithara, A. (2020). LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904.
Evangelista, G. A., de Oliveira, J. B., et al. (2024). Hybrid CNN-GNN models in active sonar imagery: an experimental evaluation. In Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), pages 37–48. SBC.
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., and Wang, H. (2024). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.
Gomes, L., Branco, A., Silva, J., Rodrigues, J., and Santos, R. (2024). Open sentence embeddings for Portuguese with the Serafim PT encoders family. In Santos, M. F., Machado, J., Novais, P., Cortez, P., and Moreira, P. M., editors, Progress in Artificial Intelligence, pages 267–279, Cham. Springer Nature Switzerland.
Gururangan, S., Marasović, A., Swaminathan, S., Lo, K., Beltagy, I., Downey, D., and Smith, N. A. (2020). Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360.
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., and Liu, Z. (2024). BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216.
Junior, G. S. T., Peres, S. M., Fantinato, M., Brandao, A. A., and Cozman, F. G. (2024). A goal-oriented chat-like system for evaluation of large language models. In Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), pages 743–754. SBC.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
Li, Z., Zhang, X., Zhang, Y., Long, D., Xie, P., and Zhang, M. (2023). Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281.
Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173.
Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. (2022). MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.
Rocha, L. M. and Pessoa, R. M. (2024). Advanced retrieval augmented generation for local LLMs. In Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), pages 767–776. SBC.
Rocho, R. S. M., Perez, A. L. F., Farias, G. P., and Panisson, A. R. (2024). Integrating LLMs and chatbots technologies - a case study on Brazilian transit law. In Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), pages 731–742. SBC.
Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: Pretrained BERT Models for Brazilian Portuguese. In Brazilian Conference on Intelligent Systems, pages 403–417. Springer.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wang, L., Yang, N., Huang, X., Yang, L., Majumder, R., and Wei, F. (2024). Multilingual e5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672.
Zhang, X., Zhang, Y., Long, D., Xie, W., Dai, Z., Tang, J., Lin, H., Yang, B., Xie, P., Huang, F., et al. (2024). mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1393–1412.
Published
29/09/2025

How to Cite
NOVAIS, Artur M. A.; FERREIRA, David O. C.; SILVA, Josiel P. C.; BRAKES, Matheus F. C.; PRESA, João P. C.; OLIVEIRA, Sávio S. T. de. Enhancing Legal Question Answering in Brazilian Portuguese through Domain-Specific Embedding Models. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 523-533. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.13841.
