Evaluating BERT Models for Semantic Retrieval in Long Portuguese Legal Documents

Adrielson Ferreira Justino; Antônio Fernando Lavareda Jacob Junior; Ricardo Marcondes Marcacini; Fábio Manoel França Lobato

doi:10.5753/eniac.2025.14328

Adrielson Ferreira Justino (UEMA)
Antônio Fernando Lavareda Jacob Junior (UEMA)
Ricardo Marcondes Marcacini (USP)
Fábio Manoel França Lobato (UEMA / UFOPA / USP)

Abstract

The growing number of digital documents in the Brazilian Judiciary creates new challenges for procedural efficiency. This study evaluated five BERT models on dense information retrieval for long judicial documents, using document segmentation and vector retrieval with Elasticsearch. General-purpose, domain-specific, and task-specific models were tested to measure intra-cluster coherence. BumbaBERT (domain-specific) performed best, confirming that domain specialization is crucial for effective semantic retrieval in zero-shot scenarios in the Brazilian legal context.
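
The abstract does not give the segmentation parameters or the exact coherence formula, so the following is only a minimal sketch under stated assumptions: long documents are split into overlapping token windows that fit the 512-token input limit of standard BERT models, and intra-cluster coherence is taken to be the mean pairwise cosine similarity of the embeddings in a cluster. The function names, the window size, and the overlap are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def split_into_chunks(tokens, max_len=512, overlap=64):
    """Split a token list into overlapping windows of at most max_len tokens.

    max_len and overlap are illustrative defaults, not the paper's settings.
    """
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        # Stop once the current window already reaches the end of the document.
        if start + max_len >= len(tokens):
            break
    return chunks


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intra_cluster_coherence(vectors):
    """Mean pairwise cosine similarity of the vectors in one cluster.

    This is one common way to define coherence; it is assumed here, not
    confirmed by the abstract.
    """
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # a cluster with a single vector is trivially coherent
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

In a pipeline of this kind, each chunk would be embedded by one of the BERT models and indexed as a dense vector, and the coherence score would be computed per model over groups of documents expected to be semantically related, so that a higher score indicates embeddings that keep related documents close together.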
