LegalBert-pt: A Pretrained Language Model for the Brazilian Portuguese Legal Domain
Abstract
Language models trained with Bidirectional Encoder Representations from Transformers (BERT) have demonstrated remarkable results in various Natural Language Processing (NLP) tasks. However, the legal domain poses specific challenges for NLP due to its highly specialized language, which includes technical vocabulary, a formal style, frequent citation of statutes and case law, and semantics grounded in a vast body of knowledge. Therefore, language models pretrained on a generic corpus may not be suitable for specific legal-domain tasks: they lack the expertise needed to capture the nuances of legal language, leading to inaccuracies and inconsistencies. This work describes the development of LegalBert-pt, a language model specialized for the legal domain in Portuguese. The model was pretrained on a large and diverse corpus of Brazilian legal texts and is now open-source and customizable for specific tasks. Experiments were conducted to evaluate the pretrained model's effectiveness in the legal domain, both intrinsically and on two downstream tasks: named-entity recognition and text classification. The results indicate that LegalBert-pt outperforms the generic language model on all tasks, emphasizing the importance of domain specialization in achieving effective results for the legal domain.
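Since LegalBert-pt follows the standard BERT architecture, an open-source checkpoint like this is typically reused by attaching a task-specific head and fine-tuning. The Python sketch below illustrates that pattern with the Hugging Face transformers library; the repository id is a placeholder assumption, so check the authors' release for the actual identifier, and a real run would add labeled data and a training loop.

    # Minimal sketch of reusing a BERT-style legal-domain checkpoint.
    # MODEL_ID is a hypothetical placeholder, not the confirmed repo id.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_ID = "raquelsilveira/legalbertpt_fp"  # assumption: verify against the release

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # Attach a fresh classification head for a downstream legal task,
    # e.g. sorting petitions into 3 illustrative classes.
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=3)

    text = "Trata-se de ação de indenização por danos morais..."
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    outputs = model(**inputs)
    print(outputs.logits)  # unnormalized class scores; fine-tune before relying on them

The same checkpoint could instead be loaded with AutoModelForTokenClassification for named-entity recognition, mirroring the two downstream tasks evaluated in the paper.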
Published
25/09/2023
How to Cite
SILVEIRA, Raquel; PONTE, Caio; ALMEIDA, Vitor; PINHEIRO, Vládia; FURTADO, Vasco. LegalBert-pt: A Pretrained Language Model for the Brazilian Portuguese Legal Domain. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 12., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 268-282. ISSN 2643-6264.