LegalBert-pt: A Pretrained Language Model for the Brazilian Portuguese Legal Domain


Language models trained with Bidirectional Encoder Representations from Transformers (BERT) have demonstrated remarkable results in various Natural Language Processing (NLP) tasks. However, the legal domain poses specific challenges for NLP due to its highly specialized language, which includes technical vocabulary, a formal style, frequent citations of statutes, and semantics that depend on extensive background knowledge. Therefore, language models pretrained on a generic corpus may not be suitable for performing specific legal domain tasks: they lack the expertise needed to capture the nuances of legal language, leading to inaccuracies and inconsistencies. This work describes the development of LegalBert-pt, a language model specialized for the legal domain in Portuguese. The model was pretrained on a large and diverse corpus of Brazilian legal texts and is now open source and customizable for specific tasks. Experiments were conducted to evaluate the pretrained model's effectiveness in the legal domain, both intrinsically and on two downstream tasks: named-entity recognition and text classification. The results indicate that LegalBert-pt outperforms the generic language model on all tasks, emphasizing the importance of domain specialization in achieving effective results for specific tasks in the legal domain.
SILVEIRA, Raquel; PONTE, Caio; ALMEIDA, Vitor; PINHEIRO, Vládia; FURTADO, Vasco. LegalBert-pt: A Pretrained Language Model for the Brazilian Portuguese Legal Domain. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 12., 2023, Belo Horizonte/MG. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 268-282. ISSN 2643-6264.