GovBERT-BR: A BERT-Based Language Model for Brazilian Portuguese Governmental Data

  • Mariana O. Silva (UFMG)
  • Gabriel P. Oliveira (UFMG)
  • Lucas G. L. Costa (UFMG)
  • Gisele L. Pappa (UFMG)

Abstract


Interest in natural language processing (NLP) for governmental applications is growing, particularly in Brazil, where vast amounts of governmental data are processed daily. This has made increasingly apparent the need for specialized NLP models tailored to the nuances of Brazilian Portuguese and to the legal and administrative domains. Existing models may struggle to interpret the complexities of governmental texts, often leading to suboptimal performance in document classification and analysis tasks. To address these challenges, we introduce GovBERT-BR, a pre-trained language model tailored to the Brazilian governmental context, covering the legal and administrative domains. Leveraging insights from diverse governmental texts, GovBERT-BR targets both Brazilian Portuguese itself and the legal and bureaucratic terminology prevalent in governmental documents. We present the pre-training process and an experimental evaluation of GovBERT-BR, comparing its performance against baseline models across text classification tasks relevant to the Brazilian public sector. Our findings show that GovBERT-BR outperforms existing models in document and short-text classification tasks. An analysis of its convergence behavior during fine-tuning further indicates that it adapts rapidly to downstream tasks.
Published
17/11/2024
SILVA, Mariana O.; OLIVEIRA, Gabriel P.; COSTA, Lucas G. L.; PAPPA, Gisele L. GovBERT-BR: A BERT-Based Language Model for Brazilian Portuguese Governmental Data. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 13., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 19-32. ISSN 2643-6264.