ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language

  • Marcos Piau (UNICAMP)
  • Roberto Lotufo (UNICAMP / NeuralMind)
  • Rodrigo Nogueira (UNICAMP / Maritaca AI)

Abstract

Despite advancements in Natural Language Processing (NLP) and the growing availability of pretrained models, the English language remains the primary focus of model development. Continued pretraining on language-specific corpora provides a practical solution for adapting models to other languages. However, the impact of different pretraining settings on downstream tasks remains underexplored. This work introduces ptt5-v2, investigating the continued pretraining of T5 models for Portuguese. We first develop a baseline set of settings and pretrain models with sizes up to 3B parameters. Finetuning on three Portuguese downstream tasks (ASSIN2 STS, ASSIN2 RTE, and TweetSentBR) yields SOTA results on the latter two. We then explore the effects of different pretraining configurations, including quality filters, optimization strategies, and multi-epoch pretraining. Perhaps surprisingly, their impact remains subtle compared to our baseline. We release ptt5-v2 pretrained checkpoints and finetuned MonoT5 rerankers on HuggingFace in their respective collections at https://huggingface.co/unicamp-dl.
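
The released checkpoints are distributed through the unicamp-dl organization on HuggingFace and can be loaded with the transformers library. Below is a minimal sketch; the model id used here (unicamp-dl/ptt5-v2-base) is an assumed example, and the exact ids and sizes should be taken from the linked collections.

```python
# Minimal sketch of loading a ptt5-v2 checkpoint with HuggingFace transformers.
# The model id below is an assumption for illustration; check the unicamp-dl
# collections at https://huggingface.co/unicamp-dl for the released sizes.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "unicamp-dl/ptt5-v2-base"  # assumed id; other sizes up to 3B exist
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The checkpoint is a pretrained (not instruction-tuned) T5, intended to be
# finetuned on downstream Portuguese tasks; generation here only shows the API.
inputs = tokenizer("Exemplo de entrada em português.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```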

Published
17/11/2024
PIAU, Marcos; LOTUFO, Roberto; NOGUEIRA, Rodrigo. ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 13., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 324-338. ISSN 2643-6264.