Reducing Dependence on Labeled Data: A Self-Supervised Fine-Tuning Approach for Low-Resource Language Models
Abstract
The development of domain-specific language models faces significant challenges due to the scarcity of labeled data, especially in low-resource languages such as Portuguese. Annotating data is expensive and time-consuming, limiting the ability to train effective models in specialized contexts. To address this, we investigate a self-supervised fine-tuning strategy based on the BERTimbau pre-training protocol. This approach allows the model to improve generalization using only unlabeled data, avoiding the need for manual annotation. We explore different combinations of unfrozen layers and learning rate configurations to identify training regimes that balance performance and computational cost. The method is evaluated on three sentiment analysis datasets in Portuguese, each from a distinct domain. Results show that unfreezing only the final layer, together with a properly tuned learning rate, achieves performance comparable to the traditional fine-tuning approach. These findings confirm the method's viability in low-resource settings and its potential to scale to large unlabeled datasets. The approach provides an efficient alternative for adapting language models when annotated data is limited.
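A minimal sketch of the kind of setup the abstract describes: self-supervised (masked-language-model) fine-tuning of BERTimbau on unlabeled in-domain text, with only the final encoder layer unfrozen. This is not the authors' code; the model checkpoint is the public BERTimbau release, while the data file, learning rate, and other hyperparameters are illustrative assumptions.

```python
# Sketch: self-supervised MLM fine-tuning of BERTimbau with only the final
# encoder layer unfrozen. Hyperparameters and the data file are assumptions.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

model_name = "neuralmind/bert-base-portuguese-cased"  # BERTimbau base
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Freeze all parameters, then unfreeze only the last encoder layer;
# the MLM head is also kept trainable so the masked-token loss can be optimized.
for param in model.parameters():
    param.requires_grad = False
for param in model.bert.encoder.layer[-1].parameters():
    param.requires_grad = True
for param in model.cls.parameters():
    param.requires_grad = True

# Unlabeled, in-domain Portuguese text (hypothetical file name).
raw = load_dataset("text", data_files={"train": "unlabeled_domain_text.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
args = TrainingArguments(
    output_dir="bertimbau-domain-mlm",
    learning_rate=5e-5,              # illustrative; the paper tunes this value
    num_train_epochs=1,
    per_device_train_batch_size=16,
)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```

After this self-supervised stage, the adapted model would typically be fine-tuned or evaluated on the downstream sentiment analysis task with whatever labeled data is available.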
Published
29/09/2025
How to Cite
CONDORI-LUNA, Gian Franco; VEGA-OLIVEROS, Didier; REIS, Marcelo da Silva. Reducing Dependence on Labeled Data: A Self-Supervised Fine-Tuning Approach for Low-Resource Language Models. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 395-409. ISSN 2643-6264.
