Reducing Dependence on Labeled Data: A Self-Supervised Fine-Tuning Approach for Low-Resource Language Models
Abstract
The development of domain-specific language models faces significant challenges due to the scarcity of labeled data, especially in low-resource languages such as Portuguese. Annotating data is expensive and time-consuming, limiting the ability to train effective models in specialized contexts. To address this, we investigate a self-supervised fine-tuning strategy based on the BERTimbau pre-training protocol. This approach allows the model to improve generalization using only unlabeled data, avoiding the need for manual annotation. We explore different combinations of unfrozen layers and learning rate configurations to identify training regimes that balance performance and computational cost. The method is evaluated on three sentiment analysis datasets in Portuguese, each from a distinct domain. Results show that unfreezing only the final layer, together with a properly tuned learning rate, achieves performance comparable to the traditional fine-tuning approach. These findings confirm the method's viability in low-resource settings and its potential to scale to large unlabeled datasets. The approach provides an efficient alternative for adapting language models when annotated data is limited.
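A minimal sketch of the kind of setup the abstract describes: self-supervised (masked-language-model) fine-tuning of BERTimbau on unlabeled in-domain text, with only the final encoder layer unfrozen. This is not the authors' code; the model checkpoint is the public BERTimbau release, while the data file, learning rate, and other hyperparameters are illustrative assumptions.

```python
# Sketch: self-supervised MLM fine-tuning of BERTimbau with only the final
# encoder layer unfrozen. Hyperparameters and the data file are assumptions.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

model_name = "neuralmind/bert-base-portuguese-cased"  # BERTimbau base
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Freeze all parameters, then unfreeze only the last encoder layer;
# the MLM head is also kept trainable so the masked-token loss can be optimized.
for param in model.parameters():
    param.requires_grad = False
for param in model.bert.encoder.layer[-1].parameters():
    param.requires_grad = True
for param in model.cls.parameters():
    param.requires_grad = True

# Unlabeled, in-domain Portuguese text (hypothetical file name).
raw = load_dataset("text", data_files={"train": "unlabeled_domain_text.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm_probability=0.15)
args = TrainingArguments(
    output_dir="bertimbau-domain-mlm",
    learning_rate=5e-5,              # illustrative; the paper tunes this value
    num_train_epochs=1,
    per_device_train_batch_size=16,
)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```

After this self-supervised stage, the adapted model would typically be fine-tuned or evaluated on the downstream sentiment analysis task with whatever labeled data is available.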
Published
29/09/2025
How to Cite
CONDORI-LUNA, Gian Franco; VEGA-OLIVEROS, Didier; REIS, Marcelo da Silva. Reducing Dependence on Labeled Data: A Self-Supervised Fine-Tuning Approach for Low-Resource Language Models. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 395-409. ISSN 2643-6264.
