Pseudo-labeling for Multi-label Legal Text Classification

  • Lucas Freitas STF
  • Thais Rodrigues UnB
  • Guilherme Rodrigues UnB
  • Pamella Edokawa STF
  • Ariane Farias TRE-RR

Abstract


Data augmentation is a widely used strategy to improve classification performance, yet it is applied only to labeled training data. In many real-world scenarios, however, vast amounts of unlabeled data remain underutilized. Pseudo-labeling offers a semi-supervised approach to incorporate this unlabeled data into model training. In this paper, we propose a simple yet effective pseudo-labeling method that combines clustering and label propagation to enhance performance in multi-label text classification tasks. Our approach addresses common challenges such as biases arising from decision boundaries and class imbalance. As a case study, we apply this method to the classification of legal cases according to the Sustainable Development Goals of the United Nations 2030 Agenda. In this context, the proposed augmentation strategy led to notable improvements in both accuracy and sensitivity compared with models trained solely on the original labeled dataset. This approach provides a valuable means to expand the existing knowledge base without labor-intensive manual classification.
Published
29/09/2025
FREITAS, Lucas; RODRIGUES, Thais; RODRIGUES, Guilherme; EDOKAWA, Pamella; FARIAS, Ariane. Pseudo-labeling for Multi-label Legal Text Classification. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 363-377. ISSN 2643-6264.