Pseudo-labeling for Multi-label Legal Text Classification

Lucas Freitas; Thais Rodrigues; Guilherme Rodrigues; Pamella Edokawa; Ariane Farias

Lucas Freitas STF
Thais Rodrigues UnB
Guilherme Rodrigues UnB
Pamella Edokawa STF
Ariane Farias TRE-RR

Abstract

Data augmentation is a widely used strategy to improve classification performance, yet it is typically applied only to labeled training data. In many real-world scenarios, however, vast amounts of unlabeled data remain underutilized. Pseudo-labeling offers a semi-supervised way to incorporate this unlabeled data into model training. In this paper, we propose a simple yet effective pseudo-labeling method that combines clustering and label propagation to improve performance in multi-label text classification tasks. Our approach addresses common challenges such as biases arising from decision boundaries and class imbalance. As a case study, we apply the method to the classification of legal cases according to the Sustainable Development Goals of the United Nations 2030 Agenda. In this context, the proposed augmentation strategy led to notable improvements in both accuracy and sensitivity compared to models trained solely on the original labeled dataset. The approach provides a valuable means of expanding the existing knowledge base without labor-intensive manual classification.
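
To make the general idea of cluster-based label propagation concrete, the following is a minimal sketch and not the method evaluated in this paper. It assumes that documents have already been assigned to clusters by some clustering step, and it copies a label to the unlabeled members of a cluster when a sufficient share of that cluster's labeled members carry it. The function name `propagate_labels`, the parameters `min_labeled` and `threshold`, and the label names in the example are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def propagate_labels(cluster_ids, labels, min_labeled=2, threshold=0.6):
    """Assign pseudo-labels to unlabeled documents from their cluster peers.

    cluster_ids: one cluster id per document (from any clustering step).
    labels: one entry per document; a set of label names, or None if unlabeled.
    Returns a list of label sets. An unlabeled document stays None when its
    cluster has too few labeled members or no label reaches the threshold.
    Both parameters are illustrative assumptions, not values from the paper.
    """
    # Count labeled members per cluster and label occurrences among them.
    labeled_count = defaultdict(int)
    label_count = defaultdict(lambda: defaultdict(int))
    for cid, doc_labels in zip(cluster_ids, labels):
        if doc_labels is None:
            continue
        labeled_count[cid] += 1
        for name in doc_labels:
            label_count[cid][name] += 1

    result = []
    for cid, doc_labels in zip(cluster_ids, labels):
        if doc_labels is not None:
            result.append(set(doc_labels))  # manually assigned labels are kept as-is
            continue
        n = labeled_count.get(cid, 0)
        if n < min_labeled:
            result.append(None)  # not enough evidence in this cluster
            continue
        # Multi-label: keep every label whose share among labeled peers is high enough.
        pseudo = {name for name, c in label_count[cid].items() if c / n >= threshold}
        result.append(pseudo or None)
    return result


# Hypothetical example: three clusters, label names are placeholders.
cluster_ids = [0, 0, 0, 1, 1, 2]
labels = [{"SDG3"}, {"SDG3", "SDG16"}, None, {"SDG5"}, None, None]
pseudo_labels = propagate_labels(cluster_ids, labels)
```

In this example only the third document receives a pseudo-label, because its cluster is the only one with at least two labeled members. Requiring a minimum number of labeled peers and a minimum share per label is one simple way to limit noisy pseudo-labels. How the paper itself handles decision-boundary bias and class imbalance is not specified in the abstract.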