Experience Report on the Large-Scale Evaluation of Undersampling Strategies for Bias Reduction in SLM/LLM-Based Text Classification

Guilherme Fonseca; Gabriel Prenassi; Washington Cunha; Marcos André Gonçalves; Leonardo Rocha

doi:10.5753/pesquisanuvem.2026.22779

Guilherme Fonseca UFMG
Gabriel Prenassi UFSJ
Washington Cunha Unicamp
Marcos André Gonçalves UFMG
Leonardo Rocha UFSJ

Abstract

This paper reports our experience using AWS cloud infrastructure to enable a broad evaluation of undersampling methods for Automatic Text Classification (ATC) with SLMs and LLMs. The experimental protocol covered 21 undersampling techniques, 13 datasets with up to 1.3 million instances, and models such as RoBERTa and Llama 3.1, which imposed massive and heterogeneous computational demands. The solution used Amazon S3 as a data lake and specialized EC2 instances: c6a.8xlarge for data balancing and g4dn.xlarge/g5.xlarge for fine-tuning and inference. The cloud made it possible to standardize the experimental environment, parallelize runs, and ensure methodological rigor in the analysis of performance and running time.
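
As a rough illustration of how such a protocol could be laid out, the sketch below enumerates one job per (technique, dataset) pair for the balancing stage and one per (technique, dataset, model) triple for fine-tuning and inference, then tags each job with an instance type. It is only a planning sketch under stated assumptions, not the authors' pipeline: the technique and dataset identifiers are placeholders (the abstract gives only their counts), the pairing of RoBERTa with g4dn.xlarge and Llama 3.1 with g5.xlarge is assumed, and the bucket name and S3 key layout are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional

# Instance type the abstract names for the data-balancing stage.
CPU_INSTANCE = "c6a.8xlarge"

# ASSUMPTION: which GPU instance serves which model is not stated in the
# abstract; this pairing is illustrative only.
GPU_BY_MODEL = {"roberta": "g4dn.xlarge", "llama-3.1": "g5.xlarge"}

# Placeholder identifiers: only the counts (21 and 13) come from the abstract.
TECHNIQUES = [f"technique_{i:02d}" for i in range(1, 22)]
DATASETS = [f"dataset_{i:02d}" for i in range(1, 14)]


@dataclass(frozen=True)
class Job:
    stage: str                # "undersampling" or "finetune_inference"
    technique: str
    dataset: str
    model: Optional[str]      # None for the model-agnostic balancing stage
    instance_type: str
    s3_prefix: str            # hypothetical output location in the data lake


def build_plan(bucket: str = "example-bucket") -> list[Job]:
    """Enumerate every job in the grid (bucket name is hypothetical)."""
    plan = []
    # Stage 1: balance each dataset with each technique on the CPU instance.
    for tech, data in product(TECHNIQUES, DATASETS):
        plan.append(Job("undersampling", tech, data, None, CPU_INSTANCE,
                        f"s3://{bucket}/balanced/{data}/{tech}/"))
    # Stage 2: fine-tune and evaluate each model on each balanced dataset.
    for tech, data, model in product(TECHNIQUES, DATASETS, GPU_BY_MODEL):
        plan.append(Job("finetune_inference", tech, data, model,
                        GPU_BY_MODEL[model],
                        f"s3://{bucket}/results/{model}/{data}/{tech}/"))
    return plan


if __name__ == "__main__":
    jobs = build_plan()
    by_stage = {}
    for job in jobs:
        by_stage[job.stage] = by_stage.get(job.stage, 0) + 1
    print(by_stage)
```

With two models, this grid yields 273 balancing jobs and 546 fine-tuning jobs. Each job is independent and writes to its own prefix, so runs can be dispatched in parallel on identically configured instances.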
