A Comprehensive Exploitation of Instance Selection Methods for Automatic Text Classification — “Doing More with Less”

Washington Cunha; Leonardo Rocha; Marcos A. Gonçalves

doi:10.5753/sbbd_estendido.2025.247534

Washington Cunha, Universidade Federal de Minas Gerais (UFMG)
Leonardo Rocha, Universidade Federal de São João del-Rei (UFSJ)
Marcos A. Gonçalves, Universidade Federal de Minas Gerais (UFMG)

Abstract

Recent progress in NLP has followed a "the more, the better" trend (more data, more computing power, and more model complexity), exemplified by Large Language Models (LLMs). However, training these models remains resource-intensive. This Ph.D. thesis explores Instance Selection (IS), a promising but underexplored data engineering technique that shrinks the training set by removing noisy or redundant instances, lowering computational cost without sacrificing quality. We comprehensively evaluate IS methods for automatic text classification across several models and 22 datasets, revealing significant untapped potential. We also propose two novel IS methods targeting large datasets and LLMs. Our best solution reduced training set sizes by 41% on average while preserving quality and achieved speedups of up to 2.46x, demonstrating its scalability.

Keywords: instance selection, automatic text classification, large language models, green computing
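
To make the idea of instance selection concrete, the following is a minimal sketch and not either of the methods proposed in the thesis. It assumes a deliberately simple heuristic: texts that are identical after normalization count as redundant, and groups of identical texts whose labels tie count as noisy. The function names (`normalize`, `select_instances`, `reduction_rate`) and the toy data are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict


def normalize(text):
    # Lowercase and collapse whitespace so trivially different copies match.
    return " ".join(text.lower().split())


def select_instances(texts, labels):
    """Toy instance selection (illustrative assumption, not the thesis method).

    Keeps one copy of each distinct normalized text (redundancy removal) and
    drops groups whose labels tie (a crude stand-in for noise removal).
    """
    votes = defaultdict(Counter)  # normalized text -> label counts
    first_seen = {}               # normalized text -> index of first copy
    for i, (text, label) in enumerate(zip(texts, labels)):
        key = normalize(text)
        votes[key][label] += 1
        first_seen.setdefault(key, i)

    selected = []
    for key, idx in first_seen.items():
        top = votes[key].most_common(2)
        # A tie between the two most frequent labels is treated as noise.
        if len(top) > 1 and top[0][1] == top[1][1]:
            continue
        selected.append((texts[idx], top[0][0]))
    return selected


def reduction_rate(n_original, n_selected):
    # Fraction of the original training set that was removed.
    return 1.0 - n_selected / n_original


if __name__ == "__main__":
    texts = ["Great movie", "great  movie", "Bad plot", "bad plot", "Fine"]
    labels = ["pos", "pos", "neg", "pos", "neu"]
    kept = select_instances(texts, labels)
    print(kept)
    print(f"reduction: {reduction_rate(len(texts), len(kept)):.0%}")
```

Practical instance selection methods rely on much richer signals than exact text matches, but the bookkeeping is the same: select a subset of the training set, then report how much smaller it is than the original.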
