A Comprehensive Exploitation of Instance Selection Methods for Automatic Text Classification: “Doing More with Less”

  • Washington Cunha (UFMG)
  • Leonardo Rocha (UFSJ)
  • Marcos A. Gonçalves (UFMG)

Abstract

Progress in Natural Language Processing (NLP) has been dictated by the “rule of more”: more data, more computing power, and more complexity, best exemplified by the current Large Language Models (LLMs). Indeed, to work properly (with high accuracy) for (domain-)specific applications, these LLMs have to be fine-tuned, i.e., trained with domain-specific data, which usually requires significant amounts of computational (and natural) resources. This Ph.D. dissertation focuses on Instance Selection (IS), a data engineering technique under-investigated in NLP whose potential is enormous in the current data-intensive scenario. The goal of IS is to reduce the training set size by removing noisy or redundant training instances while maintaining the effectiveness of the trained models, thus reducing the cost of the training process. In the dissertation, we provide a comprehensive and scientifically sound comparison of many state-of-the-art (SOTA) IS methods applied to an essential NLP task – Automatic Text Classification (ATC) – considering several classification solutions and many datasets. Our findings reveal a significant untapped potential for IS solutions. In response to the limitations found in SOTA IS methods when applied to ATC, the dissertation proposes two novel noise-oriented and redundancy-aware IS solutions specifically designed for large datasets and Transformer architectures. Our final solution achieved an average reduction of 41% in training set size while maintaining the same effectiveness levels on all evaluated datasets. Our solutions also delivered average speedups of 1.67x (up to 2.46x) and reduced carbon emissions by up to 65%, making them scalable to datasets with hundreds of thousands of documents. All code and datasets produced in the dissertation are available for replication on GitHub. Our results were published in some of the most important Information Retrieval and NLP conferences and journals, as detailed in this document.
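
To make the IS idea concrete, the sketch below shows one simple, confidence-based way to filter a training set before fitting a classifier. It is a minimal illustration assuming scikit-learn, a TF-IDF/logistic-regression pipeline, the 20 Newsgroups dataset, and an arbitrary 0.6 confidence threshold; it is not the dissertation's noise-oriented and redundancy-aware framework, only the general pattern of scoring training instances and discarding the likely-noisy ones.

  # Minimal, self-contained sketch of a confidence-based instance selection step.
  # NOTE: this is a generic illustration of the IS idea (dropping likely-noisy
  # training instances before training), NOT the dissertation's actual framework;
  # the 0.6 threshold and the scikit-learn pipeline are illustrative assumptions.
  import numpy as np
  from sklearn.datasets import fetch_20newsgroups
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_predict

  # A small public text classification dataset (downloaded by scikit-learn).
  data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"],
                            remove=("headers", "footers", "quotes"))
  X = TfidfVectorizer(max_features=20000).fit_transform(data.data)
  y = np.asarray(data.target)

  # Out-of-fold class probabilities: each instance is scored by a model that
  # never saw it during training, so low confidence hints at noise or ambiguity.
  proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                            cv=5, method="predict_proba")
  true_label_conf = proba[np.arange(len(y)), y]

  # Keep only instances whose true label receives reasonable confidence.
  keep = true_label_conf >= 0.6
  print(f"Selected {keep.sum()} of {len(y)} instances ({keep.mean():.1%}).")

  # Train the final classifier on the reduced (selected) training set only.
  final_clf = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])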

References

Andrade, C., Belém, F. M., Cunha, W., et al. (2023). On the class separability of contextual embeddings representations – or “the classifier does not matter when the (text) representation is so good!”. IP&M.

Andrade, C., Cunha, W., Fonseca, G., Pagano, A., Santos, L., Pagano, A., Rocha, L., and Gonçalves, M. (2024). Explaining the hardest errors of contextual embedding based classifiers. In CoNLL’24.

Cunha, W., Rosa, T., Rocha, L., Gonçalves, M. A., et al. (2023a). A comparative survey of instance selection methods applied to non-neural and transformer-based text classification. ACM Comput. Surv.

Cunha, W., Canuto, S., Viegas, F., Salles, T., et al. (2020). Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling. IP&M.

Cunha, W., França, C., Rocha, L., and Gonçalves, M. A. (2023b). TPDR: A novel two-step transformer-based product and class description match and retrieval method. arXiv preprint arXiv:2310.03491.

Cunha, W., Mangaravite, V., Gomes, C., et al. (2021). On the cost-effectiveness of neural and non-neural approaches and representations for text classification: A comprehensive comparative study. IP&M.

Cunha, W., Moreo, A., Esuli, A., Sebastiani, F., Rocha, L., and Gonçalves, M. (2024a). A noise-oriented and redundancy-aware instance selection framework. ACM Transactions on Information Systems.

Cunha, W., Pasin, A., Gonçalves, M., and Ferro, N. (2024b). A quantum annealing instance selection approach for efficient and effective transformer fine-tuning. In ACM SIGIR ICTIR’24.

Cunha, W., Rocha, L., Gonçalves, M. A., et al. (2023c). An effective, efficient, and scalable confidence-based instance selection framework for transformer-based text classification. In ACM SIGIR’23.

DeepSeek-AI et al. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.

Fonseca, G., Cunha, W., and Rocha, L. (2024). Análise comparativa de métodos de undersampling em classificação automática de texto baseada em transformers. CTIC, 22(1).

Garcia, S., Derrac, J., Cano, J., and Herrera, F. (2012). Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Li, Q., Peng, H., Li, J., Xia, C., Yang, R., Sun, L., Yu, P. S., and He, L. (2022). A survey on text classification: From traditional to deep learning. ACM TIST, 13(2):1–41.

Martins, K., Vaz de Melo, P., and Santos, R. (2021). Why do document-level polarity classifiers fail? In Proceedings of NAACL-HLT 2021.

Pasin, A., Cunha, W., Dacrema, M. F., Cremonesi, P., Gonçalves, M., and Ferro, N. (2025). QuantumCLEF: Quantum computing at CLEF. In Advances in Information Retrieval (ECIR).

Rajaraman, S., Ganesan, P., and Antani, S. (2022). Deep learning model calibration for improving performance in class-imbalanced medical image classification tasks. PLoS ONE.

Roy, A. and Cambria, E. (2022). Soft labeling constraint for generalizing from sentiments in single domain. Knowledge-Based Systems, 245:108346.

Uppaal, R., Hu, J., and Li, Y. (2023). Is fine-tuning needed? Pre-trained language models are near perfect for out-of-domain detection. arXiv preprint arXiv:2305.13282.
Published
20/07/2025
CUNHA, Washington; ROCHA, Leonardo; GONÇALVES, Marcos A. A Comprehensive Exploitation of Instance Selection Methods for Automatic Text Classification: “Doing More with Less”. In: CONCURSO DE TESES E DISSERTAÇÕES (CTD), 38., 2025, Maceió/AL. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 25-34. ISSN 2763-8820. DOI: https://doi.org/10.5753/ctd.2025.7399.