A Comprehensive Exploitation of Instance Selection Methods for Automatic Text Classification — “Doing More with Less”

  • Washington Cunha, Federal University of Minas Gerais (UFMG)
  • Leonardo Rocha, Federal University of São João del-Rei (UFSJ)
  • Marcos A. Gonçalves, Federal University of Minas Gerais (UFMG)

Abstract


Recent progress in NLP has followed a “more is better” trend (more data, computing power, and model complexity), best exemplified by Large Language Models (LLMs). However, training such models remains resource-intensive. This Ph.D. dissertation explores Instance Selection (IS), a promising yet underexplored data engineering technique that reduces training set size by removing noisy or redundant instances, lowering computational cost without sacrificing effectiveness. We comprehensively evaluate IS methods for Automatic Text Classification (ATC) across several classifiers and 22 datasets, uncovering significant untapped potential. Additionally, we propose two novel IS methods tailored for large datasets and LLMs. Our best solution cut training set sizes by 41% on average while preserving effectiveness and achieved speedups of up to 2.46x, demonstrating its scalability.
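
As an illustration only, the general idea of Instance Selection (dropping redundant and noisy training instances before training) can be sketched in a few lines of Python. This is not either of the two methods proposed in the dissertation: the function names, the Jaccard similarity over whitespace tokens, the neighbour-vote noise filter, and the value of k are all assumptions chosen to keep the example small and dependency-free.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two token sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_instances(texts, labels, k=3):
    """Toy instance selection (hypothetical helper, not the authors' method).

    Step 1 removes redundancy by dropping exact duplicate (text, label) pairs.
    Step 2 removes likely noise by dropping instances whose label disagrees
    with the majority label of their k most similar remaining neighbours.
    Returns the indices of the instances that are kept.
    """
    # Step 1: keep only the first copy of each (text, label) pair.
    seen, keep = set(), []
    for i, (text, label) in enumerate(zip(texts, labels)):
        key = (text.lower().strip(), label)
        if key not in seen:
            seen.add(key)
            keep.append(i)

    token_sets = {i: set(texts[i].lower().split()) for i in keep}

    # Step 2: neighbour label vote among the de-duplicated instances.
    selected = []
    for i in keep:
        neighbours = sorted(
            (j for j in keep if j != i),
            key=lambda j: jaccard(token_sets[i], token_sets[j]),
            reverse=True,
        )[:k]
        if not neighbours:
            selected.append(i)
            continue
        majority = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        if labels[i] == majority:
            selected.append(i)
    return selected


# Made-up toy data: index 1 duplicates index 0, index 7 carries a wrong label.
texts = [
    "great movie loved it",
    "great movie loved it",
    "loved this great movie",
    "great film loved it",
    "terrible plot hated everything",
    "hated that terrible plot",
    "terrible acting hated everything",
    "great movie loved it a lot",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

selected = select_instances(texts, labels, k=3)
reduction = 1 - len(selected) / len(texts)
print(selected, f"reduction: {reduction:.0%}")
```

On this toy data the duplicate (index 1) and the mislabeled instance (index 7) are dropped, so a classifier would train on six of the eight instances. The quadratic neighbour search is acceptable only for a sketch of this size.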

Keywords: instance selection, automatic text classification, large language models, green computing

Published
2025-09-29
CUNHA, Washington; ROCHA, Leonardo; GONÇALVES, Marcos A. A Comprehensive Exploitation of Instance Selection Methods for Automatic Text Classification — “Doing More with Less”. In: THESIS AND DISSERTATION CONTEST (CTDBD) - BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 218-222. DOI: https://doi.org/10.5753/sbbd_estendido.2025.247534.