A Comprehensive Exploitation of Instance Selection Methods for Automatic Text Classification
Abstract
Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power, and more complexity, best exemplified by Large Language Models. However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. This Ph.D. dissertation focuses on an under-investigated NLP data engineering (DE) technique with enormous potential in the current scenario: Instance Selection (IS). The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining the effectiveness of the trained models and reducing the cost of the training process. We provide a comprehensive and scientifically sound comparison of IS methods applied to an essential NLP task – Automatic Text Classification (ATC) – considering several classification solutions and many datasets. Our findings reveal a significant untapped potential for IS solutions. We also propose two novel IS solutions that are noise-oriented and redundancy-aware, specifically designed for large datasets and transformer architectures. Our final solution achieved an average reduction of 41% in training set size while maintaining the same levels of effectiveness across all datasets. Importantly, our solutions delivered speedups of 1.67x on average (up to 2.46x), making them scalable to datasets with hundreds of thousands of documents. This thesis strongly aligns with WebMedia's objectives by addressing key challenges in processing vast web and social media data through innovative, scalable, and cost-effective strategies, and falls under the following topics of the WebMedia call for papers: (1) Document Engineering, Models and Languages; (2) AI, Machine Learning, and Deep Learning; and (3) NLP.
Keywords:
Instance Selection, Automatic Text Classification, Deep Learning
References
Washington Cunha, Sérgio Canuto, Felipe Viegas, Thiago Salles, C. Gomes, V. Mangaravite, E. Resende, Thierson Rosa, Marcos Gonçalves, and Leonardo Rocha. 2020. Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling. IP&M (2020).
Washington Cunha, Celso França, Guilherme Fonseca, Leonardo Rocha, and Marcos André Gonçalves. 2023. An effective, efficient, and scalable confidence-based instance selection framework for transformer-based text classification. In SIGIR’23.
Washington Cunha, V. Mangaravite, C. Gomes, S. Canuto, E. Resende, Cecilia Nascimento, F. Viegas, C. França, Jussara M. Almeida, et al. 2021. On the cost-effectiveness of neural and non-neural approaches and representations for text classification: A comprehensive comparative study. IP&M 58, 3 (2021), 102481.
Washington Cunha, Alejandro Moreo Fernández, Andrea Esuli, Fabrizio Sebastiani, Leonardo Rocha, and Marcos André Gonçalves. 2025. A Noise-Oriented and Redundancy-Aware Instance Selection Framework. ACM TOIS 43, 2 (2025).
Washington Cunha, Andrea Pasin, Marcos Goncalves, and Nicola Ferro. 2024. A Quantum Annealing Instance Selection Approach for Efficient and Effective Transformer Fine-Tuning. In Proceedings of the 2024 ACM SIGIR ICTIR.
Washington Cunha, Felipe Viegas, Celso França, Thierson Rosa, Leonardo Rocha, and Marcos Gonçalves. 2023. A comparative survey of instance selection methods applied to non-neural and transformer-based text classification. ACM CSUR (2023).
Published
2025-11-10
How to Cite
CUNHA, Washington; ROCHA, Leonardo; GONÇALVES, Marcos André. A Comprehensive Exploitation of Instance Selection Methods for Automatic Text Classification. In: THESIS AND DISSERTATION CONTEST - BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 31., 2025, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 15-16. ISSN 2596-1683. DOI: https://doi.org/10.5753/webmedia_estendido.2025.15955.
