A comprehensive exploitation of instance selection methods for automatic text classification

Washington Cunha; Leonardo Rocha; Marcos Gonçalves

doi:10.5753/sbsi_estendido.2025.246733

Washington Cunha UFMG
Leonardo Rocha UFSJ
Marcos Gonçalves UFMG

Abstract

Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power, and more complexity, best exemplified by Large Language Models. However, training (or fine-tuning) large dense models for specific applications usually requires significant computing resources. This Ph.D. dissertation focuses on an under-investigated NLP data engineering technique known as Instance Selection (IS), whose potential is enormous in the current scenario. The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining the effectiveness of the trained models and reducing the cost of the training process. We provide a comprehensive and scientifically sound comparison of IS methods applied to an essential NLP task – Automatic Text Classification (ATC), considering several classification solutions and many datasets. Our findings reveal a significant untapped potential for IS solutions. We also propose two novel IS solutions that are noise-oriented and redundancy-aware, specifically designed for large datasets and transformer architectures. Our final solution achieved an average reduction of 41% in training set size while maintaining the same levels of effectiveness in all datasets. Importantly, our solutions demonstrated speedups of 1.67x (up to 2.46x), making them scalable to datasets with hundreds of thousands of documents. This thesis falls under the Information Systems and Artificial Intelligence (Generative, LLM, NLP, among others) topic of the SBSI-CTDGSI call for papers.
