On the Cost-Effectiveness of Stacking of Neural and Non-Neural Methods for Text Classification: Scenarios and Performance Prediction

  • Christian Gomes Universidade Federal de Minas Gerais (UFMG)
  • Leonardo Rocha Universidade Federal de São João del-Rei (UFSJ)
  • Marcos Gonçalves Universidade Federal de Minas Gerais (UFMG)


Neural network algorithms currently excel in Automatic Text Classification (ATC), but their enhanced performance comes at high computational cost. Stacking simpler classifiers that exploit algorithmic and representational complementarity has also been shown to produce superior performance in ATC, combining high effectiveness with potentially much lower computational costs than complex neural networks. In this master’s thesis, we present the first and largest comparative study of the cost-effectiveness of Stacking in ATC, covering both Transformers and non-neural algorithms. We investigate when a cost-effective ensemble outperforms the single best model and propose a low-cost, oracle-based method to predict ensemble performance.
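To make the stacking idea (Wolpert, 1992) concrete: base (level-0) classifiers are trained on the original representation, and their out-of-fold predictions become meta-level features for a final (level-1) learner. The sketch below uses scikit-learn's `StackingClassifier` on a tiny hypothetical corpus; it illustrates the general technique only, not the thesis's actual Transformer + non-neural pipeline, datasets, or tuning.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: two sentiment classes of short documents.
docs = ["great movie, loved it", "wonderful acting and plot",
        "terrible film, waste of time", "boring and badly acted",
        "loved the soundtrack", "awful script, very dull"]
labels = [1, 1, 0, 0, 1, 0]

# Level-0 classifiers produce out-of-fold predictions (via internal CV),
# which serve as meta-features for the level-1 logistic regression.
stack = make_pipeline(
    TfidfVectorizer(),
    StackingClassifier(
        estimators=[("svm", LinearSVC()),
                    ("rf", RandomForestClassifier(n_estimators=50,
                                                  random_state=0))],
        final_estimator=LogisticRegression(),
        cv=2,  # small only because the toy corpus is tiny
    ),
)
stack.fit(docs, labels)
print(stack.predict(["loved every minute"]))
```

The complementarity argument in the abstract corresponds to choosing level-0 estimators with different inductive biases (here, a linear SVM and a tree ensemble), so the meta-learner can exploit their disagreements.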

Keywords: Text classification, Natural language processing, Machine learning, Deep learning, Supervised learning, Unsupervised learning, Sentiment analysis, Ensemble, Stacking


Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185.

Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., and Cox, D. D. (2015). Hyperopt: a python library for model selection and hyperparameter optimization. Computational Science & Discovery, 8(1):014008.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the ACL, 5:135–146.

Campos, R., Canuto, S., Salles, T., de Sá, C. C., and Gonçalves, M. A. (2017). Stacking bagged and boosted forests for effective automated classification. In SIGIR, pages 105–114.

Canuto, S., Gonçalves, M. A., and Benevenuto, F. (2016). Exploiting new sentiment-based meta-level features for effective sentiment analysis. In WSDM, pages 53–62.

Canuto, S., Salles, T., Gonçalves, M. A., Rocha, L., Ramos, G., Gonçalves, L., Rosa, T., and Martins, W. (2014). On efficient meta-level features for effective text classification. In CIKM, pages 1709–1718.

Canuto, S., Salles, T., Rosa, T. C., and Gonçalves, M. A. (2019). Similarity-based synthetic document representations for meta-feature generation in text classification. In SIGIR, pages 355–364.

Canuto, S., Sousa, D. X., Goncalves, M. A., and Rosa, T. C. (2018). A thorough evaluation of distance-based meta-features for automated text classification. IEEE TKDE, 30(12):2242–2256.

Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In SIGKDD, pages 785–794.

Cunha, W., Canuto, S., Viegas, F., Salles, T., Gomes, C., Mangaravite, V., Gonçalves, M. A., and Rocha, L. (2020). Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling. Inf. Processing & Management, 57(4):102263.

Cunha, W., Mangaravite, V., Gomes, C., Canuto, S., Resende, E., Nascimento, C., Viegas, F., França, C., Martins, W. S., Almeida, J. M., et al. (2021). On the cost-effectiveness of neural and non-neural approaches and representations for text classification: A comprehensive comparative study. Inf. Processing & Management, 58(3):102481.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Diao, Q., Qiu, M., Wu, C.-Y., Smola, A. J., Jiang, J., and Wang, C. (2014). Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In SIGKDD, pages 193–202.

Ding, W. and Wu, S. (2020). A cross-entropy based stacking method in ensemble learning. Journal of Intelligent & Fuzzy Systems, pages 1–12.

Džeroski, S. and Ženko, B. (2004). Is combining classifiers with stacking better than selecting the best one? Machine learning, 54(3):255–273.

Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. (2008). LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874.

Gomes, C., Gonçalves, M. A., Rocha, L., and Canuto, S. D. (2021). On the cost-effectiveness of stacking of neural and non-neural methods for text classification: Scenarios and performance prediction. In Findings of the Association for Computational Linguistics: ACL-IJCNLP, pages 4003–4014.

Hull, D. (1993). Using statistical testing in the evaluation of retrieval experiments. In SIGIR, pages 329–338.

Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pages 4765–4774. Curran Associates, Inc.

Pedregosa, F. et al. (2011). Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830.

Silva, R. M., Gomes, G. C., Alvim, M. S., and Gonçalves, M. A. (2016). Compression-based selective sampling for learning to rank. In CIKM, pages 247–256.

Sokolova, M. and Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Inf. Processing & Management, 45(4):427–437.

Sun, C., Qiu, X., Xu, Y., and Huang, X. (2019). How to fine-tune BERT for text classification? In Conference on Chinese Computational Linguistics, pages 194–206.

Tang, J., Alelyani, S., and Liu, H. (2014). Data classification: algorithms and applications. Data Mining and Knowledge Discovery Series, pages 37–64.

Tang, J., Qu, M., and Mei, Q. (2015). PTE: Predictive text embedding through large-scale heterogeneous text networks. In SIGKDD, pages 1165–1174.

Urbano, J., Lima, H., and Hanjalic, A. (2019). Statistical significance testing in information retrieval: an empirical analysis of type i, type ii and type iii errors. In SIGIR, pages 505–514.

Viegas, F., Rocha, L., Gonçalves, M., Mourão, F., Sá, G., Salles, T., Andrade, G., and Sandin, I. (2018). A genetic programming approach for feature selection in highly dimensional skewed data. Neurocomputing, 273:554–569.

Wolpert, D. H. (1992). Stacked generalization. Neural networks, 5(2):241–259.

Dong, Y.-S. and Han, K.-S. (2004). A comparison of several ensemble methods for text categorization. In SCC, pages 419–422.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. (2019). XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, pages 5753–5763.

Zhang, X., Zhao, J. J., and LeCun, Y. (2015). Character-level convolutional networks for text classification. CoRR, abs/1509.01626.
GOMES, Christian; ROCHA, Leonardo; GONÇALVES, Marcos. On the Cost-Effectiveness of Stacking of Neural and Non-Neural Methods for Text Classification: Scenarios and Performance Prediction. In: CONCURSO DE TESES E DISSERTAÇÕES (CTDBD) - SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 38. , 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023 . p. 213-224. DOI: https://doi.org/10.5753/sbbd_estendido.2023.231875.