On the Cost-Effectiveness of Stacking of Neural and Non-Neural Methods for Text Classification: Scenarios and Performance Prediction
Nowadays, neural network algorithms, such as those based on attention and Transformers, have excelled at Automatic Text Classification (ATC). However, this enhanced performance comes at a high computational cost. Stacking simpler classifiers that exploit algorithmic and representational complementarity has also been shown to produce superior ATC performance, combining high effectiveness with potentially lower computational costs than complex neural networks. In this master's thesis, we present the first and largest comparative study of the cost-effectiveness of stacking in ATC, covering Transformers and non-neural algorithms. In particular, we are interested in answering the following research question: is it possible to obtain an effective ensemble with significantly lower computational cost than the best learning model for a given dataset? Beyond answering that question, the other main contribution of this thesis is a low-cost oracle-based method that can predict the best ensemble in each scenario using only a fraction of the training data.
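To make the stacking setting concrete, the following is a minimal sketch using scikit-learn's StackingClassifier: base classifiers produce out-of-fold predictions that a meta-learner combines. The synthetic data, the choice of base learners (a linear SVM and a random forest), and the logistic-regression meta-learner are illustrative assumptions, not the actual experimental configuration of this thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for a vectorized text collection.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stack two complementary (algorithmically distinct) base classifiers;
# the meta-learner is trained on their cross-validated predictions.
stack = StackingClassifier(
    estimators=[("svm", LinearSVC(random_state=0)),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_tr, y_tr)
print(f"stacking accuracy: {stack.score(X_te, y_te):.3f}")
```

In a real ATC pipeline the inputs would be TF-IDF vectors or meta-features rather than synthetic features, and the cost comparison would weigh the ensemble's total training time against that of a fine-tuned Transformer.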