A Methodology for Addressing Majority Bias in Stacking Models through Identification of Challenging Documents
Abstract
Stacking models are effective for automatic document classification because they exploit the complementarity of their base models. Even so, some documents, referred to here as difficult documents, are still misclassified due to a majority bias: most of the learned models point to a class different from the true one. This work presents a first, two-step proposal for overcoming failures caused by this bias. First, we train a difficult-document detector. Then, we use the detector to route difficult documents to a meta-classifier specialized in classifying them. Empirically, our approach shows promise in isolating the majority bias.
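
To make the two steps concrete, the sketch below shows one plausible scikit-learn instantiation. It is a minimal illustration under stated assumptions, not the authors' implementation: 20 Newsgroups stands in for the evaluation datasets, the base learners and meta-classifiers are ordinary linear models, the meta-features are one-hot-encoded base-model predictions, and "difficult" is operationalized as "the majority of base models is wrong" on out-of-fold training predictions.

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Hypothetical stand-in corpus; the paper's own benchmark datasets differ.
data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X_txt_train, X_txt_test, y_train, y_test = train_test_split(
    data.data, np.array(data.target), test_size=0.3, random_state=42,
    stratify=data.target)

vec = TfidfVectorizer(max_features=20000)
X_train, X_test = vec.fit_transform(X_txt_train), vec.transform(X_txt_test)

base_models = [LogisticRegression(max_iter=1000), LinearSVC(), MultinomialNB()]

# Standard stacking ingredient: out-of-fold base predictions as meta-features.
oof = np.column_stack([cross_val_predict(m, X_train, y_train, cv=5) for m in base_models])
tst = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in base_models])

# A training document is "difficult" when the majority of base models
# votes for a class different from the true one (the majority bias).
def majority_vote(preds):
    return np.apply_along_axis(lambda row: np.bincount(row).argmax(), 1, preds)

is_hard_train = (majority_vote(oof) != y_train).astype(int)

# One-hot encode the base predictions so linear meta-models can consume them.
enc = OneHotEncoder(handle_unknown="ignore")
meta_train, meta_test = enc.fit_transform(oof), enc.transform(tst)

# Step 1: train the difficult-document detector.
detector = LogisticRegression(max_iter=1000, class_weight="balanced")
detector.fit(meta_train, is_hard_train)
is_hard_test = detector.predict(meta_test).astype(bool)

# Step 2: route documents; the "hard" meta-classifier is trained only on
# difficult documents, the "easy" one on the rest.
easy, hard = is_hard_train == 0, is_hard_train == 1
easy_meta = LogisticRegression(max_iter=1000).fit(meta_train[easy], y_train[easy])
hard_meta = LogisticRegression(max_iter=1000).fit(meta_train[hard], y_train[hard])

final = np.empty_like(y_test)
if (~is_hard_test).any():
    final[~is_hard_test] = easy_meta.predict(meta_test[~is_hard_test])
if is_hard_test.any():
    final[is_hard_test] = hard_meta.predict(meta_test[is_hard_test])

print("Macro-F1:", f1_score(y_test, final, average="macro"))

In a realistic pipeline the detector would likely also see document-level features and the routing threshold would be tuned on validation data; here the two-step logic is kept to its bare minimum.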
Keywords:
Automatic Classification, Stacking, Majority Bias, Challenging Documents
Published
2023-09-25
How to Cite
SANTOS, Welton; CUNHA, Washington; FRANÇA, Celso; FONSECA, Guilherme; CANUTO, Sergio; ROCHA, Leonardo; GONÇALVES, Marcos. A Methodology for Addressing Majority Bias in Stacking Models through Identification of Challenging Documents. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 38., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 408-413. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2023.233366.
