A Methodology for Addressing Majority Bias in Stacking Models through Identification of Challenging Documents
Abstract
Stacking models are effective for automatic document classification because they exploit the complementarity of their base models. Even so, some documents, referred to here as difficult documents, are still misclassified due to a majority bias: most of the learned models point to a class different from the true one. This work presents a first, two-step proposal for overcoming failures caused by this bias. First, we train a difficult-document detector. Then, we use the detector to route difficult documents to a meta-classifier specialized in classifying them. Empirically, our approach shows promise in isolating the majority bias.
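
To make the two steps concrete, the sketch below shows one plausible scikit-learn instantiation. It is a minimal illustration under stated assumptions, not the authors' implementation: 20 Newsgroups stands in for the evaluation datasets, the base learners and meta-classifiers are ordinary linear models, the meta-features are one-hot-encoded base-model predictions, and "difficult" is operationalized as "the majority of base models is wrong" on out-of-fold training predictions.

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Hypothetical stand-in corpus; the paper's own benchmark datasets differ.
data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X_txt_train, X_txt_test, y_train, y_test = train_test_split(
    data.data, np.array(data.target), test_size=0.3, random_state=42,
    stratify=data.target)

vec = TfidfVectorizer(max_features=20000)
X_train, X_test = vec.fit_transform(X_txt_train), vec.transform(X_txt_test)

base_models = [LogisticRegression(max_iter=1000), LinearSVC(), MultinomialNB()]

# Standard stacking ingredient: out-of-fold base predictions as meta-features.
oof = np.column_stack([cross_val_predict(m, X_train, y_train, cv=5) for m in base_models])
tst = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in base_models])

# A training document is "difficult" when the majority of base models
# votes for a class different from the true one (the majority bias).
def majority_vote(preds):
    return np.apply_along_axis(lambda row: np.bincount(row).argmax(), 1, preds)

is_hard_train = (majority_vote(oof) != y_train).astype(int)

# One-hot encode the base predictions so linear meta-models can consume them.
enc = OneHotEncoder(handle_unknown="ignore")
meta_train, meta_test = enc.fit_transform(oof), enc.transform(tst)

# Step 1: train the difficult-document detector.
detector = LogisticRegression(max_iter=1000, class_weight="balanced")
detector.fit(meta_train, is_hard_train)
is_hard_test = detector.predict(meta_test).astype(bool)

# Step 2: route documents; the "hard" meta-classifier is trained only on
# difficult documents, the "easy" one on the rest.
easy, hard = is_hard_train == 0, is_hard_train == 1
easy_meta = LogisticRegression(max_iter=1000).fit(meta_train[easy], y_train[easy])
hard_meta = LogisticRegression(max_iter=1000).fit(meta_train[hard], y_train[hard])

final = np.empty_like(y_test)
if (~is_hard_test).any():
    final[~is_hard_test] = easy_meta.predict(meta_test[~is_hard_test])
if is_hard_test.any():
    final[is_hard_test] = hard_meta.predict(meta_test[is_hard_test])

print("Macro-F1:", f1_score(y_test, final, average="macro"))

In a realistic pipeline the detector would likely also see document-level features and the routing threshold would be tuned on validation data; here the two-step logic is kept to its bare minimum.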
Keywords:
Automatic Classification, Stacking, Majority Bias, Challenging Documents
Published
2023-09-25
How to Cite
SANTOS, Welton; CUNHA, Washington; FRANÇA, Celso; FONSECA, Guilherme; CANUTO, Sergio; ROCHA, Leonardo; GONÇALVES, Marcos. A Methodology for Addressing Majority Bias in Stacking Models through Identification of Challenging Documents. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 38., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 408-413. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2023.233366.
