Estratégias de Undersampling para Redução de Viés em Classificação de Texto Baseada em Transformers

  • Guilherme Fonseca (UFSJ)
  • Gabriel Prenassi (UFSJ)
  • Washington Cunha (UFMG)
  • Marcos André Gonçalves (UFMG)
  • Leonardo Rocha (UFSJ)

Abstract

Automatic Text Classification (ATC) on imbalanced datasets is a common challenge in real-world applications. In this scenario, one or more classes are overrepresented, which usually biases the learning process towards these majority classes. This work investigates the effect of undersampling methods, which reduce the number of instances of the majority class, on the effectiveness of recent ATC methods. Through a systematic mapping of the literature, we selected and implemented 15 undersampling strategies. We also propose two new strategies and compare all 17 methods using RoBERTa as the sentiment analysis classifier. Our results suggest that a subset of undersampling approaches can significantly reduce the learning bias of ATC methods towards the majority class on imbalanced datasets without any loss in effectiveness, while improving efficiency and reducing carbon emissions.
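To illustrate the setup described in the abstract, the minimal sketch below balances an imbalanced sentiment corpus by undersampling the majority class before fine-tuning RoBERTa. It assumes imbalanced-learn's RandomUnderSampler (random undersampling, used here only as a simple stand-in for the strategies studied in the paper) and the Hugging Face roberta-base checkpoint; the toy texts, labels, and variable names are illustrative and not the authors' pipeline or datasets.

```python
from collections import Counter

import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

# Hypothetical imbalanced sentiment corpus: 4 negative examples per positive one.
texts = ["great product, works well"] * 100 + ["terrible, broke on day one"] * 400
labels = [1] * 100 + [0] * 400

# imbalanced-learn expects a 2-D feature matrix, so we undersample over example
# indices and map them back to the raw texts afterwards.
indices = np.arange(len(texts)).reshape(-1, 1)
sampler = RandomUnderSampler(random_state=42)
kept_idx, kept_labels = sampler.fit_resample(indices, labels)
kept_texts = [texts[i] for i in kept_idx.ravel()]

print("before:", Counter(labels))       # Counter({0: 400, 1: 100})
print("after: ", Counter(kept_labels))  # Counter({0: 100, 1: 100})

# The balanced subset is then tokenized and used to fine-tune RoBERTa
# (training loop omitted for brevity).
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
encodings = tokenizer(kept_texts, truncation=True, padding=True, return_tensors="pt")
```

Because imbalanced-learn samplers share the fit_resample interface, a different undersampling strategy could replace RandomUnderSampler in this sketch without changing the rest of the pipeline.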

Keywords: Text Classification, Transformers, Undersampling

Published
14/10/2024
FONSECA, Guilherme; PRENASSI, Gabriel; CUNHA, Washington; GONÇALVES, Marcos André; ROCHA, Leonardo. Estratégias de Undersampling para Redução de Viés em Classificação de Texto Baseada em Transformers. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 30., 2024, Juiz de Fora/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 144-152. DOI: https://doi.org/10.5753/webmedia.2024.241229.
