A Sentiment Analysis Benchmark for Automated Machine Learning Applications and a Proof of Concept in Hate Speech Detection

Marília Costa Rosendo Silva; Vitor Augusto de Oliveira; Thiago Alexandre Salgueiro Pardo

doi:10.5753/stil.2023.234176

Marília Costa Rosendo Silva USP http://orcid.org/0000-0002-7931-7259
Vitor Augusto de Oliveira USP https://orcid.org/0009-0003-2708-2739
Thiago Alexandre Salgueiro Pardo USP https://orcid.org/0000-0003-2111-1319

Abstract

Automated Machine Learning (AutoML) is a relevant research area because it accelerates and simplifies the development of new applied solutions that use Artificial Intelligence. This paper addresses the challenge of providing standardized datasets for sentiment analysis in English and proposes an AutoML benchmark, resulting in 46 preprocessed datasets. A proof of concept on the hate speech detection task is carried out to demonstrate the potential of the proposed benchmark.

Keywords: benchmark, sentiment analysis, automated machine learning, automl
