Evaluating Hate Speech Detection to Unseen Target Groups

  • Alexandre Negretti, UNICAMP
  • Marcos M. Raimundo, UNICAMP

Abstract

LLMs trained to detect hate speech face a significant challenge in identifying hate speech directed at new or less common target groups. This happens because the models are trained primarily on data covering the more prevalent forms of hate, aimed at groups that have historically been subjected to hate speech. Not only does the language of defamation evolve over time, but new target groups may also emerge, producing forms of hate that were previously absent from existing datasets. This work analyzes the influence of target groups on model predictions and evaluates training strategies that address target-group bias in hate speech detectors. Lastly, we present a novel dataset of Twitter posts concerning the 2022 Russia-Ukraine war.
Keywords: Large Language Models, Hate Speech Detection, Slavic Hate Dataset
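
The evaluation setup described in the abstract, testing a detector on target groups it never saw during training, can be illustrated with a minimal leave-one-group-out sketch. This is not the authors' code: the toy corpus, the (text, label, target_group) fields, and the TF-IDF plus logistic-regression classifier are illustrative stand-ins for an annotated hate speech dataset and an LLM-based detector.

    # Minimal leave-one-target-group-out evaluation sketch (illustrative only).
    # Each example is a (text, binary hate label, annotated target group) triple;
    # a TF-IDF + logistic-regression pipeline stands in for an LLM-based detector.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.pipeline import make_pipeline

    # Hypothetical toy corpus; a real study would use an annotated dataset.
    data = [
        ("toy hateful post about group a", 1, "group_a"),
        ("toy neutral post about group a", 0, "group_a"),
        ("toy hateful post about group b", 1, "group_b"),
        ("toy neutral post about group b", 0, "group_b"),
        ("toy hateful post about group c", 1, "group_c"),
        ("toy neutral post about group c", 0, "group_c"),
    ]

    for held_out in sorted({g for _, _, g in data}):
        # Train on every group except one; test on the group the model never saw.
        train = [(t, y) for t, y, g in data if g != held_out]
        test = [(t, y) for t, y, g in data if g == held_out]
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit([t for t, _ in train], [y for _, y in train])
        preds = model.predict([t for t, _ in test])
        print(f"unseen group {held_out}: "
              f"F1 = {f1_score([y for _, y in test], preds, zero_division=0):.2f}")

A gap between scores on groups seen during training and scores on the held-out group is the kind of target-group sensitivity the paper's analyses probe.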

Published
27/11/2024
NEGRETTI, Alexandre; RAIMUNDO, Marcos M. Evaluating Hate Speech Detection to Unseen Target Groups. In: CONFERÊNCIA LATINO-AMERICANA DE ÉTICA EM INTELIGÊNCIA ARTIFICIAL, 1., 2024, Niterói. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 101-104. DOI: https://doi.org/10.5753/laai-ethics.2024.32462.