What Makes a Sentence Toxic? A Critical Analysis of Specialized Models for Toxicity Detection
Abstract
This study examines the performance of specialized machine learning models on the task of online toxicity detection, under the hypothesis that these systems focus disproportionately on isolated lexical items. We conduct a comparative analysis of the Detoxify and Perspective API models using the HateXplain (English) and ToLD-Br (Portuguese) datasets. To assess model behavior, we employ the SHAP explainability framework, which enables the interpretation of feature importance in individual predictions. Our findings reveal a misalignment between model outputs and the nuanced, evolving nature of language on social media platforms. Furthermore, the results demonstrate an overreliance on negatively connoted keywords, which compromises the models’ classification accuracy and raises concerns regarding their robustness and fairness in real-world applications.
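As a concrete illustration of the methodology, the sketch below shows how SHAP token attributions can be obtained for a Detoxify-style classifier. This is a minimal example rather than the study's exact pipeline: it assumes the unitary/toxic-bert Hugging Face checkpoint (the model behind Detoxify's English release), and the input comment is illustrative only.

    # Minimal sketch: SHAP token attributions for a Detoxify-style
    # toxicity classifier. Assumes the unitary/toxic-bert checkpoint;
    # the input sentence is illustrative, not taken from the datasets.
    import shap
    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="unitary/toxic-bert",
        return_all_scores=True,  # SHAP needs a score for every label
    )

    # shap.Explainer wraps the pipeline with a text masker and
    # attributes each label's score to individual tokens.
    explainer = shap.Explainer(classifier)
    shap_values = explainer(["You are such an idiot, nobody agrees with you."])

    # Visualize per-token contributions to the 'toxic' label.
    shap.plots.text(shap_values[:, :, "toxic"])

Under the study's hypothesis, attributions in such plots concentrate on isolated negatively connoted tokens ("idiot" here) rather than on the surrounding context, which is the behavior the SHAP analysis is used to expose.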
References
Hind Almerekhi, Haewoon Kwak, Joni Salminen, and Bernard J Jansen. 2020. Are these comments triggering? Predicting triggers of toxicity in online discussions. In Proceedings of The Web Conference 2020. 3033–3040.
Sultan Alshamrani, Mohammed Abuhamad, Ahmed Abusnaina, and David Mohaisen. 2020. Investigating Online Toxicity in Users Interactions with the Mainstream Media Channels on YouTube. In CIKM (Workshops).
Kofi Arhin, Ioana Baldini, Dennis Wei, Karthikeyan Natesan Ramamurthy, and Moninder Singh. 2021. Ground-Truth, Whose Truth? Examining the Challenges with Annotating Toxic Text Datasets. arXiv preprint arXiv:2112.03529 (2021).
Danah M Boyd and Nicole B Ellison. 2007. Social network sites: Definition, history, and scholarship. Journal of Computer-Mediated Communication 13, 1 (2007), 210–230.
Athus Cavalini, Thamya Donadia, and Giovanni Comarela. 2024. Characterizing the toxicity of the Brazilian extremist communities on Telegram. In Brazilian Symposium on Multimedia and the Web (WebMedia). SBC, 370–374.
Casey Fiesler, Joshua McCann, Kyle Frye, Jed R Brubaker, et al. 2018. Reddit rules! Characterizing an ecosystem of governance. In Twelfth International AAAI Conference on Web and Social Media.
Paula Fortuna and Sérgio Nunes. 2018. A survey on automatic detection of hate speech in text. ACM Computing Surveys (CSUR) 51, 4 (2018), 1–30.
Paula Fortuna, Juan Soler, and Leo Wanner. 2020. Toxic, hateful, offensive or abusive? What are we really classifying? An empirical analysis of hate speech datasets. In Proceedings of the Twelfth Language Resources and Evaluation Conference. 6786–6794.
Priya Garg, MK Sharma, and Parteek Kumar. 2024. Improving Hate Speech Classification Through Ensemble Learning and Explainable AI Techniques. Arabian Journal for Science and Engineering (2024), 1–14.
Jennifer Golbeck, Zahra Ashktorab, Rashad O Banjo, Alexandra Berlinger, Siddharth Bhagwan, Cody Buntain, Paul Cheakalos, Alicia A Geller, Rajesh Kumar Gnanasekaran, Raja Rajan Gunasekaran, et al. 2017. A large labeled corpus for online harassment research. In Proceedings of the 2017 ACM on web science conference. 229–233.
Tommi Gröndahl, Luca Pajola, Mika Juuti, Mauro Conti, and N Asokan. 2018. All you need is "love": Evading hate speech detection. In Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security. 2–12.
Samuel S. Guimarães, Filipe N. Ribeiro, Julio C. S. Reis, and Fabrício Benevenuto. 2020. Characterizing Toxicity on Facebook Comments in Brazil. In Proceedings of the 26th Brazilian Symposium on Multimedia and the Web (WebMedia ’20). Association for Computing Machinery (ACM), São Luís, Brazil, 1–10. DOI: 10.1145/3428658.3430974
David Gunning. 2017. Explainable Artificial Intelligence (XAI). Defense Advanced Research Projects Agency (DARPA).
Laura Hanu and Unitary team. 2020. Detoxify. GitHub. [link].
Hossein Hosseini, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. 2017. Deceiving Google’s Perspective API built for detecting toxic comments. arXiv preprint arXiv:1702.08138 (2017).
Jigsaw/ConversationAI. 2018. Toxic Comment Classification Challenge. [link]
Jigsaw/ConversationAI. 2019. Jigsaw Unintended Bias in Toxicity Classification. [link]
Jigsaw/ConversationAI. 2020. Jigsaw Multilingual Toxic Comment Classification. [link]
Jae Yeon Kim, Carlos Ortiz, Sarah Nam, Sarah Santiago, and Vivek Datta. 2020. Intersectional bias in hate speech and abusive language datasets. arXiv preprint arXiv:2005.05921 (2020).
Deepak Kumar, Patrick Gage Kelley, Sunny Consolvo, Joshua Mason, Elie Bursztein, Zakir Durumeric, Kurt Thomas, and Michael Bailey. 2021. Designing toxic content classification for a diversity of perspectives. In Seventeenth Symposium on Usable Privacy and Security (SOUPS 2021). 299–318.
Ritesh Kumar, Atul Kr Ojha, Marcos Zampieri, and Shervin Malmasi (Eds.). 2018. Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018).
Alyssa Lees, Vinh Q Tran, Yi Tay, Jeffrey Sorensen, Jai Gupta, Donald Metzler, and Lucy Vasserman. 2022. A new generation of Perspective API: Efficient multilingual character-level transformers. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3197–3207.
Joao A Leite, Diego F Silva, Kalina Bontcheva, and Carolina Scarton. 2020. Toxic language detection in social media for Brazilian Portuguese: New dataset and multilingual analysis. arXiv preprint arXiv:2010.04543 (2020).
Chak Tou Leong, Yi Cheng, Jiashuo Wang, Jian Wang, and Wenjie Li. 2023. Self-detoxifying language models via toxification reversal. arXiv preprint arXiv:2310.09573 (2023).
Binny Mathew, Punyajoy Saha, Seid Muhie Yimam, Chris Biemann, Pawan Goyal, and Animesh Mukherjee. 2021. HateXplain: A benchmark dataset for explainable hate speech detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 14867–14875.
Harshkumar Mehta and Kalpdrum Passi. 2022. Social media hate speech detection using explainable artificial intelligence (XAI). Algorithms 15, 8 (2022), 291.
Shruthi Mohan, Apala Guha, Michael Harris, Fred Popowich, Ashley Schuster, and Chris Priebe. 2017. The impact of toxic language on the health of Reddit communities. In Canadian Conference on Artificial Intelligence. Springer, 51–56.
Christoph Molnar. 2020. Interpretable Machine Learning. Lulu.com.
Alexandra Olteanu, Kartik Talamadupula, and Kush R Varshney. 2017. The limits of abstract evaluation metrics: The case of hate speech detection. In Proceedings of the 2017 ACM on web science conference. 405–406.
J. W. Pennebaker, R. L. Boyd, K. Jordan, and K. Blackburn. 2015. The development and psychometric properties of LIWC2015. Technical Report. University of Texas at Austin. [link]
Julian Risch, Robin Ruff, and Ralf Krestel. 2020. Offensive language detection explained. In Proceedings of the second workshop on trolling, aggression and cyberbullying. 137–143.
Isadora Salles, Francielle Vargas, and Fabrício Benevenuto. 2025. HateBRXplain: A Benchmark Dataset with Human-Annotated Rationales for Explainable Hate Speech Detection in Brazilian Portuguese. In Proceedings of the 31st International Conference on Computational Linguistics. 6659–6669.
Ellen Spertus. 1997. Smokey: Automatic recognition of hostile messages. In AAAI/IAAI. 1058–1065.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. [link]
Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th international conference on world wide web. ACM, 1391–1399.
Zhixue Zhao, Ziqi Zhang, and Frank Hopfgartner. 2021. A comparative study of using pre-trained language models for toxic comment classification. In Companion Proceedings of the Web Conference 2021. 500–507.
