Socially Responsible and Explainable Automated Fact-Checking and Hate Speech Detection

  • Francielle Vargas USP
  • Thiago Pardo USP
  • Fabrício Benevenuto UFMG

Abstract


Although Natural Language Processing (NLP) has traditionally relied on inherently interpretable “white-box” techniques, such as rule-based algorithms, decision trees, hidden Markov models, and logistic regression, the adoption of Large Language Models (LLMs) and language embeddings (often considered “black-box”) has significantly reduced interpretability. This lack of transparency introduces considerable risks, including biases, which have become a major concern in the field of Artificial Intelligence (AI). This Ph.D. thesis addresses these critical gaps by proposing new resources that ensure interpretability and fairness in NLP models for automated fact-checking and hate speech detection tasks. Specifically, we introduce five benchmark datasets (HateBR, HateBRXplain, HausaHate, MOL, and FactNews), four novel post-hoc and self-explaining methods (SELFAR, SSA, B+M, and SRA), and one web platform (NoHateBrazil) designed to improve the interpretability and fairness of hate speech detection. The proposed models outperform existing baselines for Portuguese and Hausa, both underrepresented languages. This research contributes to ongoing discussions on responsible and explainable AI, bridging the gap between model performance and interpretability to achieve positive real-world social impact. Finally, this thesis has had a significant impact both nationally and internationally: it has been cited by prestigious universities and research institutes abroad, has inspired new M.Sc. and Ph.D. research in Brazil, and has been recognized with multiple awards, including the Google Latin America Research Award (LARA), the Maria Carolina Monard AI Award, and a finalist position at the Brazilian Computer Society Thesis and Dissertation Award.

Keywords: natural language processing, explainability and interpretability, social networks and social media, misinformation, hate speech and online toxicity, fairness, responsible AI

References

Aida Mostafazadeh Davani, Mohammad Atari, Brendan Kennedy, and Morteza Dehghani. 2023. Hate Speech Classifiers Learn Normative Social Stereotypes. Transactions of the Association for Computational Linguistics 11 (2023), 300–319.

Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial Bias in Hate Speech and Abusive Language Detection Datasets. In Proceedings of the 3rd Workshop on Abusive Language Online. Florence, Italy, 25–35.

Michael Hameleers, Toni van der Meer, and Rens Vliegenthart. 2022. Civilized truths, hateful lies? Incivility and hate speech in false information – evidence from fact-checked statements in the US. Information, Communication & Society 25, 11 (2022), 1596–1613.

Cheng Li, Mengzhuo Chen, Jindong Wang, Sunayana Sitaram, and Xing Xie. 2024. CultureLLM: Incorporating Cultural Differences into Large Language Models. In Advances in Neural Information Processing Systems, Vol. 37. 84799–84838.

Cheng Li, Damien Teney, Linyi Yang, Qingsong Wen, Xing Xie, and Jindong Wang. 2025. CulturePark: boosting cross-cultural understanding in large language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS ’24). Red Hook, NY, USA, Article 2082, 34 pages.

Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman, and Rachel Rudinger. 2019. On Measuring Social Biases in Sentence Encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Minneapolis, Minnesota, 622–628.

Fabio Poletto, Valerio Basile, Manuela Sanguinetti, Cristina Bosco, and Viviana Patti. 2021. Resources and benchmark corpora for hate speech detection: a systematic review. Language Resources and Evaluation 55, 3 (2021), 477–523.

Isadora Salles, Francielle Vargas, and Fabrício Benevenuto. 2025. HateBRXplain: A Benchmark Dataset with Human-Annotated Rationales for Explainable Hate Speech Detection in Brazilian Portuguese. In Proceedings of the 31st International Conference on Computational Linguistics. Abu Dhabi, UAE, 6659–6669.

Henri Tajfel. 1979. An integrative theory of intergroup conflict. The social psychology of intergroup relations/Brooks/Cole (1979).

Yulia Tsvetkov, Vinodkumar Prabhakaran, and Rob Voigt. 2019. Socially Responsible Natural Language Processing. In Companion Proceedings of The 2019 World Wide Web Conference (San Francisco, USA) (WWW ’19). New York, USA, 1326.

Francielle Vargas, Isabelle Carvalho, Ali Hürriyetoğlu, Thiago Pardo, and Fabrício Benevenuto. 2023. Socially Responsible Hate Speech Detection: Can Classifiers Reflect Social Stereotypes?. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing. Varna, Bulgaria, 1187–1196.

Francielle Vargas, Isabelle Carvalho, Thiago Pardo, and Fabricio Benevenuto. 2024. Context-Aware and Expert Data Resources for Brazilian Portuguese Hate Speech Detection. Natural Language Processing (2024), 1–22.

Francielle Vargas, Isabelle Carvalho, Fabiana Rodrigues de Góes, Thiago Pardo, and Fabrício Benevenuto. 2022. HateBR: A Large Expert Annotated Corpus of Brazilian Instagram Comments for Offensive Language and Hate Speech Detection. In Proceedings of the 13th Language Resources and Evaluation Conference. Marseille, France, 7174–7183.

Francielle Vargas, Samuel Guimarães, Shamsuddeen Hassan Muhammad, Diego Alves, Ibrahim Said Ahmad, Idris Abdulmumin, Diallo Mohamed, Thiago Pardo, and Fabrício Benevenuto. 2024. HausaHate: An Expert Annotated Corpus for Hausa Hate Speech Detection. In Proceedings of the 8th Workshop on Online Abuse and Harms. Mexico City, Mexico, 52–58.

Francielle Vargas, Kokil Jaidka, Thiago Pardo, and Fabrício Benevenuto. 2023. Predicting Sentence-Level Factuality of News and Bias of Media Outlets. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing. Varna, Bulgaria, 1197–1206.

Francielle Vargas, Fabiana Rodrigues de Góes, Isabelle Carvalho, Fabrício Benevenuto, and Thiago Pardo. 2021. Contextual-Lexicon Approach for Abusive Language Detection. In Proceedings of the International Conference on Recent Advances in Natural Language Processing. Held Online, 1438–1447.

Francielle Vargas, Isadora Salles, Diego Alves, Ameeta Agrawal, Thiago A. S. Pardo, and Fabrício Benevenuto. 2024. Improving Explainable Fact-Checking via Sentence-Level Factual Reasoning. In Proceedings of the 7th Fact Extraction and VERification Workshop. Miami, USA, 192–204.

Claire Wardle. 2024. A Conceptual Analysis of the Overlaps and Differences between Hate Speech, Misinformation and Disinformation. Department of Peace Operations (DPO). Office of the Special Adviser on the Prevention of Genocide (OSAPG). United Nations.
Published: 2025-11-10

How to cite: VARGAS, Francielle; PARDO, Thiago; BENEVENUTO, Fabrício. Socially Responsible and Explainable Automated Fact-Checking and Hate Speech Detection. In: THESIS AND DISSERTATION CONTEST - BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 31., 2025, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 25-26. ISSN 2596-1683. DOI: https://doi.org/10.5753/webmedia_estendido.2025.16388.