Socially Responsible and Explainable Automated Fact-Checking and Hate Speech Detection
Abstract
Misinformation and hate speech form a socially harmful cycle: research shows that misinformation can amplify hate speech targeting social identity groups and reinforce harmful stereotypes. To combat this cycle, a wide range of Natural Language Processing (NLP) methods have been proposed. Nevertheless, while NLP has historically relied on inherently explainable “white-box” techniques, such as rule-based algorithms, decision trees, hidden Markov models, and logistic regression, the adoption of Large Language Models (LLMs) and language embeddings (often considered “black-box”) has significantly reduced interpretability. This lack of transparency introduces considerable risks, including biases, which have become a major concern in AI. This Ph.D. thesis addresses these critical gaps by proposing new resources that ensure explainability and bias mitigation in NLP models for these tasks. Specifically, it introduces five benchmark datasets (HateBR, HateBRXplain, HausaHate, MOL, and FactNews), three novel methods (SELFAR, SSA, and B+M), and one web system (NoHateBrazil) designed to improve the explainability and fairness of automated fact-checking and hate speech detection. The proposed models outperform existing baselines for Portuguese and Hausa, both underrepresented languages. This research contributes to ongoing discussions on responsible and explainable AI, bridging the gap between model performance and interpretability for real-world applications. Finally, it has had a significant impact both nationally and internationally, receiving citations from prestigious universities and research institutes abroad and inspiring new M.Sc. and Ph.D. projects in Brazil.
References
Al Kuwatly, H., Wich, M., and Groh, G. (2020). Identifying and measuring annotator bias based on annotators’ demographic characteristics. In Akiwowo, S., Vidgen, B., Prabhakaran, V., and Waseem, Z., editors, Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 184–190, Held Online.
Amazeen, M. (2015). Revisiting the epistemology of fact-checking. Critical Review, 27(1):1–30.
Chuang, Y.-S., Gao, M., Luo, H., Glass, J., Lee, H.-y., Chen, Y.-N., and Li, S.-W. (2021). Mitigating biases in toxic language detection through invariant rationalization. In Proceedings of the 5th Workshop on Online Abuse and Harms, pages 114–120, Held Online.
Davani, A. M., Atari, M., Kennedy, B., and Dehghani, M. (2023). Hate speech classifiers learn normative social stereotypes. Transactions of the Association for Computational Linguistics, 11:300–319.
Davidson, T., Bhattacharya, D., and Weber, I. (2019). Racial bias in hate speech and abusive language detection datasets. In Proceedings of the 3rd Workshop on Abusive Language Online, pages 25–35, Florence, Italy.
Dixon, L., Li, J., Sorensen, J., Thain, N., and Vasserman, L. (2018). Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, pages 67–73, New York, USA.
Garg, P., Chakravarthy, A. S., Mandal, M., Narang, P., Chamola, V., and Guizani, M. (2021). ISDNet: AI-enabled instance segmentation of aerial scenes for smart cities. ACM Transactions on Internet Technology (TOIT), 21(3):1–18.
Gongane, V. U., Munot, M. V., and Anuse, A. D. (2024). A survey of explainable AI techniques for detection of fake news and hate speech on social media platforms. Journal of Computational Social Science, 7(1):587–623.
Hameleers, M., van der Meer, T., and Vliegenthart, R. (2022). Civilized truths, hateful lies? Incivility and hate speech in false information – evidence from fact-checked statements in the US. Information, Communication & Society, 25(11):1596–1613.
Kennedy, B., Jin, X., Mostafazadeh Davani, A., Dehghani, M., and Ren, X. (2020). Contextualizing hate speech classifiers with post-hoc explanation. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J., editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5435–5442, Held Online.
Kuzmin, G., Larionov, D., Pisarevskaya, D., and Smirnov, I. (2020). Fake news detection for the Russian language. In Proceedings of the 3rd International Workshop on Rumours and Deception in Social Media, pages 45–57, Barcelona, Spain.
Marietta, M., Barker, D. C., and Bowser, T. (2015). Fact-checking polarized politics: Does the fact-check industry provide consistent guidance on disputed realities? The Forum, 13(4):577–596.
Marwick, A. E. and Lewis, B. (2017). Media manipulation and disinformation online. Data and Society Research Institute, pages 1–104.
May, C., Wang, A., Bordia, S., Bowman, S. R., and Rudinger, R. (2019). On measuring social biases in sentence encoders. In Burstein, J., Doran, C., and Solorio, T., editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 622–628, Minneapolis, Minnesota.
Nieminen, S. and Rapeli, L. (2019). Fighting misperceptions and doubting journalists’ objectivity: A review of fact-checking literature. Political Studies Review, 17(3):296–309.
Park, S., Park, J. Y., Kang, J.-h., and Cha, M. (2021). The presence of unexpected biases in online fact-checking. Harvard Kennedy School Misinformation Review, 2(1).
Pennycook, G. and Rand, D. G. (2018). Lazy, not biased: Susceptibility to partisan fake news is better explained by lack of reasoning than by motivated reasoning. Cognition, 188:39–50.
Poletto, F., Basile, V., Sanguinetti, M., Bosco, C., and Patti, V. (2021). Resources and benchmark corpora for hate speech detection: a systematic review. Language Resources and Evaluation, 55(3):477–523.
Salles, I., Vargas, F., and Benevenuto, F. (2025). HateBRXplain: A benchmark dataset with human-annotated rationales for explainable hate speech detection in Brazilian Portuguese. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6659–6669, Abu Dhabi, UAE.
Sap, M., Card, D., Gabriel, S., Choi, Y., and Smith, N. A. (2019). The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy.
Sap, M., Swayamdipta, S., Vianna, L., Zhou, X., Choi, Y., and Smith, N. A. (2022). Annotators with attitudes: How annotator beliefs and identities bias toxic language detection. In Carpuat, M., de Marneffe, M.-C., and Meza Ruiz, I. V., editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5884–5906, Seattle, United States.
Soprano, M., Roitero, K., La Barbera, D., Ceolin, D., Spina, D., Demartini, G., and Mizzaro, S. (2024). Cognitive biases in fact-checking and their countermeasures: A review. Information Processing & Management, 61(3).
Stryker, C. S. (2024). What is responsible AI? International Business Machines (IBM).
Tsvetkov, Y., Prabhakaran, V., and Voigt, R. (2019). Socially responsible natural language processing. In Companion Proceedings of The 2019 World Wide Web Conference, WWW ’19, page 1326, New York, USA.
Vargas, F., Carvalho, I., Hürriyetoğlu, A., Pardo, T., and Benevenuto, F. (2023a). Socially responsible hate speech detection: Can classifiers reflect social stereotypes? In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 1187–1196, Varna, Bulgaria.
Vargas, F., Carvalho, I., Pardo, T., and Benevenuto, F. (2024a). Context-aware and expert data resources for Brazilian Portuguese hate speech detection. Natural Language Processing, pages 1–22.
Vargas, F., Carvalho, I., Rodrigues de Góes, F., Pardo, T., and Benevenuto, F. (2022). HateBR: A large expert annotated corpus of Brazilian Instagram comments for offensive language and hate speech detection. In Proceedings of the 13th Language Resources and Evaluation Conference, pages 7174–7183, Marseille, France.
Vargas, F., Carvalho, I., Schmeisser-Nieto, W., Benevenuto, F., and Pardo, T. (2023b). NoHateBrazil: A Brazilian Portuguese text offensiveness analysis system. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 1180–1186, Varna, Bulgaria.
Vargas, F., Guimarães, S., Muhammad, S. H., Alves, D., Ahmad, I. S., Abdulmumin, I., Mohamed, D., Pardo, T., and Benevenuto, F. (2024b). HausaHate: An expert annotated corpus for Hausa hate speech detection. In Proceedings of the 8th Workshop on Online Abuse and Harms, pages 52–58, Mexico City, Mexico.
Vargas, F., Jaidka, K., Pardo, T., and Benevenuto, F. (2023c). Predicting sentence-level factuality of news and bias of media outlets. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 1197–1206, Varna, Bulgaria.
Vargas, F., Rodrigues de Góes, F., Carvalho, I., Benevenuto, F., and Pardo, T. (2021). Contextual-lexicon approach for abusive language detection. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 1438–1447, Held Online.
Vargas, F., Salles, I., Alves, D., Agrawal, A., Pardo, T. A. S., and Benevenuto, F. (2024c). Improving explainable fact-checking via sentence-level factual reasoning. In Proceedings of the 7th Fact Extraction and VERification Workshop, pages 192–204, Miami, USA.
Wardle, C. (2024). A Conceptual Analysis of the Overlaps and Differences between Hate Speech, Misinformation and Disinformation. Department of Peace Operations (DPO). Office of the Special Adviser on the Prevention of Genocide (OSAPG). United Nations.
Westwood, S. J., Iyengar, S., Walgrave, S., Leonisio, R., Miller, L., and Strijbis, O. (2018). The tie that divides: Cross-national evidence of the primacy of partyism. European Journal of Political Research, 57:333–354.
Wu, J., Liu, Q., Xu, W., and Wu, S. (2022). Bias mitigation for evidence-aware fake news detection by causal intervention. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, pages 2308–2313, New York, USA.
Published
2025-07-20
How to Cite
VARGAS, Francielle; PARDO, Thiago; BENEVENUTO, Fabrício. Socially Responsible and Explainable Automated Fact-Checking and Hate Speech Detection. In: THESIS AND DISSERTATION CONTEST (CTD), 38., 2025, Maceió/AL. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 75-84. ISSN 2763-8820. DOI: https://doi.org/10.5753/ctd.2025.8511.
