Socially Responsible and Explainable Automated Fact-Checking and Hate Speech Detection

  • Francielle Vargas (USP)
  • Thiago Pardo (USP)
  • Fabrício Benevenuto (UFMG)

Abstract


Disinformation and hate speech form a socially harmful cycle. Research shows that disinformation can amplify hate speech against groups based on their social identity and reinforce harmful stereotypes. To combat this cycle, a wide variety of Natural Language Processing (NLP) methods has been proposed. However, although NLP has historically relied on inherently explainable "white-box" techniques, such as rule-based algorithms, decision trees, hidden Markov models, and logistic regression, the adoption of Large Language Models (LLMs) and language embeddings (often considered "black-box") has significantly reduced interpretability. This lack of transparency introduces considerable risks, including biases, which have become a major concern in AI. This doctoral thesis addresses these critical gaps by proposing new resources that ensure explainability and bias mitigation in NLP models for these tasks. Specifically, the thesis introduces five benchmark datasets (HateBR, HateBRXplain, HausaHate, MOL, and FactNews), three novel methods (SELFAR, SSA, and B+M), and a web system (NoHateBrazil) designed to improve the explainability and fairness of automated fact-checking and hate speech detection. The proposed models outperform the baselines for Portuguese and Hausa, both underrepresented languages. This research contributes to ongoing discussions on responsible and explainable AI, bridging the gap between model performance and interpretability in real-world applications. Finally, the results of this thesis have had a significant impact both nationally and internationally, receiving citations from prestigious universities and research institutes abroad and inspiring new master's and doctoral projects in Brazil.

References

Al Kuwatly, H., Wich, M., and Groh, G. (2020). Identifying and measuring annotator bias based on annotators’ demographic characteristics. In Akiwowo, S., Vidgen, B., Prabhakaran, V., and Waseem, Z., editors, Proceedings of the Fourth Workshop on Online Abuse and Harms, pages 184–190, Held Online.

Amazeen, M. (2015). Revisiting the epistemology of fact-checking. Critical Review, 27(1):1–30.

Chuang, Y.-S., Gao, M., Luo, H., Glass, J., Lee, H.-y., Chen, Y.-N., and Li, S.-W. (2021). Mitigating biases in toxic language detection through invariant rationalization. In Proceedings of the 5th Workshop on Online Abuse and Harms, pages 114–120, Held Online.

Davani, A. M., Atari, M., Kennedy, B., and Dehghani, M. (2023). Hate speech classifiers learn normative social stereotypes. Transactions of the Association for Computational Linguistics, 11:300–319.

Davidson, T., Bhattacharya, D., and Weber, I. (2019). Racial bias in hate speech and abusive language detection datasets. In Proceedings of the 3rd Workshop on Abusive Language Online, pages 25–35, Florence, Italy.

Dixon, L., Li, J., Sorensen, J., Thain, N., and Vasserman, L. (2018). Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, AIES ’18, page 67–73, New York, USA.

Garg, P., Chakravarthy, A. S., Mandal, M., Narang, P., Chamola, V., and Guizani, M. (2021). ISDNet: AI-enabled instance segmentation of aerial scenes for smart cities. ACM Transactions on Internet Technology (TOIT), 21(3):1–18.

Gongane, V. U., Munot, M. V., and Anuse, A. D. (2024). A survey of explainable AI techniques for detection of fake news and hate speech on social media platforms. Journal of Computational Social Science, 7(1):587–623.

Hameleers, M., van der Meer, T., and Vliegenthart, R. (2022). Civilized truths, hateful lies? Incivility and hate speech in false information – evidence from fact-checked statements in the US. Information, Communication & Society, 25(11):1596–1613.

Kennedy, B., Jin, X., Mostafazadeh Davani, A., Dehghani, M., and Ren, X. (2020). Contextualizing hate speech classifiers with post-hoc explanation. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J., editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5435–5442, Held Online.

Kuzmin, G., Larionov, D., Pisarevskaya, D., and Smirnov, I. (2020). Fake news detection for the Russian language. In Proceedings of the 3rd International Workshop on Rumours and Deception in Social Media, pages 45–57, Barcelona, Spain.

Marietta, M., Barker, D. C., and Bowser, T. (2015). Fact-checking polarized politics: Does the fact-check industry provide consistent guidance on disputed realities? The Forum, 13(4):577–596.

Marwick, A. E. and Lewis, B. (2017). Media manipulation and disinformation online. Data and Society Research Institute, pages 1 – 104.

May, C., Wang, A., Bordia, S., Bowman, S. R., and Rudinger, R. (2019). On measuring social biases in sentence encoders. In Burstein, J., Doran, C., and Solorio, T., editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 622–628, Minneapolis, Minnesota.

Nieminen, S. and Rapeli, L. (2019). Fighting misperceptions and doubting journalists’ objectivity: A review of fact-checking literature. Political Studies Review, 17(3):296–309.

Park, S., Park, J. Y., Kang, J.-h., and Cha, M. (2021). The presence of unexpected biases in online fact-checking. Harvard Kennedy School Misinformation Review, 2(1).

Pennycook, G. and Rand, D. G. (2018). Lazy, not biased: Susceptibility to partisan fake news is better explained by lack of reasoning than by motivated reasoning. Cognition, 188:39–50.

Poletto, F., Basile, V., Sanguinetti, M., Bosco, C., and Patti, V. (2021). Resources and benchmark corpora for hate speech detection: a systematic review. Language Resources and Evaluation, 55(3):477–523.

Salles, I., Vargas, F., and Benevenuto, F. (2025). HateBRXplain: A benchmark dataset with human-annotated rationales for explainable hate speech detection in Brazilian Portuguese. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6659–6669, Abu Dhabi, UAE.

Sap, M., Card, D., Gabriel, S., Choi, Y., and Smith, N. A. (2019). The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy.

Sap, M., Swayamdipta, S., Vianna, L., Zhou, X., Choi, Y., and Smith, N. A. (2022). Annotators with attitudes: How annotator beliefs and identities bias toxic language detection. In Carpuat, M., de Marneffe, M.-C., and Meza Ruiz, I. V., editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5884–5906, Seattle, United States.

Soprano, M., Roitero, K., La Barbera, D., Ceolin, D., Spina, D., Demartini, G., and Mizzaro, S. (2024). Cognitive biases in fact-checking and their countermeasures: A review. Information Processing & Management, 61(3).

Stryker, C. S. (2024). What is responsible AI? International Business Machines (IBM).

Tsvetkov, Y., Prabhakaran, V., and Voigt, R. (2019). Socially responsible natural language processing. In Companion Proceedings of The 2019 World Wide Web Conference, WWW ’19, page 1326, New York, USA.

Vargas, F., Carvalho, I., Hürriyetoğlu, A., Pardo, T., and Benevenuto, F. (2023a). Socially responsible hate speech detection: Can classifiers reflect social stereotypes? In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 1187–1196, Varna, Bulgaria.

Vargas, F., Carvalho, I., Pardo, T., and Benevenuto, F. (2024a). Context-aware and expert data resources for Brazilian Portuguese hate speech detection. Natural Language Processing, pages 1–22.

Vargas, F., Carvalho, I., Rodrigues de Góes, F., Pardo, T., and Benevenuto, F. (2022). HateBR: A large expert annotated corpus of Brazilian Instagram comments for offensive language and hate speech detection. In Proceedings of the 13th Language Resources and Evaluation Conference, pages 7174–7183, Marseille, France.

Vargas, F., Carvalho, I., Schmeisser-Nieto, W., Benevenuto, F., and Pardo, T. (2023b). NoHateBrazil: A Brazilian Portuguese text offensiveness analysis system. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 1180–1186, Varna, Bulgaria.

Vargas, F., Guimarães, S., Muhammad, S. H., Alves, D., Ahmad, I. S., Abdulmumin, I., Mohamed, D., Pardo, T., and Benevenuto, F. (2024b). HausaHate: An expert annotated corpus for Hausa hate speech detection. In Proceedings of the 8th Workshop on Online Abuse and Harms, pages 52–58, Mexico City, Mexico.

Vargas, F., Jaidka, K., Pardo, T., and Benevenuto, F. (2023c). Predicting sentence-level factuality of news and bias of media outlets. In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, pages 1197–1206, Varna, Bulgaria.

Vargas, F., Rodrigues de Góes, F., Carvalho, I., Benevenuto, F., and Pardo, T. (2021). Contextual-lexicon approach for abusive language detection. In Proceedings of the International Conference on Recent Advances in Natural Language Processing, pages 1438–1447, Held Online.

Vargas, F., Salles, I., Alves, D., Agrawal, A., Pardo, T. A. S., and Benevenuto, F. (2024c). Improving explainable fact-checking via sentence-level factual reasoning. In Proceedings of the 7th Fact Extraction and VERification Workshop, pages 192–204, Miami, USA.

Wardle, C. (2024). A Conceptual Analysis of the Overlaps and Differences between Hate Speech, Misinformation and Disinformation. Department of Peace Operations (DPO). Office of the Special Adviser on the Prevention of Genocide (OSAPG). United Nations.

Westwood, S. J., Iyengar, S., Walgrave, S., Leonisio, R., Miller, L., and Strijbis, O. (2018). The tie that divides: Cross-national evidence of the primacy of partyism. European Journal of Political Research, 57:333–354.

Wu, J., Liu, Q., Xu, W., and Wu, S. (2022). Bias mitigation for evidence-aware fake news detection by causal intervention. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, page 2308–2313, New York, USA.
Published
20/07/2025
VARGAS, Francielle; PARDO, Thiago; BENEVENUTO, Fabrício. Socially Responsible and Explainable Automated Fact-Checking and Hate Speech Detection. In: CONCURSO DE TESES E DISSERTAÇÕES (CTD), 38. , 2025, Maceió/AL. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025 . p. 75-84. ISSN 2763-8820. DOI: https://doi.org/10.5753/ctd.2025.8511.