A Process for Building Datasets that Enable the Application of Different Methods for Detecting Fake News and Social Bots

Jeferson Luis Gonçalves; Lucas Barboza de Menezes Torres; Paulo Márcio Souza Freire; Ronaldo Ribeiro Goldschmidt

doi:10.5753/sbsi.2025.246620

Jeferson Luis Gonçalves (IME)
Lucas Barboza de Menezes Torres (FAETEC / ETEOT)
Paulo Márcio Souza Freire (FAETEC)
Ronaldo Ribeiro Goldschmidt (IME)

Abstract

Context: The spread of fake news on social media is a pressing concern, and the dissemination of such news by social bots has made disinformation detection more complex. Machine learning methods have been applied to classify news as fake or not fake, and accounts as bot or not bot, using labeled datasets. Problem: No existing dataset contains both news labeled as fake or not fake and accounts labeled as bot or not bot. This gap hinders the evaluation of classification methods that could benefit from having both kinds of labels together. Solution: A process for building datasets that contain appropriately labeled pieces of news and accounts, enabling the development and comparison of fake news and social bot detection methods. IS Theory: General Systems Theory and Social Network Theory. Method: The data requirements of state-of-the-art (SOTA) fake news and social bot detection methods guided the development of the process, which collects data from social networks and fact-checking agencies. A case study generated a dataset, illustrating the viability of the process. Summary of Results: The generated dataset is public and contains 440 labeled pieces of news and 6,274 labeled accounts. Most fake news detection methods performed better when they considered the account labels. Contributions and Impact on the IS field: The process, which builds datasets that integrate labeled news and labeled accounts, and the dataset generated by the case study. Both contributions relate to the Grand Challenges in IS Research and the Sociotechnical Vision of IS.

Keywords: Fake News, Social Bots, Social Networks, Datasets, Machine Learning
