Social bots detection in Brazilian presidential elections using natural language processing
Abstract
In recent years, the number of users participating in social networks has grown significantly. Social networks have proven quite effective at spreading opinions and influencing people, since a message can be shared with thousands of people in a few minutes. However, this ability has also been exploited to manipulate opinions and spread misinformation and fake news. A common way of doing this is through bots: computer algorithms that mimic human behavior by disseminating topics and news, demonstrating support for or rejection of public figures, and interacting with other users, all of which can affect democratic discussion. For this reason, the present work presents and compares approaches for detecting social bots using posts by Twitter users collected during the 2018 Brazilian presidential election period. Using a dataset of Twitter users labeled as bots or humans, this research applies five natural language processing (NLP) techniques to extract features from the content of the users' messages on the social network. To analyze the impact of the NLP-extracted features on the bot detection task, five different classifiers were tested, together with pre-processing and feature selection techniques. The best results were obtained by combining all of the extracted features and using the Random Forest classifier, which achieved an accuracy of 0.91 for the bot class and an AUC of 0.83.
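As an illustration of what the reported AUC measures, the following is a minimal sketch in plain Python. It is not the evaluation code used in this work: the function name `roc_auc` and the toy labels and scores are hypothetical, introduced only for the example. It computes AUC as the probability that a randomly chosen bot receives a higher classifier score than a randomly chosen human, counting ties as one half.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Return the area under the ROC curve using pairwise comparison.

    labels: 1 for bot, 0 for human (hypothetical encoding for this sketch).
    scores: classifier scores, where higher means "more likely a bot".
    """
    bot_scores = [s for y, s in zip(labels, scores) if y == 1]
    human_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not bot_scores or not human_scores:
        raise ValueError("both classes must be present")

    wins = 0.0
    for b in bot_scores:
        for h in human_scores:
            if b > h:
                wins += 1.0   # bot correctly ranked above human
            elif b == h:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(bot_scores) * len(human_scores))


# Made-up toy data, not taken from the dataset used in this work.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
```

In this toy example, eight of the nine bot/human pairs are ranked correctly, so `toy_auc` is 8/9. A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of bots from humans.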