ABSTRACT
With the changes in human interaction prompted by the development of internet communication platforms, hate speech and offensive language have emerged as a contemporary problem. Social networks allow users with different opinions and backgrounds to interact without face-to-face contact. This distance gives users a sense of safety when promoting hate speech, an effect that is even stronger in anonymous environments. Imageboards are sites composed of multiple boards, each covering a different topic. On some boards, anonymous users widely promote hate speech. However, only a few works in the literature have focused on hate speech in imageboard content. This work aims to classify Brazilian Portuguese texts to detect hate speech, using data from the Brazilian 55chan imageboard to build a dataset with hate speech content. Three classifiers were trained for binary hate speech classification. The Linear Support Vector Classifier achieved the best result, with an F1-score of 0.955.
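For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1-score is computed for a binary classifier. The labels below are hypothetical toy values (1 = hate speech, 0 = not hate speech) and are not taken from the 55chan dataset; the function name `f1_score` is chosen here for illustration only.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1-score of the positive class for two label sequences.

    F1 is the harmonic mean of precision and recall, which simplifies to
    2*TP / (2*TP + FP + FN).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical toy labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

print(round(f1_score(y_true, y_pred), 3))
```

In this toy case there are four true positives, one false positive, and one false negative, so precision and recall are both 0.8 and the F1-score is 0.8 as well.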
Index Terms
- Hate speech detection using Brazilian imageboards