Data Augmentation for Improving Hate Speech Detection on Social Networks

Lígia Iunes Venturott; Patrick Marques Ciarelli

Lígia Iunes Venturott UFES
Patrick Marques Ciarelli UFES

Abstract

The recent growth of social networks has led to an increase in the dissemination of hate speech online and, as a consequence, studies on hate speech detection have emerged. Some of them are based on supervised learning techniques, which require large labeled databases, but such databases remain scarce. Most existing hate speech databases consist of texts in English. Although some databases in Portuguese exist, they are small, which can limit both the performance of simple deep neural networks and the use of more complex architectures. To work around this limitation, this paper analyses the use of Data Augmentation on texts to improve the training of recurrent neural networks (LSTM) and convolutional neural networks (CNN) applied to the hate speech detection task. Data Augmentation is a regularization process used with deep neural networks to reduce overfitting. These techniques are common in computer vision; however, due to the complexity of natural languages, they are used less frequently in tasks involving text. In this paper, we experiment on a public hate speech database in Portuguese, applying Data Augmentation techniques such as text generation from the original database. The experiments show an improvement in the results, suggesting that the techniques are promising.

Keywords: neural networks, NLP, hate speech, data augmentation