Creation and Characterization of a Sexist Discourse Corpus in Portuguese


  • M. Luisa P. Braga Universidade Federal do Amazonas (UFAM)
  • Fabíola G. Nakamura Universidade Federal do Amazonas (UFAM)
  • Eduardo F. Nakamura Universidade Federal do Amazonas (UFAM)



sexism, hate speech, data science


Social interest in sexism has grown as women overcome barriers of gender inequality. Sexist discourse propagates and encourages derogatory and abusive behavior against women, so characterizing and identifying it accurately is key to addressing and mitigating that violence. In this work, we present a corpus of sexist discourse in Portuguese collected from widely read news portals. The paper makes three main contributions: (1) the process of creating the corpus and labeling comments (sexist / non-sexist); (2) the characterization and analysis of the corpus and of the behavior of anonymous labelers; (3) an initial assessment of machine learning techniques for classifying comments as sexist / non-sexist. Preliminary results show that a support vector machine can identify sexist comments with an F1 measure above 0.8, precision above 0.9, and recall close to 0.8.
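To make the relationship between the three reported metrics concrete, the following minimal sketch computes precision, recall, and F1 for the positive (sexist) class from raw counts. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class from raw counts.

    tp: comments correctly classified as sexist
    fp: non-sexist comments wrongly classified as sexist
    fn: sexist comments the classifier missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts (an assumption for illustration, not the paper's data).
tp, fp, fn = 80, 8, 20
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it always lies between precision and recall and is pulled toward the lower of the two, which is why a precision above 0.9 paired with a recall close to 0.8 yields an F1 somewhat above 0.8.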







How to Cite

Braga, M. L. P., Nakamura, F. G., & Nakamura, E. F. (2021). Creation and Characterization of a Sexist Discourse Corpus in Portuguese. ISys - Brazilian Journal of Information Systems, 14(2), 79–95.



Extended versions of selected articles