Creation and Characterization of a Sexist Discourse Corpus in Portuguese


  • M. Luisa P. Braga Universidade Federal do Amazonas (UFAM)
  • Fabíola G. Nakamura Universidade Federal do Amazonas (UFAM)
  • Eduardo F. Nakamura Universidade Federal do Amazonas (UFAM)



sexism, hate speech, data science


Social interest in sexism has grown as women overcome barriers of gender inequality. Sexist discourse propagates and encourages derogatory and abusive behavior against women. Accurately characterizing and identifying such discourse is key to addressing and mitigating this violence. In this work, we present a corpus of sexist discourse in Portuguese collected from widely popular news portals. The paper makes three main contributions: (1) the process of creating the corpus and labeling comments (sexist / non-sexist); (2) the characterization and analysis of the corpus and of the behavior of anonymous labelers; (3) an initial assessment of machine learning techniques for classifying comments as sexist / non-sexist. Preliminary results show that, with a support vector machine, it is possible to identify sexist comments with an F1 measure above 0.8, precision above 0.9, and recall close to 0.8.




Braga, M. L. P., Nakamura, F. G., & Nakamura, E. F. (2021). Creation and Characterization of a Sexist Discourse Corpus in Portuguese. ISys - Brazilian Journal of Information Systems, 14(2), 79–95.



