Gender Bias in Portuguese Literary Texts: A Masked Language Model Approach

  • Mariana O. Silva UFMG
  • Michele A. Brandão UFMG
  • Mirella M. Moro UFMG

Abstract


In this work, we investigate how a corpus of Portuguese literary texts shapes the gender bias of a language model, using a masked language modeling approach. We fine-tune a BERT-based model on a curated corpus of 592 literary works and analyze gender associations in its adjective and verb predictions. Our results show that fine-tuning shifts gender associations: negative associations decrease significantly for female-denoting targets, while male-denoting targets retain a negative bias. For verbs, gender disparities decrease, though male subjects retain stronger links to intellectual and work-related verbs. These findings highlight how literary texts can shape gender representations in language models, reinforcing or reshaping biases depending on the training data.

Published
2025-09-29
SILVA, Mariana O.; BRANDÃO, Michele A.; MORO, Mirella M. Gender Bias in Portuguese Literary Texts: A Masked Language Model Approach. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 407-419. DOI: https://doi.org/10.5753/stil.2025.37842.