NLP Pipeline for Gender Bias Detection in Portuguese Literature

  • Mariana O. Silva (UFMG)
  • Mirella M. Moro (UFMG)

Abstract


We present a novel Natural Language Processing (NLP) pipeline designed to analyze gender bias in Portuguese literary works. The pipeline comprises five processing steps, culminating in gender bias detection across different linguistic dimensions. We apply it to a corpus of Portuguese literary texts and evaluate its effectiveness in uncovering gender bias. Our findings reveal prevalent gender stereotypes in character descriptions: female characters are often associated with appearance and emotion, while male characters are depicted in terms of social status and personality traits. In addition, our analysis of physical-trait stereotypes indicates a more equitable representation across genders in that dimension.
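
The kind of comparison reported above (which descriptor categories dominate for each gender) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not name the five steps, so the input format, the category labels, the toy data, and the helper `category_shares` are all assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy input: (character gender, descriptor category) pairs.
# In a real pipeline these would be produced by earlier processing steps;
# the categories and values here are invented for illustration only.
descriptors = [
    ("female", "appearance"),
    ("female", "emotion"),
    ("female", "appearance"),
    ("female", "social_status"),
    ("male", "social_status"),
    ("male", "personality"),
    ("male", "social_status"),
    ("male", "appearance"),
]


def category_shares(pairs):
    """Return, per gender, the fraction of descriptors in each category."""
    counts = defaultdict(Counter)
    for gender, category in pairs:
        counts[gender][category] += 1

    shares = {}
    for gender, counter in counts.items():
        total = sum(counter.values())
        shares[gender] = {cat: n / total for cat, n in counter.items()}
    return shares


shares = category_shares(descriptors)
for gender, dist in shares.items():
    print(gender, dist)
```

Comparing the resulting per-gender distributions side by side is one simple way to surface skews such as appearance terms concentrating on one gender; a real analysis would also need significance testing and a far larger sample.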

Published
21/07/2024
SILVA, Mariana O.; MORO, Mirella M. NLP Pipeline for Gender Bias Detection in Portuguese Literature. In: SEMINÁRIO INTEGRADO DE SOFTWARE E HARDWARE (SEMISH), 51., 2024, Brasília/DF. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 169-180. ISSN 2595-6205. DOI: https://doi.org/10.5753/semish.2024.2914.