When Annotators Disagree: A Controlled Evaluation of Gender Bias in Sentiment Analysis Using Synthetic Datasets

  • Érica Carneiro CEFET/RJ
  • Alexander Feitosa CEFET/RJ
  • Gustavo Guedes CEFET/RJ

Abstract


This study investigates gender-related annotation bias in sentiment classification. First, we introduce a controlled synthetic dataset generation method that simulates parallel male and female sentiment labels with adjustable inter-annotator agreement. Then, we present the Gender Comparison Methodology, which trains classifiers separately on gender-partitioned labels and evaluates their predictions on shared textual inputs. Agreement is assessed with Cohen’s Kappa, the chi-square test, and Cramér’s V. Results show that even moderate disagreement between annotators leads to systematic model divergence, highlighting the importance of annotator identity in shaping classification behavior and informing fairness-aware auditing practices.
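To make the two measurement steps described above concrete, the Python sketch below simulates parallel "male" and "female" label sets with an adjustable agreement rate, then quantifies their agreement with Cohen's Kappa, a chi-square test of independence, and Cramér's V over the resulting contingency table. This is a minimal illustration, not the authors' released code; the function names, the three-class label scheme, and the uniform-resampling noise model are assumptions made for demonstration.

# Minimal sketch (not the authors' code): simulate parallel gender-partitioned
# sentiment labels with an adjustable agreement rate, then quantify agreement
# with Cohen's kappa, a chi-square test, and Cramer's V. The three-class label
# scheme and the resampling noise model are illustrative assumptions.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

LABELS = np.array([0, 1, 2])  # assumed scheme: negative / neutral / positive

def simulate_parallel_labels(n, agreement=0.8, seed=0):
    # Draw one annotator's labels, then copy each with probability
    # `agreement`; otherwise resample uniformly (so effective agreement
    # sits slightly above the nominal rate, since a resample can coincide).
    rng = np.random.default_rng(seed)
    male = rng.choice(LABELS, size=n)
    keep = rng.random(n) < agreement
    female = np.where(keep, male, rng.choice(LABELS, size=n))
    return male, female

def agreement_report(a, b):
    # Build the contingency table of label pairs, then compute the
    # three agreement statistics used in the paper's evaluation.
    table = np.zeros((len(LABELS), len(LABELS)), dtype=int)
    for x, y in zip(a, b):
        table[x, y] += 1
    kappa = cohen_kappa_score(a, b)
    chi2, p, _, _ = chi2_contingency(table)
    v = np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))  # Cramer's V
    return kappa, chi2, p, v

male, female = simulate_parallel_labels(2000, agreement=0.7)
kappa, chi2, p, v = agreement_report(male, female)
print(f"kappa={kappa:.3f}  chi2={chi2:.1f}  p={p:.3g}  Cramer's V={v:.3f}")

Lowering `agreement` mimics increasing annotator disagreement, and the same report can be run on the predictions of classifiers trained on each label partition to check whether model divergence tracks the injected label divergence.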

Published
2025-09-29
CARNEIRO, Érica; FEITOSA, Alexander; GUEDES, Gustavo. When Annotators Disagree: A Controlled Evaluation of Gender Bias in Sentiment Analysis Using Synthetic Datasets. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 90-100. DOI: https://doi.org/10.5753/stil.2025.37816.