Using the Kappa Agreement Coefficient to Evaluate an AI-Supported Sentiment Analysis

  • Ana Kessilly Chiachio Cerqueira (IFBA)
  • Melques Santos Paiva (IFBA)
  • Danilo Guimarães Souza Azevedo (IFBA)
  • Djan Almeida Santos (IFBA)
  • Crescencio Lima (IFBA)
  • Luis Paulo da Silva Carvalho (IFBA)

Abstract

The increasing production of textual data on social media platforms makes Sentiment Analysis (SA) a crucial tool for extracting valuable insights. This paper addresses the transformation of SA driven by the rise of Large Language Models (LLMs), such as Google Gemini. Our central contribution lies in assessing the reliability of applying an AI model to classify polarity and emotion in YouTube comments. To this end, we employed Cohen's Kappa agreement coefficient to measure the degree of agreement between the LLM and two human evaluators. The results demonstrated Moderate agreement between the AI and the humans, as well as between the human evaluators themselves. The entire process was consolidated into a functional web application, Pulso Emocional.
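For context, Cohen's Kappa corrects the raw agreement between two raters for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal illustration of the kind of comparison the paper describes, not the authors' implementation: the polarity labels are invented, while cohen_kappa_score is the standard scikit-learn routine for the coefficient.

    # Minimal sketch: agreement between an LLM rater and a human rater on
    # polarity labels. The labels below are illustrative, not the paper's data.
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical polarity judgments for the same ten YouTube comments.
    llm_labels   = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neu"]
    human_labels = ["pos", "neg", "pos", "pos", "neg", "neu", "neu", "neg", "pos", "neu"]

    # kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    # p_e the agreement expected by chance from the raters' label frequencies.
    kappa = cohen_kappa_score(llm_labels, human_labels)
    print(f"Cohen's kappa: {kappa:.3f}")

On the Landis and Koch (1977) scale cited in the references, values from 0.41 to 0.60 correspond to the Moderate agreement the paper reports between the AI and the human evaluators.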

References

Berry, K. J. and Mielke Jr., P. W. (1988). A generalization of Cohen's kappa agreement measure to interval measurement and multiple raters. Educational and Psychological Measurement, 48(4):921–933.

Brennan, R. L. and Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41(3):687–699.

Buscemi, A. and Proverbio, D. (2024). ChatGPT vs Gemini vs LLaMA on multilingual sentiment analysis. arXiv preprint.

Chalkias, I., Tzafilkou, K., Karapiperis, D., and Tjortjis, C. (2023). Learning analytics on YouTube educational videos: Exploring sentiment analysis methods and topic clustering. Electronics, 12(18):3949.

Chamid, A. A., Widowati, and Kusumaningrum, R. (2024). Labeling consistency test of multi-label data for aspect and sentiment classification using the Cohen Kappa method. Ingénierie des Systèmes d'Information, 29(1):161–167.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88(2):322.

Ekin, S. (2023). Prompt engineering for ChatGPT: A quick guide to techniques, tips, and best practices. TechRxiv preprint.

Islam, M., Kabir, M., Ghani, N. A., Zamli, K., Zulkifli, N., Rahman, M. M., and Moni, M. (2024). Challenges and future in deep learning for sentiment analysis: A comprehensive review and a proposed novel hybrid approach. Artificial Intelligence Review, 57.

Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.

Liu, B. (2012). Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.

Lu, K. and Liang, H. (2025). NCL-NLP at SemEval-2025 Task 11: Using prompting engineering framework and low rank adaptation of large language models for multi-label emotion detection. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025).

Munoz, S. R. and Bangdiwala, S. I. (1997). Interpretation of kappa and b statistics measures of agreement. Journal of Applied Statistics, 24(1):105–112.

Qi, S., Gui, L., He, Y., and Yuan, Z. (2025). A survey of automatic hallucination evaluation on natural language generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Runeson, P. and Höst, M. (2009). Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering, 14(2):131–164.

Sharma, N. A., Ali, A., and Kabir, M. A. (2025). A review of sentiment analysis: tasks, applications, and deep learning techniques. International Journal of Data Science and Analytics, 19:351–388.

Stefanovitch, N. et al. (2022). Resources and experiments on sentiment classification using polarity labels. In Proceedings of the Language Resources and Evaluation Conference (LREC 2022).

Tang, W., Hu, J., Zhang, H., Wu, P., and He, H. (2015). Kappa coefficient: a popular measure of rater agreement. Shanghai Archives of Psychiatry, 27(1):62–67.

Zhang, L., Wang, S., and Liu, B. (2018). Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1253.
Published

2026-05-25

CERQUEIRA, Ana Kessilly Chiachio; PAIVA, Melques Santos; AZEVEDO, Danilo Guimarães Souza; SANTOS, Djan Almeida; LIMA, Crescencio; CARVALHO, Luis Paulo da Silva. Using the Kappa Agreement Coefficient to Evaluate an AI-Supported Sentiment Analysis. In: NEW IDEAS AND EMERGING RESULTS TRACK IN INFORMATION SYSTEMS - POSITION PAPERS - BRAZILIAN SYMPOSIUM ON INFORMATION SYSTEMS (SBSI), 22., 2026, Vitória/ES. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 307-319. DOI: https://doi.org/10.5753/sbsi_estendido.2026.249091.