Using the Kappa Agreement Coefficient to Evaluate an AI-Supported Sentiment Analysis

  • Ana Kessilly Chiachio Cerqueira (IFBA)
  • Melques Santos Paiva (IFBA)
  • Danilo Guimarães Souza Azevedo (IFBA)
  • Djan Almeida Santos (IFBA)
  • Crescencio Lima (IFBA)
  • Luis Paulo da Silva Carvalho (IFBA)

Abstract

The increasing production of textual data on social media platforms makes Sentiment Analysis (SA) a crucial tool for extracting valuable insights. This paper addresses the transformation of SA driven by the rise of Large Language Models (LLMs), such as Google Gemini. Our central contribution lies in assessing the reliability of applying an AI model to classify polarity and emotion in YouTube comments. To this end, we employed Cohen's Kappa agreement coefficient to measure the degree of agreement between the LLM and two human evaluators. The results demonstrated Moderate agreement between the AI and the humans, as well as between the human evaluators themselves. The entire process was consolidated into a functional web application, Pulso Emocional.
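For context, Cohen's Kappa corrects the raw agreement between two raters for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal illustration of the kind of comparison the paper describes, not the authors' implementation: the polarity labels are invented, while cohen_kappa_score is the standard scikit-learn routine for the coefficient.

    # Minimal sketch: agreement between an LLM rater and a human rater on
    # polarity labels. The labels below are illustrative, not the paper's data.
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical polarity judgments for the same ten YouTube comments.
    llm_labels   = ["pos", "neg", "neu", "pos", "neg", "pos", "neu", "neg", "pos", "neu"]
    human_labels = ["pos", "neg", "pos", "pos", "neg", "neu", "neu", "neg", "pos", "neu"]

    # kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    # p_e the agreement expected by chance from the raters' label frequencies.
    kappa = cohen_kappa_score(llm_labels, human_labels)
    print(f"Cohen's kappa: {kappa:.3f}")

On the Landis and Koch (1977) scale cited in the references, values from 0.41 to 0.60 correspond to the Moderate agreement the paper reports between the AI and the human evaluators.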

References

Berry, K. J. and Mielke Jr., P. W. (1988). A generalization of Cohen's kappa agreement measure to interval measurement and multiple raters. Educational and Psychological Measurement, 48(4):921–933.

Brennan, R. L. and Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41(3):687–699.

Buscemi, A. and Proverbio, D. (2024). ChatGPT vs Gemini vs LLaMA on multilingual sentiment analysis. arXiv preprint.

Chalkias, I., Tzafilkou, K., Karapiperis, D., and Tjortjis, C. (2023). Learning analytics on YouTube educational videos: Exploring sentiment analysis methods and topic clustering. Electronics, 12(18):3949.

Chamid, A. A., Widowati, and Kusumaningrum, R. (2024). Labeling consistency test of multi-label data for aspect and sentiment classification using the Cohen Kappa method. Ingénierie des Systèmes d'Information, 29(1):161–167.

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Conger, A. J. (1980). Integration and generalization of kappas for multiple raters. Psychological Bulletin, 88(2):322.

Ekin, S. (2023). Prompt engineering for ChatGPT: A quick guide to techniques, tips, and best practices. TechRxiv preprint.

Islam, M., Kabir, M., Ghani, N. A., Zamli, K., Zulkifli, N., Rahman, M. M., and Moni, M. (2024). Challenges and future in deep learning for sentiment analysis: A comprehensive review and a proposed novel hybrid approach. Artificial Intelligence Review, 57.

Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.

Liu, B. (2012). Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.

Lu, K. and Liang, H. (2025). NCL-NLP at SemEval-2025 Task 11: Using prompting engineering framework and low rank adaptation of large language models for multi-label emotion detection. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025).

Munoz, S. R. and Bangdiwala, S. I. (1997). Interpretation of kappa and b statistics measures of agreement. Journal of Applied Statistics, 24(1):105–112.

Qi, S., Gui, L., He, Y., and Yuan, Z. (2025). A survey of automatic hallucination evaluation on natural language generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.

Runeson, P. and Höst, M. (2009). Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering, 14(2):131–164.

Sharma, N. A., Ali, A., and Kabir, M. A. (2025). A review of sentiment analysis: tasks, applications, and deep learning techniques. International Journal of Data Science and Analytics, 19:351–388.

Stefanovitch, N. et al. (2022). Resources and experiments on sentiment classification using polarity labels. In Proceedings of the Language Resources and Evaluation Conference (LREC 2022).

Tang, W., Hu, J., Zhang, H., Wu, P., and He, H. (2015). Kappa coefficient: a popular measure of rater agreement. Shanghai Archives of Psychiatry, 27(1):62–67.

Zhang, L., Wang, S., and Liu, B. (2018). Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1253.
Published

2026-05-25

CERQUEIRA, Ana Kessilly Chiachio; PAIVA, Melques Santos; AZEVEDO, Danilo Guimarães Souza; SANTOS, Djan Almeida; LIMA, Crescencio; CARVALHO, Luis Paulo da Silva. Using the Kappa Agreement Coefficient to Evaluate an AI-Supported Sentiment Analysis. In: NEW IDEAS AND EMERGING RESULTS TRACK IN INFORMATION SYSTEMS - POSITION PAPERS - BRAZILIAN SYMPOSIUM ON INFORMATION SYSTEMS (SBSI), 22., 2026, Vitória/ES. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 307-319. DOI: https://doi.org/10.5753/sbsi_estendido.2026.249091.