Exploring the Potential and Feasibility of Open-Source LLMs in Sentiment Analysis

  • Breno Braga Neves PUC-Rio
  • Theo Sousa PUC-Rio
  • Daniel Coutinho PUC-Rio
  • Alessandro Garcia PUC-Rio
  • Juliana Alves Pereira PUC-Rio

Abstract


Sentiment analysis tools are commonly used in software engineering (SE) to understand developer communication in collaborative environments such as GitHub. As state-of-the-art tools can underperform, newer tools based on large language models (LLMs) are being adopted, though they can be computationally expensive. This study evaluates three open-source models: Llama3, Gemma, and Mistral. Using a dataset of GitHub discussions, it examines how these models perform and how prompt engineering influences their results. Findings show that these open-source LLMs offer performance comparable to state-of-the-art tools, making them viable, cost-effective alternatives. The study also assesses the advantages and limitations of different prompting strategies.

References

Ain, Q. T., Ali, M., Riaz, A., Noureen, A., Kamran, M., Hayat, B., and Rehman, A. (2017). Sentiment analysis using deep learning techniques: a review. International Journal of Advanced Computer Science and Applications, 8(6).

Barbosa, C., Uchôa, A., Coutinho, D., Assunção, W. K., Oliveira, A., Garcia, A., Fonseca, B., Rabelo, M., Coelho, J. E., Carvalho, E., et al. (2023). Beyond the code: Investigating the effects of pull request conversations on design decay. In 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 1–12. IEEE.

Barbosa, C., Uchôa, A., Coutinho, D., Falcão, F., Brito, H., Amaral, G., Soares, V., Garcia, A., Fonseca, B., Ribeiro, M., et al. (2020). Revealing the social aspects of design decay: A retrospective study of pull requests. In Proceedings of the XXXIV Brazilian Symposium on Software Engineering, pages 364–373.

Braga, B. (2024). Complementary material. [link]. Accessed: September 2024.

Brown, T. B., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners.

Coutinho, D., Cito, L., Lima, M. V., Arantes, B., Pereira, J. A., Arriel, J., Godinho, J., Martins, V., Libório, P., Leite, L., Garcia, A., Assunção, W. K. G., Steinmacher, I., Baffa, A., and Fonseca, B. (2024). "Looks good to me ;-)": Assessing sentiment analysis tools for pull request discussions. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), page 11, Salerno, Italy. ACM.

Graziotin, D., Wang, X., and Abrahamsson, P. (2014). Happy software developers solve problems better: psychological measurements in empirical software engineering. PeerJ, 2:e289.

Graziotin, D., Wang, X., and Abrahamsson, P. (2015). How do you feel, developer? an explanatory theory of the impact of affects on programming performance. PeerJ Computer Science, 1:e18.

Gururangan, S., Marasovic, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., and Smith, N. A. (2020). Don’t stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964.

Hasan, M. A., Das, S., Anjum, A., Alam, F., Anjum, A., Sarker, A., and Noori, S. R. H. (2024). Zero- and few-shot prompting with llms: A comparative study with fine-tuned models for bangla sentiment analysis. arXiv preprint arXiv:2308.10783v2.

Herrmann, M. and Klünder, J. (2021). From textual to verbal communication: Towards applying sentiment analysis to a software project meeting. Leibniz University Hannover.

Hou, G. and Lian, Q. (2024). Benchmarking of commercial large language models: Chatgpt, mistral, and llama. Shanghai Quangong AI Lab. DOI: 10.21203/rs.3.rs-4376810/v1.

Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. (2024). The model arena for cross-lingual sentiment analysis: A comparative study in the era of large language models. arXiv preprint arXiv:2406.19358v1.

Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling laws for neural language models.

Mo, K., Liu, W., Xu, X., Yu, C., Zou, Y., and Xia, F. (2024). Fine-tuning gemma-7b for enhanced sentiment analysis of financial news headlines. arXiv preprint arXiv:2406.13626.

Niimi, J. (2024). Dynamic sentiment analysis with local large language models using majority voting: A study on factors affecting restaurant evaluation. arXiv preprint arXiv:2407.13069.

Ramesh, K., Sitaram, S., and Choudhury, M. (2023). Fairness in language models beyond english: Gaps and challenges. arXiv preprint arXiv:2302.12578.

Siino, M. (2024). Transmistral at semeval-2024 task 10: Using mistral 7b for emotion discovery and reasoning its flip in conversation. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), pages 298–304. Association for Computational Linguistics.

Touvron, H., Lavril, T., Izacard, G., et al. (2023a). Llama: Open and efficient foundation language models.

Touvron, H., Martin, L., Stone, K., et al. (2023b). Large language models performance comparison of emotion and sentiment classification. arXiv preprint arXiv:2407.04050v1.

Tsay, J., Dabbish, L., and Herbsleb, J. (2014). Influence of social and technical factors for evaluating contribution in github. In Proceedings of the 36th International Conference on Software Engineering (ICSE), pages 356–366. ACM.

Vorakitphan, V., Basic, M., and Meline, G. L. (2024). Deep content understanding toward entity and aspect target sentiment analysis on foundation models. Proceedings of the 41st International Conference on Machine Learning.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., and Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

Xing, F. (2024). Designing heterogeneous llm agents for financial sentiment analysis. arXiv preprint arXiv:2401.05799.

Yu, Y., Wang, H., Filkov, V., Devanbu, P., and Vasilescu, B. (2015). Wait for it: Determinants of pull request evaluation latency on github. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories (MSR), pages 367–371. IEEE.

Zhan, T., Shi, C., Shi, Y., Li, H., and Lin, Y. (2024). Optimization techniques for sentiment analysis based on llm (gpt-3). arXiv preprint arXiv:2405.09770.

Zhang, W., Deng, Y., Liu, B., Pan, S. J., and Bing, L. (2023a). Sentiment analysis in the era of large language models: A reality check. arXiv preprint arXiv:2305.15005.

Zhang, X., Li, S., Hauer, B., Shi, N., and Kondrak, G. (2023b). Don’t trust chatgpt when your question is not in english: A study of multilingual abilities and types of llms. arXiv preprint arXiv:2305.16339.
Published
2024-09-30
NEVES, Breno Braga; SOUSA, Theo; COUTINHO, Daniel; GARCIA, Alessandro; PEREIRA, Juliana Alves. Exploring the Potential and Feasibility of Open-Source LLMs in Sentiment Analysis. In: SOFTWARE ENGINEERING UNDERGRADUATE RESEARCH COMPETITION - BRAZILIAN CONFERENCE ON SOFTWARE: THEORY AND PRACTICE (CBSOFT), 15., 2024, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 89-98. DOI: https://doi.org/10.5753/cbsoft_estendido.2024.4106.