How Close Is ChatGPT to Developer Judgment? A Study on Stack Overflow Java Questions

  • Felipe Augusto Guimarães Reis (IFTM)
  • Marcelo A. Maia (UFU)
  • Carlos Eduardo C. Dantas (IFTM)

Abstract


Software developers often seek assistance on platforms such as Stack Overflow. With the emergence of Large Language Models (LLMs) such as ChatGPT, however, the way developers seek help online is gradually changing. This shift does not guarantee the accuracy of the information provided, as LLMs may fail to accurately understand complex, domain-specific content and produce incorrect responses. This study evaluates how closely ChatGPT's choices align with those of Stack Overflow users when identifying accurate answers to technical questions. For this purpose, 776 Java-related questions were selected from Stack Overflow. For each question, ChatGPT was asked to analyze five provided answers from Stack Overflow users and identify the one it considered most accurate. The results show that ChatGPT's choice matches the answer accepted by Stack Overflow users in 56% of cases. In the remaining 44%, where ChatGPT diverged from the accepted answer, manual analysis revealed that its selected answer was still technically accurate in many instances, even though it was not marked as accepted on Stack Overflow. Notably, 31% of these divergent choices were posted after Stack Overflow users had already chosen the accepted answer. This suggests that some questions on Stack Overflow have multiple valid answers, including more recent ones that are as accurate as the accepted answer displayed at the top of the page.
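
The evaluation step described above can be outlined in a few lines of code. The following is a minimal sketch in Python, assuming access to the OpenAI chat completions SDK; the model name (gpt-4o-mini), the prompt wording, and the helper names pick_best_answer and agreement_rate are illustrative assumptions, not the study's actual setup.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def pick_best_answer(question_title: str, question_body: str,
                     answers: list[str]) -> int:
    # Present the five candidate answers as a numbered list.
    numbered = "\n\n".join(
        f"Answer {i + 1}:\n{text}" for i, text in enumerate(answers)
    )
    prompt = (
        "Below is a Stack Overflow Java question and five answers.\n"
        "Reply with only the number (1-5) of the most accurate answer.\n\n"
        f"Question: {question_title}\n{question_body}\n\n{numbered}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: the abstract does not name a model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce variability between runs
    )
    # Naive parse; a real pipeline would validate the reply is in 1..5.
    return int(response.choices[0].message.content.strip())

def agreement_rate(picks: list[int], accepted: list[int]) -> float:
    # Fraction of questions where the model's pick matches the index
    # of the community-accepted answer.
    matches = sum(p == a for p, a in zip(picks, accepted))
    return matches / len(picks)

Applied to the 776 sampled questions, the reported 56% agreement rate corresponds to roughly 435 questions where the model's pick and the accepted answer coincide.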

Published
2025-09-22
REIS, Felipe Augusto Guimarães; MAIA, Marcelo A.; DANTAS, Carlos Eduardo C. How Close Is ChatGPT to Developer Judgment? A Study on Stack Overflow Java Questions. In: SOFTWARE ENGINEERING UNDERGRADUATE RESEARCH COMPETITION - BRAZILIAN CONFERENCE ON SOFTWARE: THEORY AND PRACTICE (CBSOFT), 16., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 56-65. DOI: https://doi.org/10.5753/cbsoft_estendido.2025.14164.