Assessing Frontier LLMs in Solving Game Development Problems: Preliminary Findings Across Three Game Engines
Abstract
This paper evaluates three frontier LLMs (ChatGPT-4o, o3, and Gemini 2.5 Pro) against expert-rated human answers for 30 technical questions collected from online Q&A forums about three popular game engines: Unreal, Unity, and Godot. Our results reveal significant performance variance, with o3 outperforming both Gemini 2.5 Pro and ChatGPT-4o. A primary weakness identified across all models was response completeness: AI-generated answers often lacked the comprehensive detail of the human baseline. These findings suggest that although LLMs are powerful assistants, they are not yet a substitute for human expertise in engine-based game development tasks.
Published
September 23, 2025
How to Cite
VASCONCELOS, Thiago Guedes Cruz de; CASTRO FILHO, Adams Amaral de; RIBEIRO, Guadalupe Prado Saldanha; RODRIGUES, Maria Andréia F.; MENDONÇA, Nabor C. Assessing Frontier LLMs in Solving Game Development Problems: Preliminary Findings Across Three Game Engines. In: WORKSHOP SOBRE ENGENHARIA DE SOFTWARE PARA DESENVOLVIMENTO DE JOGOS (SE4GAMES), 1., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 33-40. DOI: https://doi.org/10.5753/se4games.2025.14869.