Analyzing the Quality and Effectiveness of LLM-Generated Code: A Study with Problems from the LeetCode Platform

Bernardo Aquino; Aline Brito; Cleiton Tavares; Danilo Boechat; João Pedro Batisteli

doi:10.5753/vem.2025.14303

Bernardo Aquino (PUC Minas)
Aline Brito (UFOP)
Cleiton Tavares (PUC Minas)
Danilo Boechat (CEFET-MG)
João Pedro Batisteli (PUC Minas)

Abstract

Large Language Models (LLMs) are revolutionizing software engineering by automating code generation tasks. Although they increase developer productivity, questions remain about the quality and success rate of the generated code. This work investigates the effectiveness and quality of 2,464 code responses generated by GPT-4, Gemini, Claude 3 Haiku, and Llama 3.1 for 616 LeetCode problems in Python. The results show success rates above 50%. The analysis also reveals that the outcome is sensitive to how the problem is phrased and that, although the generated code accumulates substantial technical debt in aggregate (2.8K maintainability issues and 13.9K of technical debt), the average incidence per individual response remains low.
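The aggregate figures above can be related to per-response averages with a short calculation. The sketch below is only illustrative: it assumes one response per model per problem (which matches 4 × 616 = 2,464), it uses the rounded "K" values as quoted, and the function names are hypothetical. The unit of the 13.9K technical-debt figure is not stated here, so the second average is left unitless.

```python
# Figures quoted in the abstract. The "K" values are rounded there, so the
# per-response means computed below are approximations, not reported results.
MODELS = ["GPT-4", "Gemini", "Claude 3 Haiku", "Llama 3.1"]
PROBLEMS = 616
MAINTAINABILITY_ISSUES = 2_800  # "2.8K" in the abstract (rounded)
TECHNICAL_DEBT = 13_900         # "13.9K" in the abstract; unit not stated


def total_responses(n_models: int, n_problems: int) -> int:
    """Number of responses, assuming one response per model per problem."""
    return n_models * n_problems


def mean_per_response(total: float, n_responses: int) -> float:
    """Average incidence of an aggregate count across all responses."""
    return total / n_responses


responses = total_responses(len(MODELS), PROBLEMS)
issues_per_response = mean_per_response(MAINTAINABILITY_ISSUES, responses)
debt_per_response = mean_per_response(TECHNICAL_DEBT, responses)

print(f"responses: {responses}")
print(f"maintainability issues per response: {issues_per_response:.2f}")
print(f"technical debt per response: {debt_per_response:.2f}")
```

Under these assumptions the totals work out to roughly one maintainability issue per response, which is consistent with the statement that the per-response incidence stays low even though the aggregate is large.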
