Analyzing the Quality and Effectiveness of LLM-Generated Code: A Study with Problems from the LeetCode Platform
Abstract
Large Language Models (LLMs) are revolutionizing software engineering by automating code generation. Despite the resulting gains in developer productivity, concerns remain about the quality and accuracy of the generated code. This study evaluates the effectiveness and quality of 2,464 code solutions generated by GPT-4, Gemini, Claude 3 Haiku, and Llama for 616 LeetCode problems in Python. The models achieve accuracy rates above 50%. Our analysis reveals that model performance is sensitive to the problem’s formulation. Furthermore, while the code accumulates significant technical debt in aggregate (including 2.8K maintainability issues), the average incidence per individual solution remains low.
References
Alberts, I. L., Mercolli, L., Pyka, T., Prenosil, G., Shi, K., Rominger, A., and Afshar-Oromieh, A. (2023). Large language models (llm) and chatgpt: what will the impact on nuclear medicine be? European journal of nuclear medicine and molecular imaging, 50(6):1549–1552.
Almeida, A., Xavier, L., and Valente, M. T. (2024). Automatic library migration using large language models: First results. In 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 1–7.
Alshahwan, N., Chheda, J., Finogenova, A., Gokkaya, B., Harman, M., Harper, I., Marginean, A., Sengupta, S., and Wang, E. (2024). Automated unit test improvement using large language models at meta. In 32nd ACM International Conference on the Foundations of Software Engineering (FSE), pages 185–196.
Becker, B. A., Denny, P., Finnie-Ansley, J., Luxton-Reilly, A., Prather, J., and Santos, E. A. (2023). Programming is hard - or at least it used to be: Educational opportunities and challenges of ai code generation. In 54th ACM Technical Symposium on Computer Science Education V. 1, pages 500–506.
Billah, M. M., Roy, P. R., Codabux, Z., and Roy, B. (2024). Are large language models a threat to programming platforms? an exploratory study. In 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pages 292–301.
Canagasuriam, D. and Lukacik, E.-R. (2024). Chatgpt, can you take my job interview? examining artificial intelligence cheating in the asynchronous video interview. International Journal of Selection and Assessment.
Coignion, T., Quinton, C., and Rouvoy, R. (2024). A performance study of llm-generated code on leetcode. In 28th International Conference on Evaluation and Assessment in Software Engineering (EASE), EASE ’24, pages 79–89, New York, NY, USA. Association for Computing Machinery.
Dakhel, A. M., Majdinasab, V., Nikanjam, A., Khomh, F., Desmarais, M. C., and Jiang, Z. M. J. (2023). Github copilot ai pair programmer: Asset or liability? Journal of Systems and Software, 203:111734.
Finnie-Ansley, J., Denny, P., Becker, B. A., Luxton-Reilly, A., and Prather, J. (2022). The robots are coming: Exploring the implications of openai codex on introductory programming. In Proceedings of the 24th Australasian computing education conference, pages 10–19.
Guimaraes, E., Nascimento, N., Nelapati, A., and Shivalingaiah, C. (2025). Analyzing prominent llms: An empirical study of performance and complexity in solving leetcode problems. In 29th International Conference on Evaluation and Assessment in Software Engineering (EASE), pages 7–16.
Kruskal, W. H. and Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American statistical Association, 47(260):583–621.
Liu, Y., Le-Cong, T., Widyasari, R., Tantithamthavorn, C., Li, L., Le, X.-B. D., and Lo, D. (2024). Refining chatgpt-generated code: Characterizing and mitigating code quality issues. ACM Trans. on Software Engineering and Methodology, 33(5):1–26.
Lopes, M. and Hora, A. (2022). How and why we end up with complex methods: A multi-language study. Empirical Software Engineering, 27:1–42.
Mastropaolo, A., Pascarella, L., Guglielmi, E., Ciniselli, M., Scalabrino, S., Oliveto, R., and Bavota, G. (2023a). On the robustness of code generation techniques: An empirical study on github copilot. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 2149–2160. IEEE.
Mastropaolo, A., Pascarella, L., Guglielmi, E., Ciniselli, M., Scalabrino, S., Oliveto, R., and Bavota, G. (2023b). On the robustness of code generation techniques: An empirical study on github copilot. In 45th International Conference on Software Engineering (ICSE), pages 2149–2160.
Nguyen, N. and Nadi, S. (2022). An empirical evaluation of github copilot’s code suggestions. In The 2022 Mining Software Repositories Conference: MSR 2022: 18-20 May 2022, Virtual; 23-24 May 2022, Pittsburgh, Pennsylvania: Proceedings, pages 1–5. Association for Computing Machinery.
Oertel, J., Klünder, J., and Hebig, R. (2025). Don’t settle for the first! how many github copilot solutions should you check? Information and Software Technology, 183:107737.
Reeves, B., Sarsa, S., Prather, J., Denny, P., Becker, B. A., Hellas, A., Kimmel, B., Powell, G., and Leinonen, J. (2023). Evaluating the performance of code generation models for solving parsons problems with small prompt variations. In Procs of Conf. on Innovation and Technology in Computer Science Education, pages 299–305.
Rocha, O. V., Brito, A., Cleiton Tavares, L. X., and Assis, S. (2024). Analisando a qualidade do código em plataformas de cursos online abertos e massivos. In 12th Workshop on Software Visualization, Maintenance and Evolution (VEM). XV Brazilian Conference on Software: Theory and Practice (CBSoft), pages 1–12.
Rubio, C., Mella, F., Martínez, C., Segura, A., and Vidal, C. (2023). Exploring copilot github to automatically solve programming problems in computer science courses. In 2023 42nd IEEE International Conference of the Chilean Computer Science Society (SCCC), pages 1–8.
Silva, L. L., Silva, J. R. d., Montandon, J. E., Andrade, M., and Valente, M. T. (2024). Detecting code smells using chatgpt: Initial insights. In 18th International Symposium on Empirical Software Engineering and Measurement, pages 400–406.
Su, H., Ai, J., Yu, D., and Zhang, H. (2023). An evaluation method for large language models’ code generation capability. In 2023 10th International Conference on Dependable Systems and Their Applications (DSA), pages 831–838.
Taecharungroj, V. (2023). “what can chatgpt do?” analyzing early reactions to the innovative ai chatbot on twitter. Big Data and Cognitive Computing, 7(1):35.
Vaithilingam, P., Zhang, T., and Glassman, E. L. (2022). Expectation vs experience: Evaluating the usability of code generation tools powered by large language models. In Chi conference on human factors in computing systems, pages 1–7.
Wang, J. and Chen, Y. (2023). A review on code generation with llms: Application and evaluation. In 2023 IEEE International Conference on Medical Artificial Intelligence (MedAI), pages 284–289.
Welsh, M. (2022). The end of programming. Commun. ACM, 66(1):34–35.
Published
2025-09-22
How to Cite
AQUINO, Bernardo; BRITO, Aline; TAVARES, Cleiton; BOECHAT, Danilo; BATISTELI, João Pedro. Analyzing the Quality and Effectiveness of LLM-Generated Code: A Study with Problems from the LeetCode Platform. In: WORKSHOP ON SOFTWARE VISUALIZATION, EVOLUTION AND MAINTENANCE (VEM), 13., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 25-36. DOI: https://doi.org/10.5753/vem.2025.14303.
