Comparing Structural Quality of Code Generated by LLMs: A Static Analysis of Code Smells

  • Eduardo Sousa (UFC)
  • Erlene Santos (UFC)
  • Alberto Sampaio (UFC)
  • Carla Bezerra (UFC)

Abstract

The automation of code generation through Large Language Models (LLMs) has emerged as a promising approach to support software development. However, concerns remain regarding the structural quality of the code produced, particularly the presence of code smells that affect maintainability. This paper presents an empirical study comparing code smells in outputs from four major LLMs: ChatGPT, DeepSeek, Amazon CodeWhisperer, and GitHub Copilot. Our analysis of 64 code units, generated from open-source projects, revealed that 60.9% contained at least one code smell. The results show significant variation, with DeepSeek having the lowest incidence of smells (43.8%) and Amazon CodeWhisperer the highest (68.8%). Long Method was the most frequent smell, constituting 40% of all occurrences. These findings provide empirical evidence of the structural quality differences in LLM-generated code and highlight the need for rigorous, automated quality assurance.
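As a concrete illustration of the kind of threshold-based static analysis typically used to flag structural smells such as Long Method, the sketch below counts the line span of each function definition in Python source and flags those over a cut-off. This is a minimal sketch under assumed conventions: the 30-line threshold, the choice of Python as target language, and the file name generated_unit.py are illustrative, not the detection tool or configuration used in the study.

import ast

# Hypothetical cut-off: functions longer than 30 lines are flagged as
# Long Method candidates. The study's actual tool and thresholds are
# not reproduced here.
LONG_METHOD_THRESHOLD = 30

def find_long_methods(source: str) -> list[tuple[str, int]]:
    """Return (name, line_count) pairs for functions exceeding the threshold."""
    tree = ast.parse(source)
    smells = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # lineno/end_lineno give the inclusive source span of the definition
            length = node.end_lineno - node.lineno + 1
            if length > LONG_METHOD_THRESHOLD:
                smells.append((node.name, length))
    return smells

if __name__ == "__main__":
    # generated_unit.py is a hypothetical name for one LLM-generated code unit
    with open("generated_unit.py") as f:
        for name, length in find_long_methods(f.read()):
            print(f"Long Method candidate: {name} ({length} lines)")

Detectors for other metric-based smells (e.g., Long Parameter List) follow the same pattern: compute a structural metric per syntax node and compare it against a calibrated threshold.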

Published

29/09/2025

SOUSA, Eduardo; SANTOS, Erlene; SAMPAIO, Alberto; BEZERRA, Carla. Comparing Structural Quality of Code Generated by LLMs: A Static Analysis of Code Smells. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 391-402. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.12470.
