Attentionsmelling: Using Large Language Models to Identify Code Smells

  • Anderson Gomes (UECE)
  • Denis Sousa (UECE)
  • Paulo Maia (UECE)
  • Matheus Paixao (UECE)

Abstract


Large Language Models (LLMs) are becoming essential tools in software engineering, automating tasks such as code generation, unit testing, and code review. However, their potential to identify code smells, indicators of poor code quality, remains underexplored. This study evaluates GPT-4's effectiveness in identifying three common code smells (Long Method, God Class, and Feature Envy) across four experimental setups, ranging from using only source code and code smell definitions to leveraging additional context, metrics, and hyperparameter optimization. Our analysis revealed notable improvements across all metrics as the prompt was enriched with additional information, with overall performance increasing by 64% in ROC curve and 56% in F1-score. These results emphasize the impact of incorporating metrics and hyperparameter tuning into LLM prompts, enabling significant advances in automated software quality assessment, which may in turn foster better coding practices, particularly around code smell identification.
Keywords: Large Language Models, Code Smell Detection, Software Quality, Prompt Engineering, Software Maintenance, Neural Networks
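The exact prompt templates, metric set, and tuned hyperparameter values are not reproduced on this page, so the sketch below is only an illustrative assumption of the setup the abstract describes: a prompt that combines a code smell definition, static metrics for the code under analysis, and the source snippet itself, sent to GPT-4 through the OpenAI chat completions API with explicitly chosen sampling hyperparameters. All prompt wording, metric names, and parameter values are hypothetical.

```python
# Illustrative sketch only: the prompt wording, metric names, and hyperparameter
# values below are assumptions, not the study's actual configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LONG_METHOD_DEFINITION = (
    "Long Method: a method that has grown too long, concentrates too many "
    "responsibilities, and is hard to understand and maintain."
)

def detect_long_method(source_code: str, loc: int, cyclomatic_complexity: int) -> str:
    """Ask GPT-4 whether the given method exhibits the Long Method smell."""
    prompt = (
        "Code smell definition:\n"
        f"{LONG_METHOD_DEFINITION}\n\n"
        "Metrics for the method under analysis:\n"
        f"- Lines of code: {loc}\n"
        f"- Cyclomatic complexity: {cyclomatic_complexity}\n\n"
        "Source code:\n"
        f"{source_code}\n\n"
        "Answer strictly 'yes' or 'no': does this method contain the Long Method smell?"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are an expert in software quality and code smell detection."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.0,  # hypothetical tuned value; the study optimizes such hyperparameters
        top_p=1.0,
    )
    return response.choices[0].message.content.strip()
```

A richer variant of the same sketch could attach surrounding class context to the prompt, mirroring the more informed setups the abstract mentions.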

Published
22/09/2025
GOMES, Anderson; SOUSA, Denis; MAIA, Paulo; PAIXAO, Matheus. Attentionsmelling: Using Large Language Models to Identify Code Smells. In: SIMPÓSIO BRASILEIRO DE ENGENHARIA DE SOFTWARE (SBES), 39. , 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 271-281. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.9921.