An Empirical Study on the Effectiveness of Iterative LLM-Based Improvements for Static Analysis Issues
Abstract
Maintaining and evolving software systems often demands more effort than their initial development. Automated support for improving source code quality can reduce technical debt, increase maintainability, and enhance developer productivity. This paper presents an experimental approach that integrates static analysis with Large Language Models (LLMs) to automate source code improvement. The proposed pipeline iteratively processes Java classes: issues detected by SonarQube are extracted and transformed into prompts for an LLM, which generates an improved version of the code. Each version is reanalyzed, and the process repeats until convergence or a predefined iteration limit is reached. The experimental setup covers multiple configurations combining two LLMs (GPT-4-mini and Gemini) with variations in temperature, prompt style, and number of iterations. Evaluations were conducted on multiple Java datasets, with three repeated runs on the Commons Lang repository to identify behavioral patterns. The analysis focuses on the reduction in the number of issues, the decrease in technical debt (as measured by SonarQube), and the evolution of issue severity. Functional correctness was assessed manually by inspecting and executing the improved code to ensure behavior preservation. The results show that combining SonarQube with LLMs effectively reduces code issues, achieving an average reduction of over 58% in key scenarios, while preserving functionality. The iterative process successfully guided the models to incrementally improve code quality based on real static analysis feedback. This work contributes a reproducible and extensible pipeline, offers insights into the impact of LLM configurations, and supports further research on integrating AI with software quality engineering.
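To make the analyze-prompt-regenerate loop concrete, the following is a minimal sketch of the iterative pipeline described above. It is illustrative only: the helper names (analyze_with_sonarqube, ask_llm_to_fix), the iteration limit, and the acceptance check are assumptions for exposition, not the authors' actual implementation or a real SonarQube or LLM API.

```python
# Minimal sketch of the iterative improvement loop (assumed structure, not the
# authors' code). Helper functions are hypothetical placeholders.
from pathlib import Path

MAX_ITERATIONS = 5  # assumed iteration limit; the study's limit may differ


def analyze_with_sonarqube(java_source: str) -> list[dict]:
    """Placeholder: run static analysis and return issues,
    e.g. [{'rule': ..., 'message': ..., 'line': ...}, ...]."""
    raise NotImplementedError


def ask_llm_to_fix(java_source: str, issues: list[dict]) -> str:
    """Placeholder: build a prompt from the issues and ask the LLM
    (e.g. GPT or Gemini) for an improved version of the class."""
    raise NotImplementedError


def improve_class(path: Path) -> str:
    source = path.read_text(encoding="utf-8")
    for _ in range(MAX_ITERATIONS):
        issues = analyze_with_sonarqube(source)
        if not issues:          # convergence: no remaining issues
            break
        candidate = ask_llm_to_fix(source, issues)
        # Accept the candidate only if it does not increase the issue count.
        if len(analyze_with_sonarqube(candidate)) <= len(issues):
            source = candidate
    return source
```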
