Measuring how changes in code readability attributes affect code quality evaluation by Large Language Models

Abstract


Code readability is one of the main aspects of code quality, influenced by properties such as identifier names, comments, code structure, and adherence to coding standards. However, measuring this attribute poses challenges in both industry and academia. While static analysis tools assess attributes such as code smells and comment percentage, code reviews introduce an element of subjectivity. This paper explores the use of Large Language Models (LLMs) to evaluate code quality attributes related to readability in a standardized, reproducible, and consistent manner. We conducted a quasi-experimental study to measure the effects of code changes on LLMs' assessments of the readability quality attribute. Nine LLMs were tested under three interventions: removing comments, replacing identifier names with obscure names, and refactoring to remove code smells. Each intervention involved 10 batch analyses per LLM, with data collected on response variability. We compared the results with a known reference model and tool. The results showed that all LLMs were sensitive to the interventions; agreement with the reference classifier was high for the original and refactored code scenarios but diverged for the other two interventions. The LLMs demonstrated a strong semantic sensitivity that the reference model did not fully capture. A thematic analysis of the LLMs' reasoning confirmed that their evaluations directly reflected the nature of each intervention. The models also exhibited response variability, with 9.37% to 14.58% of executions showing a standard deviation greater than zero, indicating response oscillation, though this did not always compromise the statistical significance of the results. LLMs demonstrated potential for evaluating semantic quality aspects, such as the coherence of identifier names, comments, and documentation with the code's purpose. Further research is needed to compare these evaluations with human assessments and to explore real-world application limitations, including cost factors.

Keywords: Code Quality, Code Comprehensibility, Static Analysis, Software Engineering, LLM, ChatGPT, Gemini, Llama, Claude
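To make the measurements summarized in the abstract concrete, the sketch below illustrates, with purely hypothetical data and names (llm_runs, reference, to_binary are assumptions, not the study's artifacts), how response variability (the share of executions with a standard deviation greater than zero) and agreement with a binary reference classifier could be computed; it is a minimal illustration, not the paper's actual analysis pipeline.

```python
from statistics import stdev, mean

# Hypothetical readability scores (1-5) returned by one LLM for each snippet
# across 10 repeated batch analyses; values are illustrative, not study data.
llm_runs = {
    "snippet_original": [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
    "snippet_no_comments": [3, 3, 2, 3, 3, 3, 3, 3, 3, 3],
    "snippet_obscure_names": [2, 2, 2, 1, 2, 2, 2, 2, 2, 2],
}

# Hypothetical verdicts (1 = readable, 0 = unreadable) from a reference classifier.
reference = {"snippet_original": 1, "snippet_no_comments": 1, "snippet_obscure_names": 0}

def to_binary(score, threshold=3):
    """Map an ordinal readability score onto the reference classifier's binary scale."""
    return 1 if score >= threshold else 0

# Response variability: executions whose repeated runs do not all agree.
variable = [name for name, scores in llm_runs.items() if stdev(scores) > 0]
print(f"Executions with stdev > 0: {len(variable)}/{len(llm_runs)}")

# Simple agreement rate between the LLM's rounded mean verdict and the reference.
agreement = mean(
    to_binary(round(mean(scores))) == reference[name]
    for name, scores in llm_runs.items()
)
print(f"Agreement with reference classifier: {agreement:.2f}")
```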

Published
2025-09-22
SIMÕES, Igor Regis da Silva; VENSON, Elaine. Measuring how changes in code readability attributes affect code quality evaluation by Large Language Models. In: BRAZILIAN SYMPOSIUM ON SOFTWARE ENGINEERING (SBES), 39., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 13-24. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.9603.