An Empirical Study of LLM-Based Source Code Quality Assessment under ISO/IEC 5055:2021

  • Daniel Pérez-Morera Universidad de Costa Rica
  • Enrique Vílchez-Lizano Universidad de Costa Rica
  • Keilor Rodriguez-Artavia Universidad de Costa Rica
  • Christian Quesada-López Universidad de Costa Rica
  • Marcelo Jenkins Universidad de Costa Rica

Abstract


Many software companies must meet demanding quality standards when developing their applications. Rather than replacing traditional SAST tools based on fixed rules, approaches such as GemCA leverage a Large Language Model (LLM) to perform semantic reasoning over code against the ISO/IEC 5055:2021 standard, providing a complementary decision-support mechanism for early-stage quality assessment. This paper presents GemCA and reports an empirical analysis of its accuracy on a dataset of known code weaknesses. The goal is to examine the framework's ability to apply the criteria defined by ISO/IEC 5055:2021. Results show that GemCA achieves an average accuracy of 81% across 15 repetitions, with significantly higher performance in C# than in PHP. However, accuracy varies substantially across weakness categories (p < 0.0001), indicating sensitivity to both CWE type and programming language. These findings highlight both the potential and the current limitations of LLM-based analysis for ISO/IEC 5055:2021 compliance.
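The abstract describes GemCA as an LLM that performs semantic reasoning over code against ISO/IEC 5055:2021 weaknesses. The sketch below is a minimal, hypothetical illustration of that style of analysis, not GemCA's actual implementation (which the abstract does not detail): the function names `build_prompt`, `assess`, and the stub `fake_llm` are invented here, and a real deployment would call an LLM endpoint instead of the stub.

```python
# Illustrative sketch only (not GemCA's implementation): ask a model whether a
# snippet exhibits a given CWE from the ISO/IEC 5055:2021 weakness lists, then
# parse a binary verdict from its reply.

def build_prompt(code: str, cwe_id: str, cwe_name: str) -> str:
    """Compose a zero-shot classification prompt for one weakness."""
    return (
        "You are a source code auditor applying ISO/IEC 5055:2021.\n"
        f"Weakness under review: {cwe_id} ({cwe_name}).\n"
        "Answer WEAK or CLEAN, nothing else.\n\n"
        f"```\n{code}\n```"
    )

def parse_verdict(reply: str) -> bool:
    """True if the model flags the snippet as exhibiting the weakness."""
    return reply.strip().upper().startswith("WEAK")

def assess(code: str, cwe_id: str, cwe_name: str, ask_model) -> bool:
    """Run one weakness check; `ask_model` is any prompt -> str callable."""
    return parse_verdict(ask_model(build_prompt(code, cwe_id, cwe_name)))

# Stub standing in for a real LLM endpoint, so the sketch runs offline:
fake_llm = lambda prompt: "WEAK" if "strcpy" in prompt else "CLEAN"
print(assess("strcpy(dst, src);", "CWE-120",
             "Buffer Copy without Checking Size of Input", fake_llm))  # True
```

Averaging such binary verdicts over repeated runs (the paper reports 15 repetitions) is what yields an accuracy estimate per CWE category and language.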

Published
11/05/2026
PÉREZ-MORERA, Daniel; VÍLCHEZ-LIZANO, Enrique; RODRIGUEZ-ARTAVIA, Keilor; QUESADA-LÓPEZ, Christian; JENKINS, Marcelo. An Empirical Study of LLM-Based Source Code Quality Assessment under ISO/IEC 5055:2021. In: CONGRESSO IBERO-AMERICANO EM ENGENHARIA DE SOFTWARE (CIBSE), 29., 2026, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 46-60.