A Case Study on Software Aging in LLM-Generated Python Applications

Gustavo Costa; Cesar Santos; Roberto Natella; Ermeson Andrade

doi:10.5753/wtf.2026.23233

Gustavo Costa UFRPE
Cesar Santos UFRPE
Roberto Natella GSSI
Ermeson Andrade UFRPE

Abstract

Large Language Models (LLMs) are increasingly used to generate software systems automatically. However, the long-term operational behavior of these applications remains poorly understood. In particular, it is still unclear whether LLM-generated systems exhibit progressive performance degradation and growing resource consumption during prolonged execution, the characteristic symptoms of software aging. This work presents an experimental case study investigating possible aging effects in four Python applications generated with ChatGPT from scenarios in the BaxBench benchmark. The systems were executed under continuous workload while memory consumption and response time were monitored. The Mann-Kendall test and Sen’s slope estimator were applied to detect temporal trends in the collected data. The results indicate consistent growth in memory consumption across the evaluated applications, which suggests software aging, while response time degradation was observed in only one case.
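
The trend analysis named in the abstract can be illustrated with a minimal sketch. It is not the authors' analysis script: the function names, the sample values, and the choice to omit the tie correction in the variance are all assumptions made for illustration. The Mann-Kendall statistic S counts rising minus falling pairs of observations, and Sen's slope is the median of all pairwise slopes.

```python
import math
from itertools import combinations
from statistics import median


def mann_kendall(series):
    """Return (S, Z, two-sided p-value) for the Mann-Kendall trend test.

    Simplification (assumption): the variance below omits the tie correction.
    """
    n = len(series)
    # S sums the sign of (later - earlier) over all ordered pairs.
    s = sum((b > a) - (b < a) for a, b in combinations(series, 2))
    var_s = n * (n - 1) * (2 * n + 5) / 18
    # Continuity-corrected normal approximation of S.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under N(0, 1)
    return s, z, p


def sen_slope(series):
    """Median of all pairwise slopes (y_j - y_i) / (j - i) for i < j."""
    slopes = [
        (series[j] - series[i]) / (j - i)
        for i, j in combinations(range(len(series)), 2)
    ]
    return median(slopes)


# Hypothetical memory samples (MB) taken at equal intervals; not data from the paper.
memory_mb = [100.0, 100.4, 100.3, 100.9, 101.2, 101.1, 101.8, 102.0, 102.3, 102.9]
s, z, p = mann_kendall(memory_mb)
slope = sen_slope(memory_mb)
print(f"S={s}, Z={z:.2f}, p={p:.4f}, Sen's slope={slope:.3f} MB/sample")
```

A small p-value together with a positive Sen's slope would be read as a statistically significant upward trend, which is the kind of evidence the abstract reports for memory consumption.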

References

ABBASSI, A. A. et al. A taxonomy of inefficiencies in LLM-generated Python code. In: Proceedings of the 41st IEEE International Conference on Software Maintenance and Evolution (ICSME 2025). [S.l.]: IEEE, 2025. p. 393–404.

ARAUJO, J. et al. Experimental evaluation of software aging effects on the Eucalyptus cloud computing infrastructure. In: Proceedings of the Middleware 2011 Industry Track Workshop. [S.l.]: ACM, 2011. p. 4:1–4:7.

CHEN, P.; QI, Y.; HOU, D. Chaos: Accurate and realtime detection of aging-oriented failure using entropy. CoRR, abs/1502.00781, 2015. Available at: [link].

DORA, S. et al. The hidden risks of LLM-generated web application code: A security-centric evaluation of code generation capabilities in large language models. In: Information Systems Security. [S.l.]: Springer, 2025, (Lecture Notes in Computer Science, v. 16380). p. 27–37.

GARG, S. et al. A methodology for detection and estimation of software aging. In: Proceedings of the Ninth International Symposium on Software Reliability Engineering (ISSRE). [S.l.]: IEEE, 1998. p. 283–292.

GROTTKE, M.; MATIAS JR., R.; TRIVEDI, K. S. The fundamentals of software aging. In: 2008 IEEE International Conference on Software Reliability Engineering Workshops (ISSRE Wksp). [S.l.]: IEEE, 2008. p. 1–6.

GROTTKE, M. et al. Analysis of software aging in a web server. IEEE Transactions on Reliability, v. 55, n. 3, p. 411–420, 2006.

JIANG, J. et al. A survey on large language models for code generation. ACM Transactions on Software Engineering and Methodology, v. 35, n. 2, p. 1–72, 2026.

MATIAS JR., R.; FREITAS FILHO, P. J. An experimental study on software aging and rejuvenation in web servers. In: 30th Annual International Computer Software and Applications Conference (COMPSAC’06). [S.l.]: IEEE, 2006. p. 189–196.

KHOURY, R. et al. How secure is code generated by ChatGPT? In: 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). [S.l.]: IEEE, 2023. p. 2445–2451.

LIU, F. et al. Beyond functional correctness: Exploring hallucinations in LLM-generated code. IEEE Transactions on Software Engineering, PP, n. 99, p. 1–21, 2026.

MANN, H. B. Nonparametric tests against trend. Econometrica, v. 13, n. 3, p. 245–259, 1945.

SANTOS, C.; ANDRADE, E.; NATELLA, R. Investigating software aging in LLM-generated software systems. In: 2025 IEEE 36th International Symposium on Software Reliability Engineering Workshops (ISSREW). [S.l.]: IEEE, 2025. p. 314–321.

SEN, P. K. Estimates of the regression coefficient based on Kendall’s tau. Journal of the American Statistical Association, v. 63, n. 324, p. 1379–1389, 1968.

SILVA, R. J. M. et al. Adaptive detection of software aging under workload shift. In: Simpósio em Sistemas Computacionais de Alto Desempenho (SSCAD). [S.l.]: SBC, 2025. p. 242–253.

SUN, X. et al. Quality assurance of LLM-generated code: Addressing non-functional quality characteristics. CoRR, abs/2511.10271, 2025. Available at: [link].

TRIVEDI, K. S.; VAIDYANATHAN, K. Software aging and rejuvenation. In: Wiley Encyclopedia of Computer Science and Engineering. [S.l.]: Wiley, 2007.

VERO, M. et al. BaxBench: Can LLMs generate correct and secure backends? arXiv preprint arXiv:2502.11844, 2025.

YETIŞTIREN, B. et al. Evaluating the code quality of AI-assisted code generation tools: An empirical study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. CoRR, abs/2304.10778, 2023. Available at: [link].