GEM: A Framework for Strengthening LLM-Generated Unit Tests Using Mutation Feedback
Abstract
Large Language Models (LLMs) are increasingly used for automated unit test generation and can produce executable tests with substantial structural coverage. However, recent empirical studies indicate that such tests often rely on weak or superficial assertions, yielding limited fault-detection capability despite high coverage. This paper introduces GEM (Generate-Execute-Mutate), an automated framework that systematically strengthens test oracles to improve the mutation adequacy of LLM-generated unit tests. GEM integrates three stages into a unified pipeline: LLM-based test synthesis, execution-driven self-repair of failing tests, and mutation-guided oracle refinement. The framework follows a modular hexagonal architecture and supports multiple programming languages through pluggable adapters for test execution, coverage analysis, and mutation testing. GEM was evaluated on three established benchmarks across Python, Java, and C++, using multiple state-of-the-art LLMs, and was compared with the search-based test generation tool Pynguin. Experimental results reveal a persistent gap between coverage and mutation score in baseline LLM-generated tests. Under the evaluated setup, mutation-guided strengthening improved mutation scores on Python and yielded smaller gains on Java, while execution-driven self-repair improved test executability across several model and dataset combinations.
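To make the pipeline concrete, the sketch below illustrates how a Generate-Execute-Mutate loop could be wired behind a language-agnostic port, in the spirit of the hexagonal architecture described above. It is a minimal illustration, not GEM's actual implementation: every identifier (LanguageAdapter, run_gem, the prompt strings, the round limits) is an assumption introduced here for exposition.

"""Minimal sketch of a Generate-Execute-Mutate loop.

All names (LanguageAdapter, run_gem, llm, ...) are illustrative
assumptions, not the paper's actual API.
"""
from dataclasses import dataclass
from typing import Callable, Protocol


class LanguageAdapter(Protocol):
    """Port for language-specific tooling (the hexagonal 'adapter' side)."""

    def run_tests(self, source: str, tests: str) -> list[str]:
        """Execute the tests against the source; return failing test names."""

    def surviving_mutants(self, source: str, tests: str) -> list[str]:
        """Run mutation testing; return descriptions of surviving mutants."""


@dataclass
class GemResult:
    tests: str
    failing: list[str]
    survivors: list[str]


def run_gem(
    source: str,
    adapter: LanguageAdapter,
    llm: Callable[[str], str],  # prompt -> code; any chat model would do
    repair_rounds: int = 3,
    strengthen_rounds: int = 3,
) -> GemResult:
    # Stage 1: LLM-based test synthesis.
    tests = llm(f"Write unit tests for:\n{source}")

    # Stage 2: execution-driven self-repair of failing tests.
    for _ in range(repair_rounds):
        failing = adapter.run_tests(source, tests)
        if not failing:
            break
        tests = llm(
            f"These tests fail: {failing}\nFix them.\n"
            f"Source:\n{source}\nTests:\n{tests}"
        )

    # Stage 3: mutation-guided oracle refinement. Surviving mutants
    # point at behaviors the current assertions do not check.
    for _ in range(strengthen_rounds):
        survivors = adapter.surviving_mutants(source, tests)
        if not survivors:
            break
        tests = llm(
            f"These mutants survive: {survivors}\n"
            f"Add assertions that kill them.\nTests:\n{tests}"
        )

    return GemResult(
        tests,
        adapter.run_tests(source, tests),
        adapter.surviving_mutants(source, tests),
    )

In such a design, each concrete adapter would wrap a per-language toolchain (for example, pytest plus a mutation tool such as mutmut for Python, or JUnit plus PIT for Java), which is what would keep the loop itself language-agnostic.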
