Evaluating LLM-Generated Unit Tests with Mutation Testing: ChatGPT vs DeepSeek
Abstract
Recent advances in Large Language Models (LLMs) have driven significant progress in automating software testing, particularly in generating unit tests. However, the effectiveness of these models at detecting real defects, as measured by mutation testing, remains underexplored in the literature. This study addresses this gap by evaluating the performance of ChatGPT (GPT-4o) and DeepSeek V3 in generating unit tests for six Java classes from the Defects4J dataset, covering different levels of cyclomatic complexity. The main objective is to investigate the ability of LLMs to maximize mutant coverage and elimination, while also analyzing the impact of code complexity and of semantic factors related to execution failures. The methodology involved generating tests via structured prompts, executing each generated suite five times per class for both models, and performing quantitative analysis based on Mutation Coverage (MC) and Mutation Score (MS), complemented by qualitative analysis of runtime failures. Results indicate that DeepSeek exhibits greater stability and effectiveness in eliminating mutants, whereas ChatGPT demonstrates broader applicability by producing valid test suites for a wider range of classes. Moreover, no significant correlation was found between cyclomatic complexity and compilation success; failures were primarily linked to semantic limitations of the models. This study contributes quantitative and qualitative evidence on the application of LLMs to automated test generation, offering insights for future AI-driven test-engineering strategies.
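For readers unfamiliar with the two metrics, a minimal sketch under the standard mutation-testing definitions follows; the study itself may parameterize them differently, and the equivalent-mutant adjustment in MS is a common convention from the mutation-testing literature rather than something stated in this abstract:

\[
\mathrm{MC} = \frac{\lvert \text{mutants exercised by the test suite} \rvert}{\lvert \text{mutants generated} \rvert},
\qquad
\mathrm{MS} = \frac{\lvert \text{mutants killed} \rvert}{\lvert \text{mutants generated} \rvert - \lvert \text{equivalent mutants} \rvert}.
\]

Under these definitions, a suite that exercises 80 of 100 generated mutants and kills 60 of them, with 5 mutants equivalent, would score MC = 0.80 and MS = 60/95 ≈ 0.63.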
