LLMs as Test Generators: A Comparative Benchmarking Study
Abstract
Automated tests are a key practice the software industry adopts to verify software quality, but they are costly to develop and maintain. Recently, the use of LLMs to generate automated tests has been explored as a viable alternative. Ongoing efforts focus on improving generation by providing richer context and by post-processing the output to correct errors and ensure accurate results. However, small-scale open LLMs, capable of running on modest hardware, have received limited attention. This work compares large-scale LLMs (e.g., GPT and Gemini) with small-scale open-source models in terms of the number of tests generated and their quality, measured by the mutation score, the cyclomatic complexity of the generated code, and the number of test smells it contains. We evaluated 12 small-scale models against 6 large-scale ones and used EvoSuite to establish a baseline for code quality and the number of methods tested. Our results show that some small-scale LLMs perform well in test generation tasks: xLAM, Gemma2, and DeepSeekCoder gave the best overall results, producing as many tests as large-scale models, with fewer smells and a better mutation score.
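To make the three quality criteria concrete, the sketch below shows the shape of a JUnit test that a model under evaluation might emit, with comments noting what each tool measures. The Calculator class, its add method, and the test itself are hypothetical illustrations, not artifacts from the study's benchmark.

import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;

// Hypothetical class under test; stands in for a benchmark class.
class Calculator {
    int add(int a, int b) {
        return a + b;
    }
}

class CalculatorTest {
    // Mutation score (PIT): a seeded fault such as replacing a + b
    // with a - b is "killed" only if the assertion below fails on it.
    // Cyclomatic complexity (Lizard): this method has a single path,
    // so its complexity is 1.
    // Test smells (TsDetect): a descriptive name and a single focused
    // assertion avoid smells such as Assertion Roulette.
    @Test
    void addReturnsSumOfTwoOperands() {
        Calculator calculator = new Calculator();
        assertEquals(5, calculator.add(2, 3));
    }
}

Under the standard definition used by mutation tools such as PIT, the mutation score is the fraction of generated mutants a suite kills, so suites whose assertions actually check behavior score higher than suites that merely execute it.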
Keywords:
Automated Tests, LLM, Experimental Study, Benchmarking
References
Mistral AI. 2024. Open Codestral Mamba (open-codestral-mamba). [link] Open-source code model based on the Mamba2 architecture, released in July 2024.
Mistral AI. 2025. Codestral 25.01. [link] Code generation language model with optimized architecture and improved tokenizer.
Saranya Alagarsamy, Chakkrit Tantithamthavorn, and Aldeida Aleti. 2024. A3Test: Assertion-augmented automated test case generation. Information and Software Technology 176 (2024), 107565.
Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, and Thomas Wolf. 2025. SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language Model. arXiv:2502.02737 [cs.CL] [link]
Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2022. Multi-lingual evaluation of code generation models. arXiv preprint arXiv:2210.14868 (2022).
Martin Juan José Bucher and Marco Martini. 2024. Fine-Tuned 'Small' LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification. arXiv preprint arXiv:2406.08660 (2024).
Henry Coles, Matjaz Jurcenoks, Peter Reilly, and Emma Armstrong. 2016. PIT Mutation Testing Tool. [link]
Google DeepMind. 2024. Gemini 1.5 Flash. [link] Lightweight multimodal language model optimized for speed and efficiency, featuring a 1 million token context window.
Google DeepMind. 2024. Gemini 1.5 Flash-8B. [link] Multimodal language model optimized for high-volume, lower-intelligence tasks; supports audio, image, video, and text inputs with a 1 million token context window.
Brad Everman, Trevor Villwock, Dayuan Chen, Noe Soto, Oliver Zhang, and Ziliang Zong. 2023. Evaluating the carbon impact of large language models at the inference stage. In 2023 IEEE international performance, computing, and communications conference (IPCCC). IEEE, 150–157.
Hugging Face. 2025. Hugging Face. [link]
Sorouralsadat Fatemi and Yuheng Hu. 2023. A comparative analysis of fine-tuned LLMs and few-shot learning of LLMs for financial sentiment analysis. arXiv preprint arXiv:2312.08725 (2023).
Gordon Fraser and Andrea Arcuri. 2014. A large-scale evaluation of automated unit test generation using EvoSuite. ACM Transactions on Software Engineering and Methodology (TOSEM) 24, 2 (2014), 1–42.
Xue-Yong Fu, Md Tahmid Rahman Laskar, Elena Khasanova, Cheng Chen, and Shashi Bhushan TN. 2024. Tiny titans: Can smaller large language models punch above their weight in the real world for meeting summarization? arXiv preprint arXiv:2402.00841 (2024).
Google. 2023. Google Java Style Guide. [link] Accessed: 2025-04-07.
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).
D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv abs/2401.14196 (2024). [link]
Geert Heyman, Rafael Huysegems, Pascal Justen, and Tom Van Cutsem. 2021. Natural language-guided programming. In Proceedings of the 2021 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software. 39–55.
Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, and Wei Chu. 2024. OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models. [link]
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. 2024. Qwen2.5-Coder Technical Report. arXiv preprint arXiv:2409.12186 (2024).
Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. 2023. How long can context length of open-source llms truly promise?. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
Xiang-Jun Liu, Ping Yu, and Xiao-Xing Ma. 2024. An Empirical Study on Automated Test Generation Tools for Java: Effectiveness and Challenges. Journal of Computer Science and Technology 39, 3 (2024), 715–736.
Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. 2023. Analyzing leakage of personally identifiable information in language models. In 2023 IEEE Symposium on Security and Privacy (SP). IEEE, 346–363.
Guillermo Marco, Luz Rello, and Julio Gonzalo. 2024. Small language models can outperform humans in short creative writing: A study comparing SLMs with humans and LLMs. arXiv preprint arXiv:2409.11547 (2024).
Matt Cone. 2020. Markdown Guide. [link].
Aishwarya Narasimhan, Krishna Prasad Agara Venkatesha Rao, et al. 2021. CGEMS: A metric model for automatic code generation using GPT-3. arXiv preprint arXiv:2108.10168 (2021).
Tung Nguyen. 2024. Lizard - A Code Complexity Analyzer Without Pylint. [link] Accessed: 2024-04-06.
OpenAI. 2024. GPT-4o Mini. [link] Lightweight multimodal language model optimized for cost-efficiency and performance, supporting text and vision inputs with a 128K token context window.
Carlos Pacheco and Michael D Ernst. 2007. Randoop: feedback-directed random testing for Java. In Companion to the 22nd ACM SIGPLAN conference on Object-oriented programming systems and applications companion. 815–816.
Anthony Peruma, Khalid Almalki, Christian D. Newman, Mohamed Wiem Mkaouer, Ali Ouni, and Fabio Palomba. 2019. On the Distribution of Test Smells in Open Source Android Applications: An Exploratory Study. In Proceedings of the 29th Annual International Conference on Computer Science and Software Engineering (Toronto, Ontario, Canada) (CASCON ’19). IBM Corp., USA, 193–202.
Anthony Peruma, Khalid Almalki, Christian D. Newman, Mohamed Wiem Mkaouer, Ali Ouni, and Fabio Palomba. 2020. TsDetect: An Open Source Test Smells Detection Tool. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Virtual Event, USA) (ESEC/FSE 2020). Association for Computing Machinery, New York, NY, USA, 1650–1654. DOI: 10.1145/3368089.3417921
Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning To Retrieve Prompts for In-Context Learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz (Eds.). Association for Computational Linguistics, Seattle, United States, 2655–2671. DOI: 10.18653/v1/2022.naacl-main.191
Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, and Vijay Gadepally. 2023. From words to watts: Benchmarking the energy costs of large language model inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–9.
Arkadii Sapozhnikov, Mitchell Olsthoorn, Annibale Panichella, Vladimir Kovalenko, and Pouria Derakhshanfar. 2024. TestSpark: IntelliJ IDEA’s Ultimate Test Generation Companion. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings. 30–34.
Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering 50, 1 (2023), 85–105.
Mohammed Latif Siddiq, Joanna Cecilia Da Silva Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinícius Carvalho Lopes. 2024. Using large language models to generate JUnit tests: An empirical study. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering. 313–322.
Falcon-LLM Team. 2024. The Falcon 3 Family of Open Models. [link]
Gemma Team. 2024. Gemma. DOI: 10.34740/KAGGLE/M/3301
IBM Granite Team. 2024. Granite 3.1: Powerful Performance, Longer Context, and More. IBM Research Journal 47, 4 (2024), 22–29. [link]
Michele Tufano, Dawn Drain, Alexey Svyatkovskiy, Shao Kun Deng, and Neel Sundaresan. 2020. Unit test case generation with transformers and focal context. arXiv preprint arXiv:2009.05617 (2020).
Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software testing with large language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering (2024).
Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Guoyin Wang, Heng Li, Jiangcheng Zhu, Jianqun Chen, et al. 2024. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652 (2024).
Zhiqiang Yuan, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng, and Yiling Lou. 2024. Evaluating and improving chatgpt for unit test generation. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1703–1726.
Andreas Zeller. 2009. Why programs fail: a guide to systematic debugging. Morgan Kaufmann.
Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, et al. 2024. xLAM: A family of large action models to empower AI agent systems. arXiv preprint arXiv:2409.03215 (2024).
Published
22/09/2025
How to Cite
SILVA, Esdras Caleb Oliveira; COELHO, Roberta de Souza; SILVA, Lyrene Fernandes da. LLMs as Test Generators: A Comparative Benchmarking Study. In: SIMPÓSIO BRASILEIRO DE ENGENHARIA DE SOFTWARE (SBES), 39., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 25-36. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.9618.
