Evaluating the Effectiveness and Cost-Efficiency of Large Language Models in Automated Unit Test Generation

Abstract

Software quality plays a crucial role in the development process, and one of the main strategies to ensure it is software testing. However, testing can be costly and time-consuming, leading researchers to explore ways to automate it. This study analyzes the ability of large language models (LLMs) to generate unit tests automatically, evaluating their performance on four criteria: line coverage, branch coverage, hit rate, and cost. To this end, unit tests were generated using three different input formats: (i) source code only, (ii) source code with a docstring, and (iii) a detailed prompt with step-by-step instructions. The results show that all models produce tests with high line and branch coverage; hit rates, however, vary with the input format. The lowest hit rate was 72.61%, obtained by GPT-4o Mini when provided with only the source code. The highest was 90.69%, achieved by Gemini 1.5 Pro when given source code with a docstring, close to the 96.81% hit rate of the dataset's reference tests. The findings indicate that more affordable models, such as Gemini 1.5 Flash and GPT-4o Mini, can achieve competitive results when optimized prompts are used, making them cost-effective alternatives for unit test generation.
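As a rough illustration of the methodology, the sketch below builds the three input formats for a small Python function and sends one of them to a chat model through the OpenAI Python client. The example function, helper names, and prompt wording are illustrative assumptions, not the exact prompts used in the study.

# Minimal sketch of the three input formats described in the abstract.
# The prompt wording and helper names are illustrative assumptions,
# not the exact prompts used by the authors.
from openai import OpenAI

SOURCE = '''\
def clamp(value, low, high):
    return max(low, min(value, high))
'''

DOCSTRING = '"""Clamp value to the inclusive range [low, high]."""'

def prompt_code_only(src: str) -> str:
    # (i) source code only
    return f"Write pytest unit tests for the following function:\n\n{src}"

def prompt_with_docstring(src: str, doc: str) -> str:
    # (ii) source code with a docstring inserted after the signature
    lines = src.splitlines()
    annotated = "\n".join([lines[0], f"    {doc}", *lines[1:]])
    return f"Write pytest unit tests for the following function:\n\n{annotated}"

def prompt_detailed(src: str) -> str:
    # (iii) detailed prompt with step-by-step instructions
    return (
        "You are an expert Python tester. Follow these steps:\n"
        "1. Read the function and identify its branches and edge cases.\n"
        "2. Write one pytest test per behavior, covering boundary values.\n"
        "3. Use descriptive test names and plain assert statements.\n\n"
        f"{src}"
    )

if __name__ == "__main__":
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt_detailed(SOURCE)}],
    )
    print(response.choices[0].message.content)

The line and branch coverage of the generated tests could then be measured with coverage.py, for example by running "coverage run --branch -m pytest" followed by "coverage report".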

Keywords: Software Testing, Automatic Software Testing, Large Language Models, LLM, ChatGPT, Gemini, Artificial Intelligence

Published
November 4, 2025
LIRA, Werney Ayala Luz; SANTOS NETO, Pedro de Alcântara dos; AVELINO, Guilherme Amaral; OSÓRIO, Luiz Fernando Mendes. Evaluating the Effectiveness and Cost-Efficiency of Large Language Models in Automated Unit Test Generation. In: SIMPÓSIO BRASILEIRO DE QUALIDADE DE SOFTWARE (SBQS), 24., 2025, São José dos Campos/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 108-119. DOI: https://doi.org/10.5753/sbqs.2025.13853.