On the Energy Footprint of Using a Small Language Model for Unit Test Generation

Rafael S. Durelli; Andre T. Endo; Vinicius H. S. Durelli

doi:10.5753/sast.2025.14036

Rafael S. Durelli UFLA http://orcid.org/0000-0002-6343-7715
Andre T. Endo UFSCar https://orcid.org/0000-0002-8737-1749
Vinicius H. S. Durelli UFSCar https://orcid.org/0000-0002-5768-1850

Abstract

Context. Manual unit test creation is a cognitively intensive and time-consuming activity, prompting researchers and practitioners to increasingly adopt automated testing tools. Recent advances in language models have expanded automation possibilities, including unit test generation, yet these models raise substantial sustainability concerns because of their energy consumption relative to conventional, specialized tools. Goal. We investigate whether the energy overhead of employing a small language model (SLM) for unit test generation is justified compared to a conventional, lightweight testing tool. We compare the energy consumed during test suite generation, as well as the fault-finding effectiveness of the resulting test suites, for an SLM (Phi-3.1 Mini 128k) and Pynguin, a purpose-built tool for unit test generation. Method. We posed two research questions: (i) What is the difference in energy usage between Phi and Pynguin during the generation of unit test suites for Python programs? and (ii) To what extent do unit test suites generated by Phi and Pynguin differ in their fault-finding effectiveness? To address the first research question, we employed Bayesian Data Analysis (BDA). For the second, we conducted a complementary empirical analysis using descriptive statistics. Results. Our Bayesian analysis provides robust evidence that Phi consistently consumes significantly more energy than Pynguin during test suite generation. Conclusions. These findings underscore significant sustainability concerns associated with employing even SLMs for routine Software Engineering tasks such as unit test generation. The results challenge the assumption that smaller-scale models are universally energy efficient and emphasize the need for careful energy consumption evaluations when adopting automated software testing approaches.

Keywords: Unit test generation, energy consumption, language model
