Exploring ChatGPT Efficiency in Automatic Test Generation for Python: A Comparative Analysis

Abstract


Context: Large language models (LLMs) like ChatGPT have gained attention in automated software testing. This study evaluates ChatGPT-3.5-turbo’s ability to generate test sets for Python programs, comparing it with Pynguin and pre-existing test sets. Problem: Automated testing remains challenging for dynamically typed languages like Python, requiring adaptable tools for diverse code structures. Solution: We assessed ChatGPT-3.5-turbo’s test generation using different prompt configurations and temperature settings. Method: Using 40 Python programs, we generated Pytest-compliant tests via the OpenAI API, varying the temperature setting from 0.0 to 1.0. Tests were validated with Pytest, and coverage and mutation scores were measured with Coverage, MutPy, and Cosmic-Ray. Pynguin-generated and pre-existing test sets served as baselines. Summary of Results: ChatGPT-3.5-turbo generated valid tests for simpler programs, but its results averaged below 28% overall, at a low cost. Higher temperatures (0.5–1.0) improved results, and combining test cases from all temperatures introduced diversity into the LLM-generated test sets, enabling them to surpass both Pynguin and the pre-existing test sets in decision coverage and mutation score.
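The generation step described in the Method can be pictured with the minimal sketch below. It is not the authors’ actual pipeline: the prompt wording, the 0.25 temperature step, the model name string, and the file layout are illustrative assumptions. It only shows how one might request Pytest-compliant tests from gpt-3.5-turbo through the official OpenAI Python client while sweeping the temperature from 0.0 to 1.0.

# Hypothetical sketch of the LLM test-generation step (not the paper's exact setup).
from pathlib import Path

from openai import OpenAI  # official OpenAI Python client (openai >= 1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prompt wording is an assumption for illustration only.
PROMPT = (
    "Generate a Pytest test suite for the following Python module. "
    "Return only runnable Python code.\n\n{source}"
)

def generate_tests(module_path: str, temperature: float) -> str:
    """Request one LLM-generated test module at the given sampling temperature."""
    source = Path(module_path).read_text()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=temperature,
        messages=[{"role": "user", "content": PROMPT.format(source=source)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    # One candidate test set per temperature; the study also combines the
    # sets from all temperatures into a single, more diverse test set.
    for temp in (0.0, 0.25, 0.5, 0.75, 1.0):
        tests = generate_tests("benchmark/example.py", temperature=temp)
        Path(f"generated/test_example_t{int(temp * 100):03d}.py").write_text(tests)

Each generated file would then be validated by running Pytest, its coverage measured with Coverage (for example, coverage run -m pytest followed by coverage report), and its mutation score obtained with MutPy and Cosmic-Ray, as stated in the abstract.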

Keywords: software testing, experimental software engineering, automated test generation, large language models, coverage testing, mutation testing, testing tool

References

J. H. Andrews, L. C. Briand, and Y. Labiche. 2005. Is mutation an appropriate tool for testing experiments?. In XXVII International Conference on Software Engineering – ICSE’05. ACM Press, St. Louis, MO, USA, 402–411. DOI: 10.1145/1062455.1062530

Filipe Santos Araujo and Auri Marcelo Rizzo Vincenzi. 2020. How far are we from testing a program in a completely automated way, considering the mutation testing criterion at unit level?. In Anais do Simpósio Brasileiro de Qualidade de Software (SBQS). SBC, São Luís, MA, 151–159. DOI: 10.1145/3439961.3439977

Austin Bingham. 2021. Cosmic Ray: mutation testing for Python. (Nov. 2021). [link]

R. A. DeMillo, R. J. Lipton, and F. G. Sayward. 1978. Hints on Test Data Selection: Help for the Practicing Programmer. IEEE Computer 11, 4 (April 1978), 34–43. DOI: 10.1109/C-M.1978.218136

Stéphane Ducasse, Manuel Oriol, and Alexandre Bergel. 2011. Challenges to support automated random testing for dynamically typed languages. In Proceedings of the International Workshop on Smalltalk Technologies (IWST ’11). Association for Computing Machinery, New York, NY, USA, 1–6. DOI: 10.1145/2166929.2166938

Lucca Renato Guerino, Pedro Henrique Kuroishi, Ana Cristina Ramada Paiva, and Auri Marcelo Rizzo Vincenzi. 2024. Static and Dynamic Comparison of Mutation Testing Tools for Python. In Proceedings of the XXIII Brazilian Symposium on Software Quality (SBQS ’24). Association for Computing Machinery, New York, NY, USA, 199–209. DOI: 10.1145/3701625.3701659

Lucca Renato Guerino and Auri M. R. Vincenzi. 2023. An Experimental Study Evaluating Cost, Adequacy, and Effectiveness of Pynguin’s Test Sets. In 8th Brazilian Symposium on Systematic and Automated Software Testing – SAST’2023. ACM Press, Campo Grande, MS, 5–14. DOI: 10.1145/3624032.3624034

Philipp Hossner, Konrad Hałas, Steven Myint, and Andreas Mueller. 2021. MutPy: a mutation testing tool for Python 3.x source code. (Nov. 2021). [link]

Stephan Lukasczyk. 2019. Generating Tests to Analyse Dynamically-Typed Programs. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, San Diego, CA, USA, 1226–1229. DOI: 10.1109/ASE.2019.00146

Stephan Lukasczyk, Florian Kroiß, and Gordon Fraser. 2023. An empirical study of automated unit test generation for Python. Empirical Software Engineering 28, 2 (Jan. 2023), 36. DOI: 10.1007/s10664-022-10248-w

Goran Petrović, Marko Ivanković, Gordon Fraser, and René Just. 2021. Does Mutation Testing Improve Testing Practices?. In Proceedings of the 43rd International Conference on Software Engineering (ICSE ’21). IEEE Press, Madrid, Spain, 910–921. DOI: 10.1109/ICSE43902.2021.00087

June Sallou, Thomas Durieux, and Annibale Panichella. 2024. Breaking the Silence: the Threats of Using LLMs in Software Engineering. In Proceedings of the 2024 ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER’24). Association for Computing Machinery, New York, NY, USA, 102–106. DOI: 10.1145/3639476.3639764

Douglas C. Schmidt. 2025. Software Testing in the Generative AI Era: A Practitioner’s Playbook. Computer 58, 7 (2025), 147–152. DOI: 10.1109/MC.2025.3562940

TIOBE Software BV. 2024. TIOBE Index. [link]

Auri M. R. Vincenzi, Tiago Bachiega, Daniel G. de Oliveira, Simone R. S. de Souza, and José C. Maldonado. 2016. The Complementary Aspect of Automatically and Manually Generated Test Case Sets. In Proceedings of the 7th International Workshop on Automating Test Case Design, Selection, and Evaluation (A-TEST 2016, Vol. 1). ACM, Seattle, WA, USA, 23–30. DOI: 10.1145/2994291.2994295

Sebastian Vogl, Sebastian Schweikl, Gordon Fraser, Andrea Arcuri, Jose Campos, and Annibale Panichella. 2021. EvoSuite at the SBST 2021 Tool Competition. In 2021 IEEE/ACM 14th International Workshop on Search-Based Software Testing (SBST). IEEE, Madrid, Spain, 28–29. DOI: 10.1109/SBST52555.2021.00012

Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2024. Software Testing With Large Language Models: Survey, Landscape, and Vision. IEEE Trans. Softw. Eng. 50, 4 (Feb. 2024), 911–936. DOI: 10.1109/TSE.2024.3368208

Quanjun Zhang, Weifeng Sun, Chunrong Fang, Bowen Yu, Hongyan Li, Meng Yan, Jianyi Zhou, and Zhenyu Chen. 2024. Exploring Automated Assertion Generation via Large Language Models. ACM Trans. Softw. Eng. Methodol. 34, 13 (Oct. 2024), 25. DOI: 10.1145/3699598
Published
Nov. 4, 2025
GUERINO, Lucca Renato; VINCENZI, Auri Marcelo Rizzo. Exploring ChatGPT Efficiency in Automatic Test Generation for Python: A Comparative Analysis. In: SIMPÓSIO BRASILEIRO DE QUALIDADE DE SOFTWARE (SBQS), 24., 2025, São José dos Campos/SP. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 88-98. DOI: https://doi.org/10.5753/sbqs.2025.14558.