Detecting Test Smells in Python Test Code Generated by LLM: An Empirical Study with GitHub Copilot

  • Victor Anthony Alves (UFC)
  • Cristiano Santos (UFBA)
  • Carla Bezerra (UFC)
  • Ivan Machado (UFBA)

Abstract


Writing unit tests is a time-consuming and labor-intensive development practice. Consequently, various techniques for automatically generating unit tests have been studied, and the use of Large Language Models (LLMs) has recently emerged as a popular approach for generating tests from natural language descriptions. Although many recent studies measure the ability of LLMs to write valid unit tests, few evaluate the quality of the generated tests. In this context, this study aims to measure the quality of the Python test code generated by GitHub Copilot by detecting test smells in the generated test cases. To this end, we applied LLM-based unit test generation approaches already reported in the literature and collected a sample of 194 unit test cases across 30 Python test files. We then analyzed them with tools specialized in detecting test smells in Python code. Finally, we evaluated these test cases with software developers and software quality assurance professionals. Our results indicate that 47.4% of the tests generated by Copilot contained at least one test smell, with lack of documentation in assertions being the most common quality problem. These findings show that although GitHub Copilot can generate valid unit tests, quality violations are still frequent in the generated code.
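
The most common problem reported above, lack of documentation in assertions, corresponds to the Assertion Roulette test smell: a test method containing several assertions with no explanatory message, so a failure report does not indicate which check actually broke. As a minimal sketch (the slugify function and both tests are our own illustration, not tests generated in the study), the first test below exhibits the smell and the second removes it by passing msg= to each assertion:

    import unittest


    def slugify(title: str) -> str:
        """Toy function under test (hypothetical example)."""
        return title.strip().lower().replace(" ", "-")


    class TestSlugify(unittest.TestCase):
        def test_slugify_smelly(self):
            # Assertion Roulette: multiple undocumented assertions.
            # If one fails, the report alone does not say which
            # property of slugify() was violated.
            self.assertEqual(slugify("Hello World"), "hello-world")
            self.assertEqual(slugify("  Trim Me  "), "trim-me")
            self.assertEqual(slugify("UPPER"), "upper")

        def test_slugify_documented(self):
            # Refactored: each assertion carries a msg=, so the smell
            # is gone while the checks themselves are unchanged.
            self.assertEqual(slugify("Hello World"), "hello-world",
                             msg="spaces should become hyphens")
            self.assertEqual(slugify("  Trim Me  "), "trim-me",
                             msg="surrounding whitespace should be stripped")
            self.assertEqual(slugify("UPPER"), "upper",
                             msg="output should be lower-cased")


    if __name__ == "__main__":
        unittest.main()

The refactored version changes no test logic; it only attaches a message to each assertion so that a failing run identifies the violated property directly.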

Keywords: Test Smells, Large Language Models, Python Test Code

Published
30/09/2024
ALVES, Victor Anthony; SANTOS, Cristiano; BEZERRA, Carla; MACHADO, Ivan. Detecting Test Smells in Python Test Code Generated by LLM: An Empirical Study with GitHub Copilot. In: SIMPÓSIO BRASILEIRO DE ENGENHARIA DE SOFTWARE (SBES), 38., 2024, Curitiba/PR. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 578-584. DOI: https://doi.org/10.5753/sbes.2024.3561.