An Empirical Study on the Detection of Test Smells in Test Codes Generated by GitHub Copilot

  • Victor Anthony Alves UFC
  • Carla Bezerra UFC
  • Ivan Machado UFBA

Abstract


Various techniques for automatically generating unit tests have been studied. The use of Large Language Models (LLMs) has recently emerged as a popular approach for automatic test generation from natural language descriptions. This study aims to measure the quality of the test code produced by LLMs by detecting test smells in the generated test cases. To this end, we propose an empirical study and a quality assessment methodology that can be applied to any code-generating LLM. In our preliminary results, we applied these procedures to GitHub Copilot and obtained significant data on the quality of its test code. These findings indicate that although GitHub Copilot can generate valid unit tests, quality violations are still frequently found in the generated code.
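To make the kind of quality violation we detect concrete, the minimal sketch below shows a Python unit test of the sort an LLM such as GitHub Copilot might generate. It exhibits two smells commonly reported by Python detectors such as PyNose and Pytest-smell: Assertion Roulette (several assertions in one test, none with an explanatory message) and Magic Number Test (unexplained numeric literals). The function under test, discount, and all other names here are hypothetical, invented purely for illustration and not taken from the study.

    import unittest

    def discount(price, rate):
        # Hypothetical function under test, invented for this example.
        return price - price * rate

    class TestDiscount(unittest.TestCase):
        def test_discount(self):
            # Assertion Roulette: multiple assertions without messages,
            # so it is hard to tell which one caused a failure.
            # Magic Number Test: unexplained literals such as 200 and 0.5.
            self.assertEqual(discount(100, 0.25), 75.0)
            self.assertEqual(discount(200, 0.5), 100.0)
            self.assertEqual(discount(0, 0.3), 0.0)

    if __name__ == "__main__":
        unittest.main()

A smell-free version would split each case into its own descriptively named test (or parameterize it) and attach an assertion message documenting the expected behavior.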

References

Beller, M., Gousios, G., Panichella, A., and Zaidman, A. (2015). When, how, and why developers (do not) test in their IDEs. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, page 179–190, New York, NY, USA. Association for Computing Machinery.

Bodea, A. (2022). Pytest-smell: a smell detection tool for Python unit tests. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2022, page 793–796, New York, NY, USA. Association for Computing Machinery.

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. (2021). Evaluating large language models trained on code. CoRR, abs/2107.03374.

Daka, E., Campos, J., Fraser, G., Dorn, J., and Weimer, W. (2015). Modeling readability to improve unit tests. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2015, page 107–118, New York, NY, USA. Association for Computing Machinery.

El Haji, K., Brandt, C., and Zaidman, A. (2024). Using GitHub Copilot for test generation in Python: An empirical study. In AST '24, page 11, New York, NY, USA. Association for Computing Machinery.

Fernandes, D., Machado, I., and Maciel, R. (2022). TEMPY: Test smell detector for Python. In Proceedings of the XXXVI Brazilian Symposium on Software Engineering, SBES '22, page 214–219, New York, NY, USA. Association for Computing Machinery.

Graham, D., Black, R., and van Veenendaal, E. (2021). Foundations of Software Testing: ISTQB Certification, 4th edition. Cengage Learning.

Hansson, E. and Ellréus, O. (2023). Code correctness and quality in the era of AI code generation: Examining ChatGPT and GitHub Copilot.

Khorikov, V. (2020). Unit Testing Principles, Practices, and Patterns: Effective testing styles, patterns, and reliable automation for unit testing, mocking, and integration testing with examples in C#. Manning.

Kim, D. J. (2020). An empirical study on the evolution of test smell. In 2020 IEEE/ACM 42nd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), pages 149–151.

Kim, D. J., Chen, T.-H., and Yang, J. (2021). The secret life of test smells: an empirical study on test smell evolution and maintenance. Empirical Software Engineering, 26.

Li, C. (2022). Mobile GUI test script generation from natural language descriptions using pre-trained model. In MOBILESoft '22, page 112–113, New York, NY, USA. Association for Computing Machinery.

Martino, A., Iannelli, M., and Truong, C. (2023). Knowledge injection to counter large language model (llm) hallucination. In Pesquita, C., Skaf-Molli, H., Efthymiou, V., Kirrane, S., Ngonga, A., Collarana, D., Cerqueira, R., Alam, M., Trojahn, C., and Hertling, S., editors, The Semantic Web: ESWC 2023 Satellite Events, pages 182–185, Cham. Springer Nature Switzerland.

Palomba, F., Zaidman, A., and De Lucia, A. (2018). Automatic test smell detection using information retrieval techniques. In 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 311–322.

Peng, Z., Lin, X., Simon, M., and Niu, N. (2021). Unit and regression tests of scientific software: A study on swmm. Journal of Computational Science, 53:101347.

Runeson, P. (2006). A survey of unit testing practices. IEEE Software, 23(4):22–29.

Santana, R., Martins, L., Rocha, L., Virgínio, T., Cruz, A., Costa, H., and Machado, I. (2020). RAIDE: a tool for Assertion Roulette and Duplicate Assert identification and refactoring. In Proceedings of the XXXIV Brazilian Symposium on Software Engineering, pages 374–379.

Schäfer, M., Nadi, S., Eghbali, A., and Tip, F. (2024). An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 50(1):85–105.

Serra, D., Grano, G., Palomba, F., Ferrucci, F., Gall, H. C., and Bacchelli, A. (2019). On the effectiveness of manual and automatic unit test generation: Ten years later. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pages 121–125.

Siddiq, M. L., Santos, J. C. S., Tanvir, R. H., Ulfat, N., Rifat, F. A., and Lopes, V. C. (2024). Using large language models to generate JUnit tests: An empirical study.

Tufano, M., Drain, D., Svyatkovskiy, A., Deng, S. K., and Sundaresan, N. (2021). Unit test case generation with transformers and focal context.

Tufano, M., Palomba, F., Bavota, G., Di Penta, M., Oliveto, R., De Lucia, A., and Poshyvanyk, D. (2016). An empirical investigation into the nature of test smells. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE ’16, page 4–15, New York, NY, USA. Association for Computing Machinery.

van Deursen, A., Moonen, L., van den Bergh, A., and Kok, G. (2001). Refactoring test code. In Proceedings of the 2nd International Conference on Extreme Programming and Flexible Processes in Software Engineering (XP2001).

Wang, T., Golubev, Y., Smirnov, O., Li, J., Bryksin, T., and Ahmed, I. (2022). PyNose: a test smell detector for Python. In ASE '21, page 593–605. IEEE Press.

Xie, T. and Notkin, D. (2006). Tool-assisted unit test generation and selection based on operational abstractions. Automated Software Engineering Journal, 13(3):345–371.

Yetistiren, B., Ozsoy, I., and Tuzun, E. (2022). Assessing the quality of GitHub Copilot's code generation. In PROMISE 2022, page 62–71, New York, NY, USA. Association for Computing Machinery.

Yetiştiren, B., Özsoy, I., Ayerdem, M., and Tüzün, E. (2023). Evaluating the code quality of AI-assisted code generation tools: An empirical study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT.

Yu, S., Fang, C., Ling, Y., Wu, C., and Chen, Z. (2023). LLM for test script generation and migration: Challenges, capabilities, and opportunities, pages 206–217.
Published
30/09/2024
ALVES, Victor Anthony; BEZERRA, Carla; MACHADO, Ivan. An Empirical Study on the Detection of Test Smells in Test Codes Generated by GitHub Copilot. In: CONCURSO DE TRABALHOS DE INICIAÇÃO CIENTÍFICA - CONGRESSO BRASILEIRO DE SOFTWARE: TEORIA E PRÁTICA (CBSOFT), 15., 2024, Curitiba/PR. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 69-78. DOI: https://doi.org/10.5753/cbsoft_estendido.2024.4102.