Improving JavaScript Test Quality with Large Language Models: Lessons from Test Smell Refactoring

  • Gabriel Amaral UEFS
  • Henrique Gomes UFMG
  • Eduardo Figueiredo UFMG
  • Carla Bezerra UFC
  • Larissa Rocha UNEB / UEFS

Abstract


Test smells—poor design choices in test code—can hinder test maintainability, clarity, and reliability. Prior studies have proposed rule-based detection tools and manual refactoring strategies, but most focus on statically typed languages such as Java. In this paper, we investigate the potential of Large Language Models (LLMs) to automatically refactor test smells in JavaScript, a dynamically typed and widely used language with limited prior research in this area. We conducted an empirical study using GitHub Copilot Chat and Amazon CodeWhisperer to refactor 148 test smell instances across 10 real-world JavaScript projects. Our evaluation assessed smell removal effectiveness, behavioral preservation, introduction of new smells, and structural code quality based on six software metrics. Results show that Copilot removed 58.78% of the smells successfully, outperforming CodeWhisperer's 47.30%, while both tools preserved test behavior in most cases. However, both also introduced new smells, highlighting current limitations. Our findings reveal the strengths and trade-offs of LLM-based refactoring and provide insights for building more reliable and smell-aware testing tools for JavaScript.
Keywords: Test Smells, Large Language Models (LLMs), JavaScript
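
To make the notion of a test smell concrete, the snippet below sketches one common instance, Assertion Roulette, in a hypothetical Jest test and one way it could be refactored; the module, function, and test names are illustrative only and are not taken from the study's dataset or from the tools' output.

    // Hypothetical Jest test exhibiting Assertion Roulette: several unexplained
    // assertions in a single test make it hard to tell which expectation failed.
    const { parsePrice } = require('./price'); // illustrative module, not from the study

    test('parsePrice', () => {
      expect(parsePrice('$10.50')).toBe(10.5);
      expect(parsePrice('')).toBeNull();
      expect(parsePrice('abc')).toBeNull();
    });

    // One possible refactoring: split the test into focused cases with descriptive
    // names, so a failure immediately points to the offending behavior.
    describe('parsePrice', () => {
      test('parses a dollar-prefixed amount', () => {
        expect(parsePrice('$10.50')).toBe(10.5);
      });
      test('returns null for an empty string', () => {
        expect(parsePrice('')).toBeNull();
      });
      test('returns null for non-numeric input', () => {
        expect(parsePrice('abc')).toBeNull();
      });
    });

Refactorings of this kind are what the study asked Copilot Chat and CodeWhisperer to perform automatically, and what the evaluation checks for smell removal, behavior preservation, and newly introduced smells.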

Published
22/09/2025
AMARAL, Gabriel; GOMES, Henrique; FIGUEIREDO, Eduardo; BEZERRA, Carla; ROCHA, Larissa. Improving JavaScript Test Quality with Large Language Models: Lessons from Test Smell Refactoring. In: SIMPÓSIO BRASILEIRO DE ENGENHARIA DE SOFTWARE (SBES), 39., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 776-782. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.11568.