Investigating the Performance of Small Language Models in Detecting Test Smells in Manual Test Cases

  • Keila Lucas, UFCG
  • Rohit Gheyi, UFCG
  • Márcio Ribeiro, UFAL
  • Fabio Palomba, University of Salerno
  • Luana Martins, University of Salerno
  • Elvys Soares, IFAL

Abstract

Manual testing, in which testers follow natural language instructions to validate system behavior, remains crucial for uncovering issues not easily captured by automation. However, these test cases often suffer from test smells: quality issues such as ambiguity, redundancy, or missing checks that reduce test reliability and maintainability. While detection tools exist, they typically require manual rule definition and lack scalability. This study investigates the potential of Small Language Models (SLMs) for automatically detecting test smells. We evaluate Gemma3, Llama3.2, and Phi-4 on 143 real-world Ubuntu test cases covering seven types of test smells. Phi-4 achieved the best results, reaching a pass@2 of 97% in detecting sentences with test smells, while Gemma3 and Llama3.2 reached approximately 91%. Beyond detection, the SLMs autonomously explained the identified issues and suggested improvements, even without explicit prompt instructions. They enabled low-cost, concept-driven identification of diverse test smells without relying on extensive rule definitions or syntactic analysis. These findings highlight the potential of SLMs as efficient, privacy-preserving tools for improving test quality in real-world scenarios.
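For concreteness, here is a minimal sketch of how such an evaluation can be wired up. It assumes the SLMs are served locally (e.g., via Ollama's standard /api/generate HTTP endpoint, which keeps test data on-premises) and reads pass@2 as "a smelly sentence counts as detected if at least one of two independent attempts flags it." The prompt wording, the YES/NO answer format, and the model tag are illustrative assumptions, not the authors' exact protocol.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def flags_smell(model: str, sentence: str) -> bool:
    """Ask a locally served SLM whether a manual test sentence is smelly.

    The prompt and the YES/NO convention are illustrative assumptions.
    """
    prompt = (
        "Does the following manual test sentence contain a test smell "
        "(e.g., ambiguity, redundancy, or a missing check)? "
        "Answer YES or NO.\n\n" + sentence
    )
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        answer = json.loads(response.read())["response"]
    return "YES" in answer.upper()


def pass_at_2(model: str, smelly_sentences: list[str]) -> float:
    """pass@2: a sentence is detected if at least one of two attempts flags it."""
    detected = sum(
        1
        for sentence in smelly_sentences
        if flags_smell(model, sentence) or flags_smell(model, sentence)
    )
    return detected / len(smelly_sentences)


if __name__ == "__main__":
    # "phi4" is an assumed Ollama model tag; any locally pulled SLM works.
    sentences = ["Verify that everything works as expected."]
    print(f"pass@2 = {pass_at_2('phi4', sentences):.2f}")
```

Because Python's `or` short-circuits, the second attempt is only issued when the first one misses a smelly sentence, which matches the at-least-one-of-two reading of pass@2 while halving the query cost on sentences the model flags immediately.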
Keywords: Test Smells, Small Language Models, Manual Testing

Published
2025-09-22
LUCAS, Keila; GHEYI, Rohit; RIBEIRO, Márcio; PALOMBA, Fabio; MARTINS, Luana; SOARES, Elvys. Investigating the Performance of Small Language Models in Detecting Test Smells in Manual Test Cases. In: BRAZILIAN SYMPOSIUM ON SOFTWARE ENGINEERING (SBES), 39., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 783-789. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.11572.