Evaluating the Use of LLMs to Generate Test Cases from User Stories: An Experimental Study in an Educational Setting with Test Smell Analysis

Juliana B. Lima (UEA); Márcia Sampaio Lima (UEA)

DOI: https://doi.org/10.5753/wei.2026.21084

Abstract

Introduction & objective: This study investigates the educational use of LLMs in software testing, focusing on the generation of test cases (TCs) from User Stories and on the analysis of their quality through test smells. Method: In a study with 25 students, TCs written manually were compared with TCs generated with ChatGPT assistance, and test smells were then identified in both sets. Results: Most of the ChatGPT-generated TCs were considered useful (83%), but only 31% were considered new from the students' perspective. Manually written TCs showed a higher prevalence of smells such as "Generic Expected Result" (44% vs. 0%). Participants rated the experience positively for learning and raised concerns about dependence on the tool. The study provides experimental evidence on the use of LLMs in software testing education.

Keywords: Test Cases, Generative AI, Computing Education
