Automated Generation of Exploratory Test Cases Using Prompt Chaining and Reflective Evaluation

Igor Lima; André Carvalho; Yan Soares; Gabriel Pacheco; Andrés Peralta; Hallyson Melo; Nikson Ferreira; Bruno Costa

doi:10.5753/sbes.2025.11599

Igor Lima, UFAM
André Carvalho, UFAM
Yan Soares, UFAM
Gabriel Pacheco, UFAM
Andrés Peralta, UFAM
Hallyson Melo, INDT
Nikson Ferreira, INDT
Bruno Costa, INDT

Abstract

Exploratory testing uncovers latent defects, but designing Exploratory Test Cases (ETCs) is largely manual, which incurs high costs, depends on expert knowledge, and limits reproducibility. We propose IGEX, a fully automated approach that generates ETCs using Large Language Models (LLMs). IGEX models test generation as a structured chain of prompts, leveraging Chain-of-Thought reasoning and in-context learning. To ensure quality, a Reflective Evaluator scores each ETC against expert criteria and triggers refinements as needed. In line with Whittaker’s hybrid ETC definition, our method combines structured scripting with tester-driven exploratory detours, enabling automation. In experiments with 300 Android test scenarios, IGEX achieved 97.67% accuracy before reflection and 100% after iterative evaluation. Although the approach is dataset-agnostic and extensible, current validation is limited to mobile applications. Furthermore, IGEX inherits the challenges of LLMs, including evaluation subjectivity and computational costs. These results demonstrate the potential of LLMs for scalable exploratory testing with reduced manual overhead.

Keywords: Exploratory Test Cases, Large Language Models, Prompt Chaining, Automated Test Generation, Test Automation, Reflective Evaluation Mechanism
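
The abstract describes a chain of prompts followed by an evaluator that scores each test case and triggers refinement. The control flow of such a pipeline can be sketched as below. This is only an illustrative sketch, not the IGEX implementation: the stage prompts, the refinement prompt, the 0-to-1 score scale, the threshold, the round limit, and the stub model and stub evaluator are all hypothetical, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical interface: any callable mapping a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class Evaluation:
    score: float   # assumed scale: 0.0 (reject) to 1.0 (accept)
    feedback: str  # text passed back to the model when refining


# Hypothetical stage prompts; the real prompts are not given in the abstract.
STAGES = [
    "Think step by step and list the goals of this scenario:\n{input}",
    "Write scripted test steps for these goals:\n{input}",
]
REFINE = "Refine this test case.\nFeedback: {feedback}\nETC:\n{etc}"


def run_chain(llm: LLM, scenario: str) -> str:
    """Prompt chaining: each stage's output becomes the next stage's input."""
    text = scenario
    for template in STAGES:
        text = llm(template.format(input=text))
    return text


def generate_etc(
    llm: LLM,
    evaluate: Callable[[str], Evaluation],
    scenario: str,
    threshold: float = 0.8,  # assumed acceptance threshold
    max_rounds: int = 3,     # assumed cap on refinement rounds
) -> Tuple[str, int]:
    """Run the chain, then refine until the evaluator accepts or rounds run out.

    Returns the final test case and the number of refinements performed.
    The result of the last refinement is not re-scored once the cap is hit.
    """
    etc = run_chain(llm, scenario)
    rounds = 0
    while rounds < max_rounds:
        result = evaluate(etc)
        if result.score >= threshold:
            break
        etc = llm(REFINE.format(feedback=result.feedback, etc=etc))
        rounds += 1
    return etc, rounds


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model call, for demonstration only."""
    if prompt.startswith("Refine"):
        etc = prompt.split("ETC:\n", 1)[1]
        return etc + "\nDetour: try an unexpected input."
    body = prompt.split("\n", 1)[1]
    return "[" + body + "]"


def stub_evaluator(etc: str) -> Evaluation:
    """Toy criterion: accept only test cases that contain a detour."""
    if "Detour:" in etc:
        return Evaluation(1.0, "ok")
    return Evaluation(0.0, "Add an exploratory detour.")
```

Calling `generate_etc(stub_llm, stub_evaluator, "Open settings")` runs both chain stages, fails the first evaluation, and performs one refinement before the toy evaluator accepts the result. In a real setting, `stub_llm` and `stub_evaluator` would be replaced by calls to an actual model and to an evaluator built on expert criteria.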

References

W. R. Adrion, M. A. Branstad, and J. C. Cherniavsky. 1982. Validation, Verification, and Testing of Computer Software. Comput. Surveys 14, 2 (June 1982), 159–192. DOI: 10.1145/356876.356879

Wasif Afzal, Ahmad Nauman Ghazi, Juha Itkonen, Richard Torkar, Anneliese Andrews, and Khurram Bhatti. 2014. An Experiment on the Effectiveness and Efficiency of Exploratory Testing. Empirical Software Engineering 20 (06 2014). DOI: 10.1007/s10664-014-9301-4

Meta AI. 2024. Llama 3 Technical Report. [link]. Accessed: 2024-05-30.

F. Asplund. 2018. Exploratory testing: Do contextual factors influence software fault identification? Ph.D. Dissertation. KTH Royal Institute of Technology, Stockholm, Sweden.

J. Bach. 2000. Session-Based Test Management. Software Testing and Quality Engineering (STQE), vol. 2, no. 6.

J. Bach. 2004. Exploratory Testing. 253–265 pages. Chapter in edited book.

K. Brush. 2024. Test Case Definition. [link] Accessed: Oct. 31, 2024.

I. E. F. Costa, A. C. dos Santos, F. A. B. Silva, and E. B. Araujo. 2023. Using Active Methodologies for Teaching and Learning of Exploratory Test Design and Execution. Education Sciences, vol. 13, art. no. 115. Retrieved May 12, 2025 from DOI: 10.3390/educsci13020115

R. D. Craig and S. P. Jaskiel. 2002. Systematic Software Testing. Artech House Publishers, Boston.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia, Jingjing Xu, Zhiyong Wu, Tianyu Liu, Baobao Chang, Xu Sun, Lei Li, and Zhifang Sui. 2024. A Survey on In-context Learning. arXiv:2301.00234 [cs.CL] [link]

M. Ellims, J. Bridges, and D. C. Ince. 2006. The Economics of Unit Testing. Empirical Software Engineering, vol. 11, no. 1, pp. 5–31. Retrieved May 12, 2025 from DOI: 10.1007/s10664-006-7585-6

T. Ericson, A. Subotic, and S. Ursing. 1997. Tim: A Test Improvement Model. Software Testing, Verification and Reliability, vol. 7, no. 4, pp. 229–246. Retrieved May 12, 2025 from DOI: 10.1002/(SICI)1099-1689(199712)7:4<229::AIDSTVR148>3.0.CO;2-V

B. Hailpern and P. Santhanam. 2002. Software Debugging, Testing, and Verification. IBM Systems Journal, vol. 41, no. 1, pp. 4–12. Retrieved May 12, 2025 from DOI: 10.1147/sj.411.0004

J. Itkonen and M. V. Mäntylä. 2014. Are test cases needed? Replicated comparison between exploratory and test-case-based software testing. Empirical Software Engineering 19 (2014), 303–342. DOI: 10.1007/s10664-013-9266-8

J. Itkonen and K. Rautiainen. 2005. Exploratory testing: A multiple case study. In Proc. 2005 Int. Symp. Empirical Software Engineering (ISESE). 1–10. DOI: 10.1109/ISESE.2005.1541817

M. Keployio. 2024. Exploring test case generators: Revolutionizing software testing. [link]. Accessed: May 21, 2024.

W. Makondo et al. 2016. Exploratory Test Oracle Using MLP Neural Networks. In Proceedings of the International Conference on Advances in Computing, Communications and Informatics (ICACCI).

Y. Nishi and Y. Shibasaki. 2021. Boosted Exploratory Test Architecture. In IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW).

H. H. Olsson, H. Alahyari, and J. Bosch. 2012. Climbing the ’stairway to heaven’ – a multiple-case study exploring barriers in the transition from agile development towards continuous deployment of software. In Proc. 38th EUROMICRO Conf. Software Engineering and Advanced Applications (SEAA). 392–399.

A. M. Sami et al. 2024. A Tool for Test Case Scenarios Generation Using LLMs. arXiv preprint arXiv:2406.07021 (2024). [link]

J. Sternerson and M. Firoozfam. 2010. Statistical Testing using Automated Randomization. Technical report or workshop paper, details unavailable.

Y. Su et al. 2024. Enhancing Exploratory Testing by Large Language Model and Knowledge Graph. In Proceedings of the 46th International Conference on Software Engineering (ICSE).

A. Tinkham and C. Kaner. 2003. Exploring Exploratory Testing. [link]. Accessed: Nov. 4, 2024.

J. Wei, X. Wang, D. Schuurmans, and Q. Le. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv preprint arXiv:2201.11903. Retrieved May 12, 2025 from [link]

J. A. Whittaker. 2011. Exploratory Software Testing: Tips, Tricks, Tours, and Techniques to Guide Test Design. Retrieved May 12, 2025 from [link]

T. Wu, L. Wang, X. Liu, Z. Hu, and M. Sun. 2022. PromptChainer: Chaining Large Language Model Prompts through Visual Programming. arXiv preprint arXiv:2203.06566. Retrieved May 12, 2025 from [link]

D. Xu. 2015. An Automated Test Generation Technique for Software QA. Master’s thesis. Boise State University.

W. X. Zhao, J. Li, Y. He, X. Yan, M. Zhou, and H. Chen. 2024. A Survey of Large Language Models. arXiv preprint arXiv:2303.18223. Retrieved May 12, 2025 from [link]