ARTeMIS: Agent-based Rewriting and Test Case Management with Intelligent Supervision
Abstract
The quality of Test Case (TC) scripts is essential for the execution and automation of test scenarios on mobile devices. In practice, TC components are often out of date, poorly written, or inconsistent with documentation standards. In this work, we propose ARTeMIS, a modular Multi-Agent framework that automates the rewriting and validation of non-standardized TC components by integrating semantic retrieval, supervised classification, structured prompting, and iterative rule-based validation with Large Language Models (LLMs). Our framework standardizes three TC components: Summary, Initial Setup, and Test Steps. ARTeMIS assigns each component to specialized agents for classification, rewriting, and validation, ensuring syntactic consistency and semantic accuracy with minimal human intervention. We evaluated ARTeMIS against three LLM-based techniques well established in the literature: Zero-Shot, Few-Shot, and Retrieval-Augmented Generation (RAG). Our experiments demonstrate the feasibility and extensibility of our approach, which achieved higher accuracy than the competing techniques, highlighting the potential of agent-based architectures for standardizing and automating TCs in continuous industrial testing environments.
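To make the described pipeline concrete, the sketch below illustrates one plausible reading of this architecture: each TC component is routed through a classification agent, a retrieval-augmented rewriting agent, and an iterative rule-based validation loop. This is a minimal sketch under stated assumptions; all names and signatures (classify, rewrite, retrieve, RewriteResult) are hypothetical, as the abstract does not prescribe a concrete interface.

```python
# Hypothetical sketch of an ARTeMIS-style agent pipeline; the framework's
# actual interfaces are not specified in the abstract.
from dataclasses import dataclass
from typing import Callable

COMPONENTS = ("Summary", "Initial Setup", "Test Steps")

@dataclass
class RewriteResult:
    component: str
    text: str
    valid: bool
    iterations: int

def standardize(component: str, text: str,
                classify: Callable[[str, str], bool],
                retrieve: Callable[[str], list[str]],
                rewrite: Callable[[str, str, list[str]], str],
                rules: list[Callable[[str], bool]],
                max_iters: int = 3) -> RewriteResult:
    """Route one TC component through classification, retrieval-augmented
    rewriting, and iterative rule-based validation."""
    # 1. Classification agent: leave already-standardized components untouched.
    if classify(component, text):
        return RewriteResult(component, text, True, 0)
    # 2. Semantic retrieval: fetch standardized examples as few-shot context.
    examples = retrieve(text)
    # 3. Rewriting agent + rule-based validator: retry until all rules pass
    #    or the iteration budget is exhausted.
    for i in range(1, max_iters + 1):
        text = rewrite(component, text, examples)
        if all(rule(text) for rule in rules):
            return RewriteResult(component, text, True, i)
    return RewriteResult(component, text, False, max_iters)
```

In a full implementation, classify would presumably be backed by the supervised classifier, rewrite by a structured LLM prompt, and retrieve by a semantic index over standardized TCs, matching the components named in the abstract.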
