Automated Generation of End-to-End Web Test Cases via a Generic AI Agent: A Comparative Study of DeepSeek V3 and Claude Sonnet 4

  • Caio E. O. Monteiro UFSCar
  • Lucca R. Guerino UFSCar
  • Guilherme F. Fernandes UFG
  • Marcos H. Pereira UFG
  • Juliana P. de Souza-Zinader UFG
  • Renata D. Braga UFG
  • Viviane C. B. Pocivi UFG
  • Auri M. R. Vincenzi Universidade do Porto

Abstract


Web applications are widespread and can, in theory, be accessed from anywhere using a web browser on a computer or smartphone. Testing such applications is challenging, primarily because of the diversity of web browsers and of frameworks available for developing web application interfaces. With the advent of large language models (LLMs), several works have used them to automate software engineering tasks, including test case generation; most of this effort has focused on unit testing. More recently, Generic Artificial Intelligence Agents have emerged: tools that rely on LLMs and can also run additional tools, such as cloning repositories, navigating websites, and compiling programs. In this work, which is part of a research and development project, we evaluate a specific Generic AI Agent, Suna, on its ability to navigate web applications and create fully automated end-to-end test cases using Selenium WebDriver and the JUnit 5 framework. Considering a set of nine websites, Suna configured with DeepSeek V3 produced 165 successful test cases out of 481 generated tests, a success rate of 34.3%, whereas Suna configured with Claude Sonnet 4 produced 336 successful test cases out of 479 generated tests, a success rate of 70.1%, a notable result given the complexity of building end-to-end tests. Regarding cost, we used one free and one paid LLM; the paid model generated successful test cases at an average cost of $0.15 per test case.
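To make the target artifact concrete, the sketch below shows the kind of Selenium WebDriver + JUnit 5 end-to-end test the agent is asked to generate. It is a minimal illustration, not output from the study: the application URL, element locators, credentials, and assertions are hypothetical.

// Minimal sketch of a generated E2E test (Selenium WebDriver + JUnit 5).
// URL, locators, and assertions are hypothetical illustrations.
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.time.Duration;

import static org.junit.jupiter.api.Assertions.assertTrue;

class LoginFlowE2ETest {

    private WebDriver driver;
    private WebDriverWait wait;

    @BeforeEach
    void setUp() {
        driver = new ChromeDriver();              // assumes chromedriver is on the PATH
        wait = new WebDriverWait(driver, Duration.ofSeconds(10));
    }

    @Test
    void successfulLoginShowsDashboard() {
        driver.get("https://example.com/login");  // hypothetical application under test

        driver.findElement(By.id("username")).sendKeys("demo-user");
        driver.findElement(By.id("password")).sendKeys("demo-pass");
        driver.findElement(By.cssSelector("button[type='submit']")).click();

        // Wait for the post-login page and assert on an element that only appears there.
        wait.until(ExpectedConditions.urlContains("/dashboard"));
        assertTrue(driver.findElement(By.tagName("h1")).getText().contains("Dashboard"));
    }

    @AfterEach
    void tearDown() {
        driver.quit();                            // always release the browser session
    }
}

A generated test is counted as successful when it compiles and passes against the live website; end-to-end tests like this one are brittle by nature, since they depend on concrete locators and page flows, which is what makes the reported success rates noteworthy.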

Keywords: web applications, web application testing, automatic test case generation, end-to-end testing

Published
10/11/2025
MONTEIRO, Caio E. O.; GUERINO, Lucca R.; FERNANDES, Guilherme F.; PEREIRA, Marcos H.; SOUZA-ZINADER, Juliana P. de; BRAGA, Renata D.; POCIVI, Viviane C. B.; VINCENZI, Auri M. R. Automated Generation of End-to-End Web Test Cases via a Generic AI Agent: A Comparative Study of DeepSeek V3 and Claude Sonnet 4. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 31., 2025, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 57-66. DOI: https://doi.org/10.5753/webmedia.2025.16046.