GenIA-E2ETest: A Generative AI-Based Approach for End-to-End Test Automation

  • Elvis Júnior (UFF)
  • Alan Valejo (UFSCar)
  • Jorge Valverde-Rebaza (Tecnologico de Monterrey)
  • Vânia de Oliveira Neves (UFF)

Abstract


Software testing is essential to ensure system quality, but it remains time-consuming and error-prone when performed manually. Although recent advances in Large Language Models (LLMs) have enabled automated test generation, most existing solutions focus on unit testing and do not address the challenges of end-to-end (E2E) testing, which validates complete application workflows from user input to final system response. This paper introduces GenIA-E2ETest, an approach that leverages generative AI to automatically generate executable E2E test scripts from natural language descriptions. We evaluated the approach on two web applications, assessing completeness, correctness, adaptation effort, and robustness. The results were encouraging: the generated scripts achieved an average of 77% for both element-level metrics (precision and recall), 82% execution precision, and 85% execution recall; they required minimal manual adjustments (an average manual modification rate of 10%) and showed consistent performance in typical web scenarios. Although some sensitivity to context-dependent navigation and dynamic content was observed, the findings suggest that GenIA-E2ETest is a practical and effective solution for accelerating E2E test automation from natural language, reducing manual effort and broadening access to automated testing.
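
To make the idea concrete, here is a minimal sketch of the kind of executable script such an approach aims to produce. It is illustrative only: the abstract does not specify GenIA-E2ETest's prompts, model, or target framework, so this example assumes Selenium WebDriver with pytest as the generated-script stack and uses a hypothetical natural-language scenario ("log in with valid credentials and verify the dashboard greeting") with placeholder URL and element locators.

# Illustrative sketch only: the target stack (Selenium WebDriver + pytest), the URL,
# and the locators below are assumptions, not the paper's actual generated output.
import pytest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


@pytest.fixture
def driver():
    # Headless Chrome keeps the example runnable in CI-like environments.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    drv = webdriver.Chrome(options=options)
    yield drv
    drv.quit()


def test_login_shows_dashboard_greeting(driver):
    # Step 1: open the login page (placeholder URL for the application under test).
    driver.get("https://example.com/login")

    # Step 2: fill in credentials and submit, following the natural-language scenario.
    driver.find_element(By.ID, "username").send_keys("alice")
    driver.find_element(By.ID, "password").send_keys("secret")
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()

    # Step 3: assert the expected final system response (dashboard greeting).
    greeting = WebDriverWait(driver, 10).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, ".dashboard-greeting"))
    )
    assert "Welcome" in greeting.text

The explicit wait in the final step illustrates the kind of dynamic-content handling that the abstract identifies as a remaining source of sensitivity for generated E2E scripts.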

Keywords: End-to-End Testing, Generative AI, Software Testing Automation, E2E

Published
2025-09-22
JÚNIOR, Elvis; VALEJO, Alan; VALVERDE-REBAZA, Jorge; NEVES, Vânia de Oliveira. GenIA-E2ETest: A Generative AI-Based Approach for End-to-End Test Automation. In: BRAZILIAN SYMPOSIUM ON SOFTWARE ENGINEERING (SBES), 39., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 282-292. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.9927.