Evaluating LLMs for Multimodal GUI Test Generation in Android Applications

Nayse Fagundes; Leopoldo Teixeira

doi:10.5753/sast.2025.13852

Nayse Fagundes UFPE http://orcid.org/0000-0002-3915-3245
Leopoldo Teixeira UFPE https://orcid.org/0000-0002-6154-1666

Abstract

Graphical User Interface (GUI) testing is an important task in mobile application development, but it remains time-consuming when done manually. With the rise of Large Language Models (LLMs), there is growing interest in their potential to automate software development tasks, including GUI test generation. This study investigates the ability of LLMs to generate GUI test intentions and scripts for Android applications from multimodal inputs, such as screenshots and structured UI data. We present an approach that combines visual and textual input from eight open-source Android apps and evaluate the performance of four LLMs. The results show significant variation in the models’ ability to generate GUI tests. Claude 3 Sonnet produced the most detailed and complete test sequences. GPT-4o generated simpler test scripts with fewer test intentions and user interactions, focusing on more basic user flows. Gemini 2.5 and Gemma 3 produced moderate results that were similar to each other. These findings indicate that LLMs can aid GUI test automation, but that the quality of the generated tests depends on the model used.

Keywords: Testing, GUI, LLMs
