Image2Test: Using ChatGPT to Build Manual Tests from Screenshots

Manoel Aranda III; Antônio Wagner; Eduardo Barros; Márcio Ribeiro; Alessandro Garcia; Fabio Palomba; Baldoino Fonseca; Ivan Machado

doi:10.5753/sast.2025.14358

Manoel Aranda III UFAL http://orcid.org/0000-0001-9540-1605
Antônio Wagner UFAL
Eduardo Barros UFAL https://orcid.org/0000-0002-5838-9405
Márcio Ribeiro UFAL https://orcid.org/0000-0002-4293-4261
Alessandro Garcia PUC-Rio
Fabio Palomba University of Salerno https://orcid.org/0000-0001-9337-5116
Baldoino Fonseca UFAL https://orcid.org/0000-0002-0730-0319
Ivan Machado UFBA https://orcid.org/0000-0001-9027-2293

Abstract

Background: Website layouts change often as design trends and front-end frameworks evolve. Changes to graphical interfaces can introduce errors or break features, so quality assurance is needed throughout these changes. Manual tests are the standard way to maintain quality, but they are slow and expensive, and writing them is the most time-consuming part of the process. Aims: This paper presents a tool that uses ChatGPT to create Natural Language Tests from screenshots and operator instructions. The goal is to reduce the time spent writing manual tests while maintaining the quality of both the tests and the application. Method: We used two evaluation methods. First, we surveyed 18 software testing professionals and students to compare tests written by ChatGPT with tests written by humans. Second, we used Natural Language Processing techniques to measure the similarity between ChatGPT-generated tests and human-written tests. Results: In the qualitative analysis, ChatGPT tests exceeded human tests in completeness by 5.56%, achieving a 36.11% acceptance rate. Human tests exceeded ChatGPT tests in clarity by 6.95%, reaching a 41.67% acceptance rate. In the quantitative analysis, 66.7% of ChatGPT tests had more than 50% similarity with human tests. Conclusions: Our tool can help automate the creation of software tests. The similarity between AI-generated and human-written tests shows that this approach can save time and reduce costs while keeping test quality at an acceptable level. This framework can help maintain quality as website layouts change and applications evolve.

Keywords: Natural Language Test, Manual Test, Test Case Generation, Software Testing, LLMs, ChatGPT
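The quantitative step in the abstract reports the share of generated tests that exceed a 50% similarity threshold against human-written tests. The abstract does not name the NLP techniques used, so the following is only a minimal sketch that assumes a plain term-frequency cosine similarity as a stand-in. The function names and the example test steps are hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    # Empty texts have no direction, so treat their similarity as zero.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def share_above(pairs, threshold=0.5):
    """Fraction of (generated, human) pairs whose similarity exceeds the threshold."""
    if not pairs:
        return 0.0
    hits = sum(1 for gen, human in pairs if cosine_similarity(gen, human) > threshold)
    return hits / len(pairs)


# Hypothetical test steps, invented for illustration only (assumption).
pairs = [
    ("Click the login button and enter the password",
     "Click the login button and type the password"),
    ("Open the settings page",
     "Verify that the cart total is updated"),
    ("Check that the search results are displayed",
     "Check that search results are displayed on the page"),
    ("Close the dialog",
     "Close the dialog"),
]

if __name__ == "__main__":
    for gen, human in pairs:
        print(f"{cosine_similarity(gen, human):.3f}  {gen!r} vs {human!r}")
    print(f"share above 50%: {share_above(pairs):.2%}")
```

A lexical measure like this one rewards shared wording only. An embedding-based measure would also credit paraphrases, so the threshold and the resulting share depend on which measure is chosen.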
