Combining TSL and LLM to Automate REST API Testing: A Comparative Study

Thiago Barradas; Aline Paes; Vânia de Oliveira Neves

doi:10.5753/sbes.2025.9670

Thiago Barradas (UFF)
Aline Paes (UFF)
Vânia de Oliveira Neves (UFF)

Abstract

Testing REST APIs effectively remains a considerable challenge for development teams, driven by the inherent complexity of distributed systems, the multitude of possible scenarios, and the limited time available for test design. Exhaustive testing of all input combinations is impractical, which often results in undetected failures, high manual effort, and limited test coverage. To address these issues, we introduce RestTSLLM, an approach that combines the Test Specification Language (TSL) with Large Language Models (LLMs) to automate the generation of test cases for REST APIs. The approach targets two core challenges: the creation of test scenarios and the definition of appropriate input data. The proposed solution integrates prompt engineering techniques with an automated pipeline to evaluate various LLMs on their ability to generate tests from OpenAPI specifications. The evaluation focused on success rate, test coverage, and mutation score, enabling a systematic comparison of model performance. The results indicate that the best-performing LLMs, namely Claude 3.5 Sonnet (Anthropic), DeepSeek R1 (DeepSeek), Qwen 2.5 32b (Alibaba), and Sabiá 3 (Maritaca), consistently produced robust and contextually coherent REST API tests. Among them, Claude 3.5 Sonnet outperformed all other models across every metric, emerging in this study as the most suitable model for this task. These findings highlight the potential of LLMs to automate the generation of tests based on API specifications.

Keywords: Test Automation, Large Language Models, Integration Testing, REST API Testing, AI in Software Testing, Test Generation
