A Comparative Study of LLMs for Gherkin Generation
Abstract
[Context] Behavior-Driven Development (BDD) is widely adopted, but the manual creation of Gherkin scenarios remains a significant bottleneck. While Large Language Models (LLMs) show promise for automating this task, there is little empirical evidence on their accuracy and stability when converting free-form test descriptions into structured Gherkin, creating risks for industrial adoption. Manual scenario authoring is also time-consuming and prone to inconsistencies, leading to miscommunication between technical and non-technical stakeholders and undermining software quality assurance. [Objective] This study addresses this gap by investigating the use of LLMs to automate the generation of Gherkin-based BDD scenarios from real-world, free-form test case descriptions. The goal is to assess the robustness of current models when handling the informal, ambiguous, and diverse inputs typically found in practice. [Method] We conducted a comparative evaluation of seven LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o Mini, LLaMA 3, Phi-3, Gemini, and DeepSeek R1) using zero-shot, one-shot, and few-shot prompting strategies. The models generated BDD scenarios from a stratified sample of ten test descriptions drawn from a corpus of 1,286, ensuring diversity in structure and domain complexity. We assessed quality and consistency using quantitative metrics (METEOR and variability analysis) and tested statistical significance with Repeated Measures ANOVA. [Results] The analysis revealed that simple zero-shot prompting was highly effective, achieving results comparable to more complex example-based prompting. For the top-performing model, Gemini, which balanced accuracy and stability, the difference between zero-shot and few-shot prompting was not statistically significant. Performance differences across models were often small, suggesting that practical factors such as integration effort and cost should also guide model choice. Some models showed higher output variability, raising concerns about consistency in test-generation workflows. [Conclusion] This paper offers practical insights into prompt design and model selection for LLM-based BDD scenario generation. The results show that well-crafted zero-shot prompts enable scalable, high-quality generation comparable to more complex techniques, simplifying LLM adoption in industrial testing. These findings suggest that LLMs can be leveraged with minimal setup to streamline BDD, reduce costs, and accelerate validation cycles.
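To make the evaluation pipeline concrete, the following is a minimal sketch of the workflow the abstract describes: a zero-shot prompt converts a free-form test description into a Gherkin scenario, the output is scored against a reference scenario with METEOR (here via NLTK), and per-strategy scores are compared with a Repeated Measures ANOVA (here via statsmodels). The prompt wording, the generate() stub, and the toy score table are illustrative assumptions, not the study's actual prompts, models, or data.

```python
# Illustrative sketch of the evaluation pipeline; prompt text, generate()
# stub, and toy scores are assumptions, not the study's actual artifacts.
import nltk
from nltk.translate.meteor_score import meteor_score
import pandas as pd
from statsmodels.stats.anova import AnovaRM

nltk.download("wordnet", quiet=True)  # METEOR uses WordNet for synonym matching

# Hypothetical zero-shot prompt: instructions only, no examples.
ZERO_SHOT_PROMPT = (
    "Convert the following free-form test description into a Gherkin "
    "scenario using Given/When/Then steps:\n\n{description}"
)

def generate(description: str) -> str:
    """Stub for an LLM API call; a real pipeline would send ZERO_SHOT_PROMPT
    filled with the description to the model and return its completion."""
    return (
        "Scenario: Successful login\n"
        "  Given the user is on the login page\n"
        "  When the user submits valid credentials\n"
        "  Then the dashboard is displayed"
    )

reference = (
    "Scenario: Successful login\n"
    "  Given the user is on the login page\n"
    "  When the user enters valid credentials\n"
    "  Then the user is redirected to the dashboard"
)

hypothesis = generate("Check that a user with valid credentials can log in.")

# NLTK's METEOR expects pre-tokenized input: a list of reference token
# lists plus one hypothesis token list.
score = meteor_score([reference.split()], hypothesis.split())
print(f"METEOR: {score:.3f}")

# Repeated Measures ANOVA over per-test-case METEOR scores, one row per
# (test case, prompting strategy) pair; the numbers below are toy data.
scores = pd.DataFrame({
    "test_case": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "strategy":  ["zero", "one", "few"] * 3,
    "meteor":    [0.71, 0.74, 0.73, 0.65, 0.69, 0.70, 0.80, 0.78, 0.81],
})
res = AnovaRM(scores, depvar="meteor", subject="test_case",
              within=["strategy"]).fit()
print(res.anova_table)
```

The ANOVA requires a balanced design (every test case scored under every strategy), which mirrors the repeated-measures setup the abstract describes: the same ten descriptions evaluated under each prompting strategy.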