Real-Time Feedback for BDD Test Scenarios Using AI-Based Classification

David Brandão (UPE); Denis Marques (UPE); Cleyton Rodrigues (UPE); Wylliams Santos (UPE)

doi:10.5753/sast.2025.14469

Abstract

Behavior-Driven Development (BDD) has gained widespread adoption as a means to align software behavior with stakeholder expectations, yet maintaining high-quality scenarios remains challenging at scale. Manual review of Gherkin-based steps is often slow, inconsistent, and prone to oversight, leading to structural errors, semantic inconsistencies, and reduced maintainability. To address these issues, this work proposes a hybrid automated analysis framework that combines Natural Language Processing (NLP) and Machine Learning (ML) to improve both the clarity and the correctness of BDD artifacts. The framework consists of two complementary components: a rule-based validator that checks linguistic and structural adherence to established BDD conventions, and a supervised classifier that assigns each step to one of three semantic categories (Precondition, Action, or Expected Result), regardless of its original Gherkin keyword. Models were trained on a balanced synthetic dataset of 1,500 labeled steps and validated against a large-scale industrial repository from a leading global manufacturer of laptops and mobile devices, which strengthens external validity. Performance was measured using macro-averaged accuracy, precision, recall, and F1-score, alongside statistical significance testing to compare algorithms. The best results were achieved by Support Vector Machines and gradient boosting models, which outperformed the neural and transformer-based approaches evaluated. Designed for near-real-time operation, the framework can be applied to any Gherkin-compatible library and any supported natural language, making it broadly applicable across projects. It integrates into development workflows, including pull requests and CI/CD pipelines, to provide continuous, automated feedback on BDD scenarios. The findings suggest that hybrid NLP–ML solutions are effective in scaling quality assurance for both agile test teams and DevOps teams, while reducing the manual effort required for review and maintenance.
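The following minimal Python sketch makes the rule-based component more concrete. It shows a structural validator for the steps of a single scenario. The rules it enforces are assumptions chosen for illustration, not the validator described in the paper: Given/When/Then must appear in that order, And/But inherit the previous step's keyword, and each scenario needs at least one step of each kind. The names `validate_scenario` and `Finding` are hypothetical. Only English Gherkin keywords are handled, and the supervised classifier is not reproduced.

```python
import re
from dataclasses import dataclass

# Hypothetical rule set for illustration only; not the authors' validator.
ORDER = {"Given": 0, "When": 1, "Then": 2}
CONJUNCTIONS = {"And", "But", "*"}
# Assumes English keywords; localized Gherkin keywords are not handled here.
STEP_RE = re.compile(r"^\s*(Given|When|Then|And|But|\*)\s+(.*\S)\s*$")


@dataclass
class Finding:
    line_no: int  # 1-based step line; 0 means a scenario-level finding
    message: str


def validate_scenario(lines: list[str]) -> list[Finding]:
    """Check one scenario's step lines against assumed Given/When/Then rules."""
    findings: list[Finding] = []
    previous = None  # effective keyword of the previous valid step
    seen = {"Given": 0, "When": 0, "Then": 0}
    for line_no, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        match = STEP_RE.match(raw)
        if match is None:
            findings.append(Finding(line_no, "not a recognised Gherkin step"))
            continue
        keyword, _text = match.groups()
        if keyword in CONJUNCTIONS:
            if previous is None:
                findings.append(Finding(line_no, f"'{keyword}' has no preceding step"))
                continue
            keyword = previous  # And/But/* inherit the previous keyword
        elif previous is not None and ORDER[keyword] < ORDER[previous]:
            findings.append(Finding(line_no, f"'{keyword}' appears after '{previous}'"))
        seen[keyword] += 1
        previous = keyword
    # Scenario-level check: every phase should be present at least once.
    for keyword, count in seen.items():
        if count == 0:
            findings.append(Finding(0, f"missing '{keyword}' step"))
    return findings


if __name__ == "__main__":
    good = [
        "Given the user is logged in",
        "When the user opens the cart",
        "Then the cart total is shown",
        "And the checkout button is enabled",
    ]
    print(validate_scenario(good))  # → []

    bad = ["When the user opens the cart", "Given the user is logged in"]
    for finding in validate_scenario(bad):
        print(finding.line_no, finding.message)
    # → 2 'Given' appears after 'When'
    # → 0 missing 'Then' step
```

Returning a list of findings, rather than raising on the first problem, suits the pull-request and CI/CD feedback the abstract describes. A reviewer sees every issue in a scenario at once.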

Keywords: Behavior-Driven Development, Gherkin, Natural Language Processing, Machine Learning, Hybrid Approach, Test Automation, Software Testing, Step Classification, Rule-based Validation, Continuous Integration, Continuous Delivery
