AI-Generated User Stories: Are They Good Enough?
Abstract
Large Language Models (LLMs), combined with advanced prompting techniques, have been used in requirements engineering, particularly for the automated generation of user stories. User stories are essential to agile software projects, but writing them manually is time-consuming and prone to inconsistencies, which has driven interest in automated approaches. Questions remain about the effectiveness and practical acceptance of these approaches, especially regarding the quality of the generated stories and software professionals’ perceptions of them. Furthermore, little is known about users’ perspectives on these new automation techniques and their limitations. We conducted an empirical study with 24 participants who generated 457 user stories using the US-Prompt technique. We assessed the quality of the user stories using the QUS framework and analyzed acceptance of the technique using the TAM3 model. Results showed that the US-Prompt method was effective, with 87.5% of the stories meeting more than 75% of the quality criteria. Participants found the technique easy to use and useful, although they identified limitations such as formatting inconsistencies and concerns about reliability for critical tasks. This study thus offers a provocative reflection on the use of LLMs and points to new directions for future research in this emerging area.
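As a rough illustration of how a figure such as "the share of stories meeting more than 75% of the quality criteria" can be computed, the sketch below scores each story as the fraction of criteria it satisfies and then counts the stories above the threshold. The criterion names, the toy data, and the function names are assumptions made for illustration only; they are not the study's data or its actual evaluation procedure.

```python
from typing import Dict, List

# Hypothetical per-story evaluation: criterion name -> whether the story meets it.
# The criterion names below are illustrative, not the study's rubric.
Evaluation = Dict[str, bool]


def criteria_met_ratio(evaluation: Evaluation) -> float:
    """Fraction of quality criteria that a single story satisfies."""
    return sum(evaluation.values()) / len(evaluation)


def share_above_threshold(evaluations: List[Evaluation], threshold: float = 0.75) -> float:
    """Fraction of stories whose criteria-met ratio is strictly above the threshold."""
    passing = sum(1 for e in evaluations if criteria_met_ratio(e) > threshold)
    return passing / len(evaluations)


# Toy data (assumed): four stories, each checked against four criteria.
stories = [
    {"well_formed": True, "atomic": True, "minimal": True, "full_sentence": True},
    {"well_formed": True, "atomic": True, "minimal": True, "full_sentence": False},
    {"well_formed": True, "atomic": False, "minimal": False, "full_sentence": True},
    {"well_formed": True, "atomic": True, "minimal": True, "full_sentence": True},
]

print(share_above_threshold(stories))
```

Note that the comparison is strict: a story meeting exactly 75% of the criteria does not count as meeting "more than 75%".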
Keywords:
User Story, Large Language Models, Requirements Engineering
Published
September 22, 2025
How to Cite
SANTOS, Reine; STEINMACHER, Igor; CONTE, Tayana; ORAN, Ana Carolina; GADELHA, Bruno. AI-Generated User Stories: Are They Good Enough? In: SIMPÓSIO BRASILEIRO DE ENGENHARIA DE SOFTWARE (SBES), 39., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 741-747. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.11321.
