Evaluating the Capability of Prompted LLMs to Recommend NFR from User Stories: A Preliminary Study

José R. A. Pereira; Mirko Perkusich; Felipe B. A. Ramos; Danyllo W. Albuquerque; Kyller Costa Gorgônio; Ângelo Perkusich

doi:10.5753/sbes.2025.11042

José R. A. Pereira (UFCG)
Mirko Perkusich (UFCG)
Felipe B. A. Ramos (IFPB)
Danyllo W. Albuquerque (UFCG)
Kyller Costa Gorgônio (UFCG)
Ângelo Perkusich (UFCG)

Abstract

[Context] Non-functional requirements (NFRs) are critical to software quality but are often underspecified in agile projects. Previous work proposed NFRec, a k-nearest neighbors (kNN) recommender system, to support NFR elicitation based on structured User Story metadata. [Objective] This study investigates whether Large Language Models (LLMs), when prompted with structured representations of User Stories, can generate relevant NFRs comparable to those recommended by NFRec. [Method] We reused the original dataset of 246 User Stories and adopted the same evaluation protocol. The gpt-4.1-mini model was queried using zero-shot prompting, instruction tuning, and role-playing strategies. Predicted NFRs were evaluated using effectiveness measures against the original ground truth. [Results] The LLM achieved high recall and moderate F1 performance but lower precision due to frequent overgeneration. The quality of recommendations was highly sensitive to prompt design. In several cases, the model produced plausible NFRs not present in the baseline, suggesting that traditional metrics may understate its practical value. [Conclusion] Prompted LLMs offer a viable and flexible alternative for NFR elicitation, especially in cold-start scenarios where historical data is scarce. This study serves as an initial step toward LLM-assisted requirements engineering, opening up new directions for research in prompt engineering, hybrid models, and evaluation metrics that better reflect semantic relevance and practical utility.
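The Results statement above (high recall, lower precision due to overgeneration) can be illustrated with a minimal sketch of set-based precision, recall, and F1. This is not the study's evaluation code: the function name `precision_recall_f1` and the NFR labels are hypothetical, and it assumes each User Story is scored by comparing the set of recommended NFRs against the set of ground-truth NFRs.

```python
def precision_recall_f1(predicted, expected):
    """Set-based precision, recall, and F1 for one User Story.

    Assumption: recommendations and ground truth are compared as
    unordered sets of NFR labels (exact label match).
    """
    predicted, expected = set(predicted), set(expected)
    hits = len(predicted & expected)  # correctly recommended NFRs
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels, for illustration only.
expected = {"security", "performance"}
focused = {"security", "performance", "usability"}
overgenerated = focused | {"reliability", "portability", "maintainability"}

for name, predicted in [("focused", focused), ("overgenerated", overgenerated)]:
    p, r, f1 = precision_recall_f1(predicted, expected)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy case both recommendation lists recover every expected NFR, so recall stays at 1.0, while the longer list drops precision from about 0.67 to about 0.33. Extra recommendations that are plausible but absent from the ground truth are counted as errors, which is why exact-match metrics may understate practical value.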

Keywords: Non-functional requirements, Agile projects, Large Language Models, Prompt engineering, Recommender systems
