Towards Automating User Story Classification with Large Language Models Using a Reuse-Oriented Taxonomy

Carlos E. M. de Souza; Mirko Perkusich; Emanuel Filho; Danyllo W. Albuquerque; Kyller Costa Gorgônio; Angelo Perkusich

doi:10.5753/sbes.2025.11045

Carlos E. M. de Souza UFCG
Mirko Perkusich UFCG
Emanuel Filho IFPE
Danyllo W. Albuquerque UFCG
Kyller Costa Gorgônio UFCG
Angelo Perkusich UFCG

Abstract

[Context] Agile Software Development (ASD) and reuse strategies are increasingly used to improve software productivity and maintainability. However, while reuse relies on structured and traceable artifacts, ASD often depends on informal elements such as user stories, limiting opportunities for systematic reuse. A recent taxonomy proposes classifying user stories to support traceability and asset reuse, but manual classification remains labor-intensive and error-prone. [Objective] This study investigates whether Large Language Models (LLMs) can automate the classification of user stories using a reuse-oriented taxonomy, reducing manual effort while preserving annotation quality. [Method] We adopted an explanatory sequential mixed-methods approach. First, a two-step prompting protocol was applied to classify user stories from 12 real-world projects using GPT-4-turbo. Then, we compared model outputs to expert annotations, measuring agreement and qualitatively analyzing disagreements to identify causes and propose corrective actions. [Results] The LLM achieved a 48.1% agreement rate with human labels, with project-specific performance ranging from 14.0% to 84.4%. Notably, in 46% of disagreement cases, the LLM’s classification was judged more appropriate than the human label, and in only 25% was the human label judged correct, highlighting inconsistencies in the human annotation process despite prior validation. [Conclusion] These initial findings suggest that LLMs can effectively assist in classifying user stories for reuse purposes. Beyond reducing labeling effort, they show potential as reviewers in collaborative workflows to improve consistency, transparency, and the overall quality of software artifact organization.

Keywords: Agile Software Development, User Stories Classification, Software Requirements Classification, Large Language Models, Prompt Engineering, Software Reuse
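
The agreement figures reported in the abstract (an overall rate plus a per-project range) can be illustrated with a minimal sketch. It assumes agreement means exact label match between the expert and the LLM for each user story. The function name `agreement_rates`, the record layout, and the sample labels are hypothetical and are not taken from the study's materials.

```python
from collections import defaultdict


def agreement_rates(records):
    """Compute overall and per-project exact-match agreement.

    `records` is an iterable of (project, human_label, llm_label) tuples.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)   # stories seen per project
    matches = defaultdict(int)  # stories where both labels are equal
    for project, human_label, llm_label in records:
        totals[project] += 1
        if human_label == llm_label:
            matches[project] += 1
    if not totals:
        raise ValueError("no records to compare")
    per_project = {p: matches[p] / totals[p] for p in totals}
    overall = sum(matches.values()) / sum(totals.values())
    return overall, per_project


# Hypothetical data: two projects with placeholder category labels.
sample = [
    ("p1", "A", "A"),
    ("p1", "B", "A"),
    ("p2", "A", "A"),
    ("p2", "B", "B"),
    ("p2", "C", "A"),
    ("p2", "C", "C"),
]
overall, per_project = agreement_rates(sample)
```

Reporting the per-project values alongside the overall rate matters because, as the abstract notes, a single aggregate can hide a wide spread between projects.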
