LLM-Assisted INVEST Evaluation and Improvement of User Stories: An Industrial Replication Study
Abstract
The specification and maintenance of high-quality user stories are critical in agile software development, yet they are often hindered by natural-language ambiguity, evolving business requirements, and the effort that manual backlog refinement demands in industrial settings. This paper investigates the use of large language models (LLMs), specifically GPT-5.1, to support the automated evaluation and improvement of user-story quality under the INVEST framework. Building on prior expert-based assessments, we propose a human-in-the-loop procedure that combines LLM automation with requirements-engineering expertise. We conduct an industrial replication study with 49 real user stories from a scholarship-management system, preserving the evaluation–improvement–reevaluation design of the prior expert-based work. Results show substantial alignment between GPT-5.1 and expert judgments, with strong semantic agreement and convergence in key INVEST dimensions. In the initial evaluation, GPT-5.1 assigns slightly lower scores than the experts for Independent, Negotiable, Estimable, and Small, with moderate monotonic correlations (ρ ≈ 0.53–0.65). After the improvement cycle, expert medians reach 5 on all INVEST criteria, and GPT-5.1 converges strongly on Valuable and Testable while remaining more conservative on Independent and Small; semantic agreement exceeds 85–90% in most dimensions. These findings indicate that GPT-5.1 could reduce manual assessment effort, reinforce structural quality, and support consistent requirements evaluation, while underscoring the complementary role of human oversight in industrial requirements-engineering workflows.
References
Belzner, L., Gabor, T. and Wirsing, M. (2023) "Large language model assisted software engineering: prospects, challenges, and a case study", In: International Conference on Bridging the Gap between AI and Reality (pp. 355-374). Springer Nature Switzerland.
Bosch, J. (2014) "Continuous software engineering: An introduction", In: Bosch, J. (eds) Continuous Software Engineering (pp. 3-13). Springer, Cham.
Bourque, P. and Fairley, R. E. (eds.) (2014) Guide to the Software Engineering Body of Knowledge (SWEBOK), Version 3.0, IEEE Computer Society.
Fitzgerald, B. and Stol, K. (2017) "Continuous software engineering: A roadmap and agenda", In Journal of Systems and Software, 123, 176-189.
Hernández-Agüero, E., Quesada-López, C. and Chaves-Sánchez, J. P. (2024) “Integración de Enfoques Ágiles para el Mejoramiento Continuo de Procesos de Software”, In: 13th CIMPS, IEEE, pp. 01–14.
Hernández-Agüero, E., Quesada-López, C. and Chaves-Sánchez, J. (2025) “Evaluación de la Calidad de Historias de Usuario Usando Modelos de Lenguaje de Gran Tamaño: Un Estudio en la Industria”, In: Anais do XXVIII Congresso Ibero-Americano em Engenharia de Software, pp. 45–59, Porto Alegre, SBC.
Hernández-Agüero, E., Quesada-López, C. and Chaves-Sánchez, J. (2026) “Assessing and Improving the Quality of User Stories Using Large Language Models: An Empirical Study in an Industrial Context”. In press.
Krishna, M., Gaur, B., Verma, A. and Jalote, P. (2024) "Using LLMs in software requirements specifications: an empirical evaluation", In: 2024 IEEE 32nd International Requirements Engineering Conference (RE) (pp. 475-483). IEEE.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R. and Zhu, C. (2023) “G-EVAL: NLG Evaluation Using GPT-4 with Better Human Alignment”, arXiv preprint arXiv:2303.16634.
Marques, N., Silva, R. and Bernardino, J. (2024) "Using ChatGPT in software requirements engineering: A comprehensive review", In: Future Internet, 16(6), 180.
Ronanki, K., Berger, C. and Horkoff, J. (2023) "Investigating ChatGPT’s potential to assist in requirements elicitation processes", In: 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA) (pp. 354-361). IEEE.
Roumeliotis, K., Tselikas, N. and Nasiopoulos, D. (2024) "LLMs in e-commerce: a comparative analysis of GPT and LLaMA models in product review evaluation", In: Natural Language Processing Journal, 6, 100056.
Santos, R., Steinmacher, I., Conte, T., Oran, A. C. and Gadelha, B. (2025) “Adoption of LLMs in Requirements Engineering: What Practitioners Are Worried About?”, In: Simpósio Brasileiro de Qualidade de Software (SBQS), pp. 248–258, SBC.
Shull, F. J., Carver, J. C., Vegas, S. and Juristo, N. (2008) “The Role of Replications in Empirical Software Engineering”, ESEJ, 13(2), pp. 211–218.
Wang, Y., Zhang, X., Li, Z., Chen, Z. and Wang, X. (2025) “Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering”.
Zhang, Z., Rayhan, M., Herda, T., Goisauf, M. and Abrahamsson, P. (2024) "LLM-based agents for automating the enhancement of user story quality: An early report", In: International Conference on Agile Software Development (pp. 117-126). Springer.
