Quality Evaluation of Software Functional Requirements Generated by LLMs: A Systematic Mapping Study

  • Lucas M. Lourenço, UFC
  • Carla Bezerra, UFC
  • Allan P. Marques, UFC

Abstract


Research Context: Large Language Models (LLMs) are increasingly applied in software engineering, particularly to generate functional requirements for information systems. Scientific and/or Practical Problem: The quality of these requirements still raises concerns, including ambiguity, incompleteness, and dependence on prompts. Proposed Solution and/or Analysis: This work conducts a systematic mapping study to analyze how the literature has evaluated requirements generated by LLMs. Related IS Theory: The work is aligned with the sociotechnical perspective, which treats information systems requirements as both technical and social artifacts. Research Method: A systematic mapping study was conducted, reviewing 1,875 studies and selecting 51 primary studies published between 2020 and 2025. Summary of Results: The findings indicate diverse evaluation practices that combine traditional criteria with NLP metrics. These practices offer advantages in automation and standardization, but they are limited by the absence of benchmarks and of validation in industrial contexts, especially in the development of information systems. Contributions and Impact to the IS area: The study contributes to academia by consolidating evaluation approaches and identifying gaps, and to industry by clarifying the risks and opportunities of using LLMs in the requirements engineering of information systems.

Published
2026-05-25
LOURENÇO, Lucas M.; BEZERRA, Carla; MARQUES, Allan P. Quality Evaluation of Software Functional Requirements Generated by LLMs: A Systematic Mapping Study. In: BRAZILIAN SYMPOSIUM ON INFORMATION SYSTEMS (SBSI), 22., 2026, Vitória/ES. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 161-180. DOI: https://doi.org/10.5753/sbsi.2026.248317.