Ensuring Test Case Coverage through Multilabel Requirements Classification: A Feasibility Study with Pre-Trained Models
Abstract
Context: In a software development project, the developers' main task is to understand the requirements and then implement them without bugs. This paper is set in the context of a test team at a software institute that receives new requirements daily. To verify that the requirements were correctly implemented, test cases are created from these requirements and used to confirm that the software works as intended. Problem: Each requirement must be associated with one or more test cases, which allows tracking and ensures validation during the test process. However, manually mapping requirements to their associated test cases demands time and effort. Solution: This work presents a study on the feasibility of using the pre-trained models BERT, Electra, RoBERTa, DeBERTa, and XLNet for multi-label classification of requirements, identifying the test cases associated with each requirement. Method: For this study, we considered requirements received by the test team from 2024 to April 2025, and 38 associated test cases. After collecting and preparing the data, we trained the models and measured their performance with the following metrics: Hamming Loss, Jaccard, Precision, Recall, and F1-Score. Results: All models achieved low Hamming Loss values, especially the BERT model, with a Hamming Loss of 0.012. The Friedman and Conover tests indicated a statistically significant difference between the models. Conclusions: The results of this research show that it is possible to use pre-trained models to classify requirements according to their associated test cases.
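To make the two multi-label metrics named above concrete, the following is a minimal sketch of how Hamming Loss and the sample-averaged Jaccard score can be computed for a requirement-to-test-case label matrix. The toy data (3 requirements, 4 test cases) and the function names are illustrative assumptions, not the study's data or implementation.

```python
from typing import List

Matrix = List[List[int]]  # one row per requirement, one 0/1 column per test case


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of (requirement, test case) label slots that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def jaccard_samples(y_true: Matrix, y_pred: Matrix) -> float:
    """Mean over requirements of |true AND pred| / |true OR pred|."""
    scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(true_row, pred_row) if t == 1 and p == 1)
        union = sum(1 for t, p in zip(true_row, pred_row) if t == 1 or p == 1)
        # Convention assumed here: two empty label sets count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: 3 requirements, 4 candidate test cases.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 1]]

print(f"Hamming Loss: {hamming_loss(y_true, y_pred):.3f}")
print(f"Jaccard (sample average): {jaccard_samples(y_true, y_pred):.3f}")
```

A lower Hamming Loss is better, since it counts wrongly assigned or wrongly omitted test cases, whereas a higher Jaccard score is better, since it measures the overlap between the predicted and true sets of test cases for each requirement.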
