Using Large Language Models to Classify Test Case Complexity with Explainability
Abstract
The classification of black-box test case complexity is a key task in software testing, enabling resource prioritization and the identification of edge scenarios. In this work, we propose a three-stage LLM-based pipeline that integrates explanation generation into the classification process, treating justifications as central to model decision-making. Our approach formulates the task as a conditional generation problem, where LLMs are guided to first produce a rationale and then a complexity label. Experimental results demonstrate that this strategy improves both predictive accuracy and explainability compared to using LLMs directly for classification. We show that LLM-generated justifications not only enhance user trust but also contribute to more consistent and explainable decisions.
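The rationale-then-label formulation described above can be sketched as a prompt template plus a response parser. This is an illustrative assumption, not the paper's actual prompts: the template wording, the three-level label set, and the helper names are all hypothetical.

```python
# Hypothetical sketch of "generate a rationale first, then a complexity
# label" as conditional generation. The template and label set are
# assumptions for illustration; the paper's real prompts may differ.

LABELS = {"low", "medium", "high"}  # assumed complexity scale


def build_prompt(test_case: str) -> str:
    """Ask the model to justify its decision before committing to a label."""
    return (
        "You are a software testing expert.\n"
        "First write a short RATIONALE explaining the complexity of the\n"
        "black-box test case below, then output a single LABEL line\n"
        "(low, medium, or high).\n\n"
        f"Test case:\n{test_case}\n\n"
        "RATIONALE:"
    )


def parse_response(text: str) -> tuple[str, str]:
    """Split a model response into (rationale, label)."""
    rationale, _, tail = text.partition("LABEL:")
    label = tail.strip().split()[0].lower() if tail.strip() else ""
    if label not in LABELS:
        raise ValueError(f"unrecognized label: {label!r}")
    return rationale.strip(), label
```

Because the label is generated after the justification, the justification conditions the final decision, which is the core idea the abstract attributes to the pipeline.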
Keywords:
Test Case Complexity, Large Language Models, Prompt Engineering
References
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
AI@Meta. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI]
Antonia Bertolino. 2007. Software Testing Research: Achievements, Challenges, Dreams. In Future of Software Engineering (FOSE ’07). 85–103. DOI: 10.1109/FOSE.2007.25
Nicolas Antonio Cloutier and Nathalie Japkowicz. 2023. Fine-tuned generative LLM oversampling can improve performance over traditional techniques on multiclass imbalanced text classification. In 2023 IEEE International conference on big data (BigData). IEEE, 5181–5186.
Naihao Deng, Yikai Liu, Mingye Chen, Winston Wu, Siyang Liu, Yulong Chen, Yue Zhang, and Rada Mihalcea. 2023. EASE: An Easily-Customized Annotation System Powered by Efficiency Enhancement Mechanisms. arXiv:2305.14169 [cs.HC]
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314 [cs.LG]
Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William Cohen. 2019. Handling Divergent Reference Texts when Evaluating Table-to-Text Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 4884–4895. DOI: 10.18653/v1/P19-1483
Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretable Machine Learning. arXiv:1702.08608 [stat.ML]
Evgenia Gkintoni, Hera Antonopoulou, Andrew Sortwell, and Constantinos Halkiopoulos. 2025. Challenging Cognitive Load Theory: The Role of Educational Neuroscience and Artificial Intelligence in Redefining Learning Efficacy. Brain Sciences 15, 2 (2025), 203.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916 [cs.CL]
Dragan Milicev. 2007. On the Semantics of Associations and Association Ends in UML. IEEE Transactions on Software Engineering 33, 4 (2007), 238–251. DOI: 10.1109/TSE.2007.37
Anders Giovanni Møller, Jacob Aarup Dalsgaard, Arianna Pera, and Luca Maria Aiello. 2023. The parrot dilemma: Human-labeled vs. LLM-augmented data in classification tasks. arXiv preprint arXiv:2304.13861 (2023).
Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 1135–1144. DOI: 10.1145/2939672.2939778
Gemma Team. 2025. Gemma 3 Technical Report. arXiv:2503.19786 [cs.CL]
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
Sowmya Vajjala and Shwetali Shimangaud. 2025. Text Classification in the LLM Era-Where do we stand? arXiv preprint arXiv:2502.11830 (2025).
Zhiqiang Wang, Yiran Pang, and Yanbin Lin. 2024. Smart Expert System: Large Language Models as Text Classifiers. arXiv e-prints (2024), arXiv–2405.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. 2022. Chain of Thought Prompting Elicits Reasoning in Large Language Models. CoRR abs/2201.11903 (2022). arXiv:2201.11903
Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, and Ninghao Liu. 2025. Self-regularization with latent space explanations for controllable llm-based classification. arXiv preprint arXiv:2502.14133 (2025).
Yuhang Wu, Yingfei Wang, Chu Wang, and Zeyu Zheng. 2024. Large Language Model Enhanced Machine Learning Estimators for Classification. arXiv preprint arXiv:2405.05445 (2024).
Yazhou Zhang, Mengyao Wang, Chenyu Ren, Qiuchi Li, Prayag Tiwari, Benyou Wang, and Jing Qin. 2024. Pushing The Limit of LLM Capacity for Text Classification. arXiv:2402.07470 [cs.CL]
Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. 2023. Explainability for Large Language Models: A Survey. arXiv:2309.01029 [cs.CL]
Published
22/09/2025
How to Cite
CUSTÓDIO, Tiago; CARVALHO, André; SANTOS, Maikon; SOARES, Yan; MELO, Hallyson; FERREIRA, Nikson; MARQUES, Rodrigo. Using Large Language Models to Classify Test Case Complexity with Explainability. In: SIMPÓSIO BRASILEIRO DE ENGENHARIA DE SOFTWARE (SBES), 39., 2025, Recife/PE. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 797-803. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.11575.
