Improving Bug Reporting by Fine-Tuning the T5 Model: An Evaluation in a Software Industry

  • Davi Gonzaga, Sidia Institute of Science and Technology
  • Leonardo Tiago, Sidia Institute of Science and Technology
  • Ana Paula Silva, Sidia Institute of Science and Technology
  • Flávia Oliveira, Sidia Institute of Science and Technology
  • Lennon Chaves, Sidia Institute of Science and Technology

Abstract
Context: Bug reporting is essential in software development to ensure product quality. Within testing teams, testers report multiple bugs daily through manual reports describing test details. Problem: Manual bug reporting is exhausting and time-consuming, which can lead to errors and missing information. These issues can compromise the product if developers cannot understand the bug’s cause. Solution: We explore training Large Language Models (LLMs) to automatically generate bug reports from brief user descriptions. Methodology: We collected 1,800 bugs reported in 2024 to build a dataset and fine-tuned the T5 (Text-to-Text Transfer Transformer) model. Three fine-tuning iterations were performed and evaluated using the BLEU, METEOR, Precision, Recall, and F1-Score metrics. Summary of Results: The best-performing model, trained on two low-complexity bug report types, achieved a BLEU of 0.9845, METEOR of 0.9634, Precision of 0.9898, Recall of 0.9886, and F1-Score of 0.9892. In testing, this model achieved a 70% success rate, producing responses that matched the user input without hallucinations. The results indicate that LLMs are a viable option for automatic bug report generation.
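
The paper does not publish its training code; the sketch below shows one way the described fine-tuning could be set up with the Hugging Face transformers library, framing the task as text-to-text with the tester's brief description as input and the full bug report as target. The t5-small checkpoint, the "generate bug report:" prefix, and the toy example pair are assumptions for illustration, not the authors' actual pipeline.

```python
# Hedged sketch of the described fine-tuning; not the authors' actual code.
from datasets import Dataset
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, T5ForConditionalGeneration,
                          T5TokenizerFast)

# Hypothetical toy pair; the real dataset holds 1,800 bug reports from 2024.
pairs = [{"description": "Camera app crashes when switching to video mode",
          "report": "Title: Camera crash on video switch\n"
                    "Steps: 1. Open camera 2. Tap video\n"
                    "Expected: video mode opens\nActual: app crashes"}]
dataset = Dataset.from_list(pairs)

tokenizer = T5TokenizerFast.from_pretrained("t5-small")   # checkpoint size assumed
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def preprocess(example):
    # Text-to-text framing: short description in, full bug report out.
    inputs = tokenizer("generate bug report: " + example["description"],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=example["report"], max_length=512, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="t5-bug-reports",
                                  per_device_train_batch_size=4,
                                  num_train_epochs=3),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```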
Keywords: Bug Report, Large Language Models, Software Testing
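
The metrics named in the abstract can be computed for a single generated report roughly as follows. This is an assumed implementation using NLTK, with Precision, Recall, and F1-Score read here as token-level overlap between the generated and reference reports; the paper's exact computation may differ.

```python
# Hedged sketch of the reported metrics; the paper's exact computation may differ.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from nltk.translate.meteor_score import meteor_score  # needs nltk.download("wordnet")

# Hypothetical reference (ground-truth report) and model output, tokenized.
reference = "Title: Camera crash on video switch Steps: open camera tap video".split()
candidate = "Title: Camera crash when switching to video Steps: open camera tap video".split()

bleu = sentence_bleu([reference], candidate,
                     smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([reference], candidate)

# Token-overlap Precision/Recall/F1: one plausible reading of the reported scores.
common = set(reference) & set(candidate)
precision = len(common) / len(set(candidate))
recall = len(common) / len(set(reference))
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f"BLEU={bleu:.4f} METEOR={meteor:.4f} "
      f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```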

Published
2025-09-22
GONZAGA, Davi; TIAGO, Leonardo; SILVA, Ana Paula; OLIVEIRA, Flávia; CHAVES, Lennon. Improving Bug Reporting by Fine-Tuning the T5 Model: An Evaluation in a Software Industry. In: BRAZILIAN SYMPOSIUM ON SYSTEMATIC AND AUTOMATED SOFTWARE TESTING (SAST), 10., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 141-143. DOI: https://doi.org/10.5753/sast.2025.13591.