Evaluating Large Language Models on the Classification of Different Technical Debt Types in Stack Overflow Discussions

  • Lucas Amaral UECE
  • Eliakim Gama UECE
  • Matheus Paixao UECE
  • Lucas Aguiar UECE

Abstract


Technical Debt (TD) refers to suboptimal decisions made during software development that offer short-term benefits at the cost of long-term maintainability. Managing TD is critical for ensuring the sustainability of software systems, especially as projects evolve. While prior research has leveraged machine learning techniques to identify TD in data from platforms such as Stack Overflow (SO), those approaches have shown limited performance. To address these limitations, this study investigates the effectiveness of transformer-based Large Language Models (LLMs) for the automated identification and classification of TD types in SO discussions. We evaluated three prominent LLMs, BERT, BART, and GPT-2, on their ability to classify multiple TD types. Our contributions are: (i) a reproducible training and evaluation pipeline on an SO TD dataset, and (ii) a comparison against prior studies. The models reach up to 85% F1 and an average F1 of 78.6%, outperforming previously reported results by 8–23%.
Keywords: Technical Debt, Stack Overflow, Large Language Models
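
To make the kind of pipeline the abstract describes concrete, the sketch below shows one way to fine-tune and evaluate a classifier with the Hugging Face Transformers library (Wolf et al. 2020, listed in the references). It is a minimal illustration under stated assumptions, not the authors' replication package: the input file so_td_posts.csv, its text and label columns, and all hyperparameters are hypothetical.

import numpy as np
import pandas as pd
import torch
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical dataset (not from the paper): one labeled SO discussion per row,
# with columns "text" (the discussion) and "label" (a TD type).
df = pd.read_csv("so_td_posts.csv")
labels = sorted(df["label"].unique())
label2id = {lab: i for i, lab in enumerate(labels)}
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

class TDDataset(torch.utils.data.Dataset):
    """Wraps tokenized SO posts and integer-encoded TD type labels."""
    def __init__(self, frame):
        self.enc = tokenizer(list(frame["text"]), truncation=True,
                             padding=True, max_length=512)
        self.labels = [label2id[lab] for lab in frame["label"]]
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

def macro_f1(eval_pred):
    # Macro-averaged F1 weighs every TD type equally, regardless of frequency.
    logits, gold = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(gold, preds, average="macro")}

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="td-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=TDDataset(train_df),
    eval_dataset=TDDataset(test_df),
    compute_metrics=macro_f1,
)
trainer.train()
print(trainer.evaluate())   # reports eval_macro_f1 on the held-out split

Substituting facebook/bart-base or gpt2 for bert-base-uncased covers the other two model families compared in the paper; for GPT-2 one would additionally set tokenizer.pad_token = tokenizer.eos_token (and model.config.pad_token_id), since GPT-2 ships without a padding token.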

References

William Aiken, Paul K. Mvula, Paula Branco, Guy-Vincent Jourdan, Mehrdad Sabetzadeh, and Herna Viktor. 2023. Measuring Improvement of F1-Scores in Detection of Self-Admitted Technical Debt. IEEE Computer Society.

Nicolli S. R. Alves, Thiago S. Mendes, Manoel G. de Mendonça, Rodrigo O. Spínola, Forrest Shull, and Carolyn Seaman. 2016. Identification and management of technical debt: A systematic mapping study. Information and Software Technology (2016).

Lucas Amaral. 2025. Replication package: “Evaluating Large Language Models on the Classification of Different Technical Debt Types in Stack Overflow Discussions”. [link]

Alan Bandeira, Carlos Alberto Medeiros, Matheus Paixao, and Paulo Henrique Maia. 2019. We Need to Talk about Microservices: an Analysis from the Discussions on StackOverflow. MSR (2019).

Anton Barua, Stephen W Thomas, and Ahmed E Hassan. 2014. What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering (2014).

Nathan Brown, Yuanfang Cai, Yanyan Guo, Rick Kazman, Michael Kim, Philippe Kruchten, Eun-Young Lim, Alan MacCormack, Robert Nord, and Ipek Ozkaya. 2010. Managing technical debt in software-reliant systems. In Proceedings of the FSE/SDP workshop on Future of Software Engineering Research.

Baptiste Rozière et al. 2023. Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950 (2023).

Ward Cunningham. 1992. The WyCash Portfolio Management System. ACM SIGPLAN OOPS Messenger (1992).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

Joshua Aldrich Edbert, Sahrima Jannat Oishwee, Shubhashis Karmakar, Zadia Codabux, and Roberto Verdecchia. 2023. Exploring Technical Debt in Security Questions on Stack Overflow. arXiv preprint arXiv:2307.11387 (2023).

Eliakim Gama, Mariela I. Cortés, Matheus Paixao, and Adson Damasceno. 2023. Machine Learning for the Identification and Classification of Technical Debt Types on StackOverflow Discussions. In Brazilian Workshop on Intelligent Software Engineering (ISE).

Eliakim Gama, Sávio Freire, Manoel Mendonça, Rodrigo O. Spínola, Matheus Paixao, and Mariela I. Cortés. 2020. Using Stack Overflow to Assess Technical Debt Identification on Software Projects. In Proceedings of the 34th Brazilian Symposium on Software Engineering (SBES).

Eliakim Gama, Matheus Paixao, Emmanuel Sávio Silva Freire, and Mariela Inés Cortés. 2019. Technical Debt’s State of Practice on Stack Overflow: A Preliminary Study. In Proceedings of the XVIII Brazilian Symposium on Software Quality.

Bhawna Jain, Gunika Goyal, and Mehak Sharma. 2024. Evaluating Emotional Detection Classification Capabilities of GPT-2 & GPT-Neo Using Textual Data. In 2024 14th International Conference on Cloud Computing, Data Science & Engineering (Confluence).

Nicholas Kozanidis, Roberto Verdecchia, and Emitzá Guzmán. 2022. Asking about Technical Debt: Characteristics and Automatic Identification of Technical Debt Questions on Stack Overflow. In International Symposium on Empirical Software Engineering and Measurement (ESEM).

Philippe Kruchten, Robert L. Nord, and Ipek Ozkaya. 2012. Technical debt: From metaphor to theory and practice. IEEE Software (2012).

Max Kuhn and Kjell Johnson. 2013. Applied Predictive Modeling. Springer.

Pedro Lambert, Lucila Ishitani, and Laerte Xavier. 2024. On the Identification of Self-Admitted Technical Debt with Large Language Models. In Anais do XXXVIII Simpósio Brasileiro de Engenharia de Software. SBC.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).

Zengyang Li, Paris Avgeriou, and Peng Liang. 2015. A systematic mapping study on technical debt and its management. Journal of Systems and Software (2015).

Erin Lim, Nitin Taksande, and Carolyn Seaman. 2012. A balancing act: What software practitioners have to say about technical debt. IEEE Software (2012).

Sarah Meldrum, Sherlock A Licorish, and Bastin Tony Roy Savarimuthu. 2017. Crowdsourced knowledge on Stack Overflow: A systematic mapping study. In Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering.

Nicolli Rios, Manoel Gomes de Mendonça Neto, and Rodrigo Oliveira Spínola. 2018. A tertiary study on technical debt: Types, management strategies, research trends, and base information for practitioners. Information and Software Technology (2018).

Mohammad Sadegh Sheikhaei, Yuan Tian, Shaowei Wang, and Bowen Xu. 2024. An Empirical Study on the Effectiveness of Large Language Models for SATD Identification and Classification. arXiv preprint arXiv:2405.06806 (2024).

Peeradon Sukkasem, Chitsutha Soomlek, and Chanon Dechsupa. 2025. LLM-Based Code Comment Summarization: Efficacy Evaluation and Challenges. In 2025 17th International Conference on Knowledge and Smart Technology (KST).

Mark Swillus and Andy Zaidman. 2023. Sentiment overflow in the testing stack: Analyzing software testing posts on Stack Overflow. Journal of Systems and Software (2023).

Amjed Tahir, Aiko Yamashita, Sherlock Licorish, Jens Dietrich, and Steve Counsell. 2018. Can you tell me if it smells? A study on how developers discuss code smells and anti-patterns in Stack Overflow. In Proceedings of the 22nd International Conference on Evaluation and Assessment in Software Engineering 2018.

Edith Tom, Aybüke Aurum, and Richard Vidgen. 2013. An exploration of technical debt. Journal of Systems and Software (2013).

Michele Tufano, Cody Watson, Gustavo White-Martins, and Denys Poshyvanyk. 2023. An Empirical Study on the Use of Large Language Models for Code-related Tasks. Empirical Software Engineering (2023).

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

Xin Xia, Lingfeng Bao, David Lo, Pavneet Singh Kochhar, Ahmed E Hassan, and Zhenchang Xing. 2017. What do developers search for on the web? Empirical Software Engineering (2017).

Yaqin Zhang, Xiang Ren, Shihan Wang, Ziyue Chen, Hongyu Zhang, and David Lo. 2023. A Survey of Large Language Models in Software Engineering. arXiv preprint arXiv:2307.04743 (2023).
Published
2025-09-23
AMARAL, Lucas; GAMA, Eliakim; PAIXAO, Matheus; AGUIAR, Lucas. Evaluating Large Language Models on the Classification of Different Technical Debt Types in Stack Overflow Discussions. In: BRAZILIAN WORKSHOP ON INTELLIGENT SOFTWARE ENGINEERING (ISE), 4., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1-6. DOI: https://doi.org/10.5753/ise.2025.14868.