RefModel: Detecting Refactorings using Foundation Models

Pedro Simões; Rohit Gheyi; Rian Melo; Jonhnanthan Oliveira; Márcio Ribeiro; Wesley K. G. Assunção

doi:10.5753/sbes.2025.11582

Pedro Simões, UFCG
Rohit Gheyi, UFCG
Rian Melo, UFCG
Jonhnanthan Oliveira, UFCG
Márcio Ribeiro, UFAL
Wesley K. G. Assunção, NCSU

Abstract

Refactoring is a common software engineering practice that improves code quality without altering program behavior. Although tools such as ReExtractor+, RefactoringMiner, and RefDiff detect refactorings automatically, they rely on complex rule definitions and static analysis, which makes them difficult to extend and to generalize to other programming languages. In this paper, we investigate the viability of using foundation models for refactoring detection, implemented in a tool named RefModel. We evaluate Phi4-14B and Claude 3.5 Sonnet on a dataset of 858 single-operation transformations applied to artificially generated Java programs, covering widely used refactoring types. We then extend the evaluation with Gemini 2.5 Pro and o4-mini-high, assessing all models on 44 real-world refactorings extracted from four open-source projects. The models are compared against RefactoringMiner, RefDiff, and ReExtractor+. RefModel is competitive with, and in some cases outperforms, these traditional tools. In real-world settings, Claude 3.5 Sonnet and Gemini 2.5 Pro jointly identified 97% of all refactorings, surpassing the best-performing static-analysis-based tools. The models also showed encouraging generalization to Python and Golang. In addition, they provide natural language explanations and require only a single sentence to define each refactoring type.
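To illustrate what defining a refactoring type with a single sentence could look like in practice, the following is a minimal sketch, not RefModel's actual implementation. The definition sentences, the prompt wording, and the function names `build_prompt` and `parse_answer` are all assumptions made for illustration; the call to a foundation model is deliberately left out.

```python
# Hypothetical one-sentence definitions; the exact wording used by RefModel
# is not given in this abstract.
DEFINITIONS = {
    "Rename Method": (
        "A method keeps its body and parameters but is given a new name."
    ),
    "Extract Method": (
        "A fragment of a method body is moved into a new method "
        "that the original method now calls."
    ),
}


def build_prompt(refactoring_type: str, before: str, after: str) -> str:
    """Assemble a detection prompt from a one-sentence definition and two code versions."""
    definition = DEFINITIONS[refactoring_type]
    return (
        f"Definition of {refactoring_type}: {definition}\n\n"
        f"Code before:\n{before}\n\n"
        f"Code after:\n{after}\n\n"
        f"Was {refactoring_type} applied? Answer YES or NO, then explain briefly."
    )


def parse_answer(reply: str) -> bool:
    """Interpret a model reply (assumed format): detection holds if it starts with YES."""
    return reply.strip().upper().startswith("YES")
```

Under these assumptions, supporting a new refactoring type only requires adding one entry to `DEFINITIONS`, in contrast to writing new detection rules for a static-analysis tool.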

Keywords: Refactoring Detection, Foundation Models, RefModel
