Refactoring Python Code with LLM-Based Multi-Agent Systems: An Empirical Study in ML Software Projects

  • Alexander Puma Pucho UNICAMP
  • Alexandre Mello Ferreira UNICAMP
  • Elder José Reioli Cirilo UFSJ
  • Bruno B. P. Cafeo UNICAMP

Abstract

Refactoring is essential for improving software maintainability, yet it often remains a validation-intensive, developer-guided task, particularly in Python projects shaped by the fast-paced experimentation and iterative workflows common in the machine learning (ML) domain. Recent advances in large language models (LLMs) have introduced new possibilities for automating refactoring, but many existing approaches rely on single-model prompting and lack structured coordination or task specialization. This study presents an empirical evaluation of a modular LLM-based multi-agent system (LLM-MAS), orchestrated through the MetaGPT framework, which enables sequential coordination and reproducible communication among specialized agents for static analysis, refactoring strategy planning, and code transformation. The system was applied to 1,719 Python files drawn from open-source ML repositories, and its outputs were compared against both the original and human-refactored versions using eight static metrics related to complexity, modularity, and code size. Results show that the system consistently produces more compact and modular code, with measurable reductions in function length and structural complexity. However, the absence of a validation agent led to 281 syntactically invalid outputs, reinforcing the importance of incorporating semantic and syntactic verification to ensure transformation correctness and build trust in automated refactoring. These findings highlight the potential of LLM-based multi-agent systems to automate structural code improvements and establish a foundation for future domain-aware refactoring in ML software.
Keywords: Code Refactoring, Large Language Models, Multi-Agent Systems, Software Maintenance Automation, Machine Learning Projects
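The abstract notes that 281 outputs were syntactically invalid for lack of a validation agent, and that function length was among the static metrics compared. As a minimal sketch of what such checks involve (not the authors' implementation), both can be done with Python's standard-library `ast` module: parsing the transformed source catches syntax errors, and node line spans give per-function lengths.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if the source parses as syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def max_function_length(source: str) -> int:
    """Longest function span in lines; 0 if the file has no functions
    or does not parse."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0
    lengths = [
        node.end_lineno - node.lineno + 1
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return max(lengths, default=0)
```

A validation agent in such a pipeline could gate each transformed file on `is_valid_python` before metric comparison, rejecting (or re-prompting for) any output that fails to parse; semantic equivalence would still require tests or differential execution beyond this syntactic gate.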

Published: 2025-09-22

PUCHO, Alexander Puma; FERREIRA, Alexandre Mello; CIRILO, Elder José Reioli; CAFEO, Bruno B. P. Refactoring Python Code with LLM-Based Multi-Agent Systems: An Empirical Study in ML Software Projects. In: BRAZILIAN SYMPOSIUM ON SOFTWARE ENGINEERING (SBES), 39., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 678-684. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.11033.