Repairing DeFi Vulnerabilities: Benchmarking LLMs with Executable Solidity Exploits

  • Lucas Bastos Germano (IME)
  • Julio Cesar Duarte (IME)

Abstract

Decentralized finance protocols are frequently exploited, creating demand for fast, reliable repair of vulnerable smart contracts and for validation that reflects runtime security. Large language models are an emerging source of patches, yet many evaluations rely on manual checks or self-assessment, which cannot confirm whether attacker profit is actually prevented. We introduce an executable benchmark that replays verified real-world exploits against patched Solidity contracts under a resilient protocol that permits alternate attack paths and controlled state variation. Our framework compiles candidate patches, deploys them on a forked chain, and tests whether the exploit still yields profit. The benchmark covers six test cases drawn from reproducible incidents and is released as open source. Among the nine evaluated models, GPT-5, GPT-4.1, and Claude Opus 4.1 performed best, each mitigating four of the six test cases. Microsoft Phi-4 was the most reliable open-source model, mitigating two of six exploits and producing compilable patches for the remaining cases. No model mitigated the H2O case once resilient checks were enabled, while a simpler access-control flaw, BTNFT, was often repaired with minimal edits. Grounding validation in executable exploit replay provides a precise and scalable method to measure whether proposed repairs harden contracts at runtime.
Keywords: large language models, vulnerability repair, DeFi, automated testing

Published
2025-11-10
GERMANO, Lucas Bastos; DUARTE, Julio Cesar. Repairing DeFi Vulnerabilities: Benchmarking LLMs with Executable Solidity Exploits. In: BRAZILIAN WORKSHOP ON WEB3 SYSTEMS - BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 31., 2025, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 311-315. ISSN 2596-1683. DOI: https://doi.org/10.5753/webmedia_estendido.2025.16331.