On the Practicality of LLM-Based Compiler Fuzzing
Abstract
Recently, Italiano and Cummins introduced an elegant methodology for uncovering performance bugs in compilers. A pre-trained large language model (LLM) generates a seed program, which is then mutated repeatedly to provoke unexpected behavior, even in mainstream compilers. This methodology uncovered previously unknown (zero-day) performance bugs in widely used compilers such as Clang, ICC, and GCC. In reproducing the results reported by Italiano and Cummins, we confirm that their technique outperforms general-purpose LLMs, such as open-source versions of LLaMA and DeepSeek, in identifying compiler performance bugs. However, we also observe that the LLM-based approach lags behind tools like CSmith in terms of throughput (the number of bugs found per unit of time) and latency (the time to discover the first bug). LLMs also require significantly greater computational resources. Although this outcome may seem discouraging, it is important to note that we are comparing novel LLM-based techniques with a mature language-specific fuzzer. As the technology evolves, we expect the performance of LLM-based fuzzing to improve, potentially surpassing traditional methods in the future.
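The two comparison metrics named above can be made concrete with a minimal sketch. It assumes each fuzzing campaign is logged as a list of bug-discovery timestamps, in seconds since the campaign started; the names `CampaignMetrics` and `summarize_campaign` are hypothetical and are not taken from the paper or its artifact.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CampaignMetrics:
    """Summary of one fuzzing campaign (hypothetical structure)."""
    throughput_per_hour: float        # bugs found per hour of fuzzing
    latency_seconds: Optional[float]  # time to first bug; None if no bug was found


def summarize_campaign(bug_times_s: Sequence[float],
                       duration_s: float) -> CampaignMetrics:
    """Compute throughput and latency from bug-discovery timestamps.

    bug_times_s: seconds since campaign start at which each bug was found.
    duration_s:  total campaign length in seconds (must be positive).
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # Ignore timestamps that fall outside the campaign window.
    found = sorted(t for t in bug_times_s if 0 <= t <= duration_s)
    throughput = len(found) / (duration_s / 3600.0)
    latency = found[0] if found else None
    return CampaignMetrics(throughput, latency)
```

Under these assumptions, a one-hour campaign that finds bugs at 600 s, 1800 s, and 3000 s has a throughput of 3 bugs per hour and a latency of 600 s. A campaign that finds nothing has zero throughput and no defined latency.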
References
Junjie Chen, Jibesh Patra, Michael Pradel, Yingfei Xiong, Hongyu Zhang, Dan Hao, and Lu Zhang. 2020. A Survey of Compiler Testing. ACM Comput. Surv. 53, 1, Article 4 (Feb. 2020), 36 pages. DOI: 10.1145/3363562
Thaís Damásio, Vinícius Pacheco, Fabrício Goes, Fernando Pereira, and Rodrigo Rocha. 2021. Inlining for Code Size Reduction. In SBLP (Joinville, Brazil). Association for Computing Machinery, New York, NY, USA, 17–24. DOI: 10.1145/3475061.3475081
Natanael dos Santos Júnior, Gabriel Guimarães dos Santos Ricardo, Fernando Magno Quintão Pereira, and Flavio Figueiredo. 2025. Gagana: On the Practicality of LLM-Based Compiler Fuzzing. Zenodo. DOI: 10.5281/zenodo.16970270
Hongyan Gao, Yibiao Yang, Maolin Sun, Jiangchang Wu, Yuming Zhou, and Baowen Xu. 2025. ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs. In ICSE. IEEE, New York, US, 712–712.
Qiuhan Gu. 2023. LLM-Based Code Generation Method for Golang Compiler Testing. In ESEC/FSE (San Francisco, CA, USA). Association for Computing Machinery, New York, NY, USA, 2201–2203. DOI: 10.1145/3611643.3617850
Kenneth V. Hanford. 1970. Automatic generation of test cases. IBM Systems Journal 9, 4 (1970), 242–257.
Davide Italiano and Chris Cummins. 2025. Finding Missed Code Size Optimizations in Compilers using Large Language Models. In International Conference on Compiler Construction (Las Vegas, NV, USA). Association for Computing Machinery, New York, NY, USA, 81–91. DOI: 10.1145/3708493.3712686
Christian Lindig. 2005. Random testing of C calling conventions. In AADEBUG (Monterey, California, USA). Association for Computing Machinery, New York, NY, USA, 3–12. DOI: 10.1145/1085130.1085132
Vsevolod Livinskii, Dmitry Babokin, and John Regehr. 2020. Random testing for C and C++ compilers with YARPGen. Proc. ACM Program. Lang. 4, OOPSLA, Article 196 (Nov. 2020), 25 pages. DOI: 10.1145/3428264
William M McKeeman. 1998. Differential testing for software. Digital Technical Journal 10, 1 (1998), 100–107.
Xianfei Ou, Cong Li, Yanyan Jiang, and Chang Xu. 2025. The Mutators Reloaded: Fuzzing Compilers with Large Language Model Generated Mutation Operators. In ASPLOS (Hilton La Jolla Torrey Pines, La Jolla, CA, USA). Association for Computing Machinery, New York, NY, USA, 298–312. DOI: 10.1145/3622781.3674171
Paul Purdom. 1972. A sentence generator for testing parsers. BIT Numerical Mathematics 12 (1972), 366–375.
Richard L. Sauder. 1962. A general test data generator for COBOL. In Spring Joint Computer Conference (San Francisco, California) (AIEE-IRE). Association for Computing Machinery, New York, NY, USA, 317–323. DOI: 10.1145/1460833.1460869
Flash Sheridan. 2007. Practical testing of a C99 compiler using output comparison. Softw. Pract. Exper. 37, 14 (Nov. 2007), 1475–1488.
João Victor Amorim Vieira, Luiza de Melo Gomes, Rafael Sumitani, Raissa Maciel, Augusto Mafra, Mirlaine Crepalde, and Fernando Magno Quintão Pereira. 2025. Bottom-Up Generation of Verilog Designs for Testing EDA Tools. arXiv:2504.06295 [cs.AR]
Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4All: Universal Fuzzing with Large Language Models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 126, 13 pages. DOI: 10.1145/3597503.3639121
Chenyuan Yang, Yinlin Deng, Runyu Lu, Jiayi Yao, Jiawei Liu, Reyhaneh Jabbarvand, and Lingming Zhang. 2024. WhiteFox: White-Box Compiler Fuzzing Empowered by Large Language Models. Proc. ACM Program. Lang. 8, OOPSLA2, Article 296 (Oct. 2024), 27 pages. DOI: 10.1145/3689736
Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (San Jose, California, USA) (PLDI ’11). Association for Computing Machinery, New York, NY, USA, 283–294. DOI: 10.1145/1993498.1993532
Chen Zhao, Yunzhi Xue, Qiuming Tao, Liang Guo, and Zhaohui Wang. 2009. Automated test program generation for an industrial optimizing compiler. In Workshop on Automation of Software Test. IEEE, New York, US, 36–43.
