On the Practicality of LLM-Based Compiler Fuzzing

Gabriel Guimarães dos Santos Ricardo; Natanael dos Santos Junior; Flavio Figueiredo; Fernando Magno Quintão Pereira

doi:10.5753/sblp.2025.12264

Gabriel Guimarães dos Santos Ricardo UFMG
Natanael dos Santos Junior UFMG
Flavio Figueiredo UFMG
Fernando Magno Quintão Pereira UFMG


Abstract

Italiano and Cummins recently introduced an elegant methodology for uncovering performance bugs in compilers. Their approach uses a pre-trained large language model (LLM) to generate a seed program and then applies successive mutations designed to provoke unexpected behavior, even in mainstream compilers. This methodology uncovered previously unknown (zero-day) performance bugs in widely used compilers such as Clang, ICC, and GCC. In reproducing the results reported by Italiano and Cummins, we confirm that their technique outperforms general-purpose LLMs, such as open-source versions of LLaMA and DeepSeek, in identifying compiler performance bugs. However, we also observe that the LLM-based approach lags behind tools like CSmith in both throughput (the number of bugs found over time) and latency (the time to discover the first bug), and that it requires significantly greater computational resources. Although this outcome may seem discouraging, it is important to note that we are comparing a novel LLM-based technique with a mature language-specific fuzzer. As the technology evolves, we expect the performance of LLM-based fuzzing to improve, potentially surpassing traditional methods in the future.
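The two comparison metrics named in the abstract can be made concrete with a minimal sketch. It assumes each fuzzing campaign is logged as a list of bug-discovery timestamps (seconds since the campaign started) plus a total duration; the names `CampaignMetrics` and `summarize_campaign`, and this log format, are hypothetical and are not taken from the paper or its artifact.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CampaignMetrics:
    bugs_found: int
    throughput_per_hour: float        # bugs found per hour of fuzzing
    latency_seconds: Optional[float]  # time to the first bug; None if no bug was found


def summarize_campaign(bug_times_s: Sequence[float], duration_s: float) -> CampaignMetrics:
    """Summarize one fuzzing campaign (assumed log format, see lead-in).

    bug_times_s: seconds since campaign start at which each bug was found.
    duration_s:  total campaign length in seconds.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # Ignore timestamps that fall outside the campaign window.
    found = [t for t in bug_times_s if 0 <= t <= duration_s]
    throughput = len(found) / (duration_s / 3600.0)
    latency = min(found) if found else None
    return CampaignMetrics(len(found), throughput, latency)
```

For example, `summarize_campaign([120.0, 900.0, 3000.0], 3600.0)` describes a one-hour campaign that found three bugs, with the first one after two minutes. Comparing two fuzzers under this sketch means running both for the same duration on the same compiler and comparing the resulting records.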

Keywords: Fuzzing, Large Language Model, Compiler

References

Colin J. Burgess and M. Saidi. 1996. The automatic generation of test cases for optimizing Fortran compilers. Information and Software Technology 38, 2 (1996), 111–119.

Junjie Chen, Jibesh Patra, Michael Pradel, Yingfei Xiong, Hongyu Zhang, Dan Hao, and Lu Zhang. 2020. A Survey of Compiler Testing. ACM Comput. Surv. 53, 1, Article 4 (Feb. 2020), 36 pages. DOI: 10.1145/3363562

Thaís Damásio, Vinícius Pacheco, Fabrício Goes, Fernando Pereira, and Rodrigo Rocha. 2021. Inlining for Code Size Reduction. In SBLP (Joinville, Brazil). Association for Computing Machinery, New York, NY, USA, 17–24. DOI: 10.1145/3475061.3475081

Natanael dos Santos Júnior, Gabriel Guimarães dos Santos Ricardo, Fernando Magno Quintão Pereira, and Flavio Figueiredo. 2025. Gagana: On the Practicality of LLM-Based Compiler Fuzzing. Zenodo. DOI: 10.5281/zenodo.16970270

Hongyan Gao, Yibiao Yang, Maolin Sun, Jiangchang Wu, Yuming Zhou, and Baowen Xu. 2025. ClozeMaster: Fuzzing Rust Compiler by Harnessing LLMs for Infilling Masked Real Programs. In ICSE. IEEE, New York, US, 712–712.

Qiuhan Gu. 2023. LLM-Based Code Generation Method for Golang Compiler Testing. In ESEC/FSE (San Francisco, CA, USA). Association for Computing Machinery, New York, NY, USA, 2201–2203. DOI: 10.1145/3611643.3617850

Kenneth V. Hanford. 1970. Automatic generation of test cases. IBM Systems Journal 9, 4 (1970), 242–257.

Davide Italiano and Chris Cummins. 2025. Finding Missed Code Size Optimizations in Compilers using Large Language Models. In International Conference on Compiler Construction (Las Vegas, NV, USA). Association for Computing Machinery, New York, NY, USA, 81–91. DOI: 10.1145/3708493.3712686

Christian Lindig. 2005. Random testing of C calling conventions. In AADEBUG (Monterey, California, USA). Association for Computing Machinery, New York, NY, USA, 3–12. DOI: 10.1145/1085130.1085132

Vsevolod Livinskii, Dmitry Babokin, and John Regehr. 2020. Random testing for C and C++ compilers with YARPGen. Proc. ACM Program. Lang. 4, OOPSLA, Article 196 (Nov. 2020), 25 pages. DOI: 10.1145/3428264

William M. McKeeman. 1998. Differential testing for software. Digital Technical Journal 10, 1 (1998), 100–107.

Xianfei Ou, Cong Li, Yanyan Jiang, and Chang Xu. 2025. The Mutators Reloaded: Fuzzing Compilers with Large Language Model Generated Mutation Operators. In ASPLOS (Hilton La Jolla Torrey Pines, La Jolla, CA, USA). Association for Computing Machinery, New York, NY, USA, 298–312. DOI: 10.1145/3622781.3674171

Paul Purdom. 1972. A sentence generator for testing parsers. BIT Numerical Mathematics 12 (1972), 366–375.

Richard L. Sauder. 1962. A general test data generator for COBOL. In Spring Joint Computer Conference (San Francisco, California) (AIEE-IRE). Association for Computing Machinery, New York, NY, USA, 317–323. DOI: 10.1145/1460833.1460869

Flash Sheridan. 2007. Practical testing of a C99 compiler using output comparison. Softw. Pract. Exper. 37, 14 (Nov. 2007), 1475–1488.

João Victor Amorim Vieira, Luiza de Melo Gomes, Rafael Sumitani, Raissa Maciel, Augusto Mafra, Mirlaine Crepalde, and Fernando Magno Quintão Pereira. 2025. Bottom-Up Generation of Verilog Designs for Testing EDA Tools. arXiv:2504.06295 [cs.AR]

Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4All: Universal Fuzzing with Large Language Models. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering (Lisbon, Portugal) (ICSE ’24). Association for Computing Machinery, New York, NY, USA, Article 126, 13 pages. DOI: 10.1145/3597503.3639121

Chenyuan Yang, Yinlin Deng, Runyu Lu, Jiayi Yao, Jiawei Liu, Reyhaneh Jabbarvand, and Lingming Zhang. 2024. WhiteFox: White-Box Compiler Fuzzing Empowered by Large Language Models. Proc. ACM Program. Lang. 8, OOPSLA2, Article 296 (Oct. 2024), 27 pages. DOI: 10.1145/3689736

Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and understanding bugs in C compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (San Jose, California, USA) (PLDI ’11). Association for Computing Machinery, New York, NY, USA, 283–294. DOI: 10.1145/1993498.1993532

Chen Zhao, Yunzhi Xue, Qiuming Tao, Liang Guo, and Zhaohui Wang. 2009. Automated test program generation for an industrial optimizing compiler. In Workshop on Automation of Software Test. IEEE, New York, US, 36–43.