Quantifying the RAG Advantage: A Multi-Metric Benchmark for LLM-based Code Generation
Abstract
Recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities in solving programming challenges. Despite this proficiency, however, LLMs often hallucinate and perform poorly on unfamiliar or complex tasks. Retrieval-Augmented Generation (RAG) has emerged as a promising way to mitigate these limitations by supplementing prompts with relevant external information. In this paper, we propose a benchmark to assess the efficacy of RAG in solving algorithmic problems, built on a curated database of 120 LeetCode problems, each paired with a corresponding solution and explanation. An Information Retrieval (IR) system is used to construct enhanced prompts for solving novel problems.
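The retrieval-augmented prompting described above can be illustrated with a minimal sketch. This is not the paper's implementation: the bag-of-words cosine retriever, the `build_prompt` helper, and the two-entry database standing in for the 120 LeetCode problems are all simplifying assumptions for illustration; the paper's IR system may use a different representation and ranking.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token lists (bag-of-words counts)."""
    ca, cb = Counter(a), Counter(b)
    num = sum(ca[t] * cb[t] for t in set(ca) & set(cb))
    den = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return num / den if den else 0.0


def build_prompt(query, database, k=1):
    """Retrieve the k most similar solved problems and prepend them as context."""
    q = tokenize(query)
    ranked = sorted(
        database, key=lambda e: cosine(q, tokenize(e["problem"])), reverse=True
    )
    context = "\n\n".join(
        f"Problem: {e['problem']}\nSolution: {e['solution']}" for e in ranked[:k]
    )
    return f"{context}\n\nNew problem: {query}\nWrite a solution."


# Hypothetical two-entry database standing in for the curated problem set.
db = [
    {"problem": "two sum: find indices of two numbers adding to a target",
     "solution": "single-pass hash map lookup"},
    {"problem": "reverse a linked list in place",
     "solution": "iterative pointer reversal"},
]
prompt = build_prompt("given an array, find two numbers that sum to a target", db, k=1)
```

The resulting `prompt` pairs the most relevant solved problem with the new task, so the generating LLM can ground its answer in a worked example rather than relying on parametric memory alone.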
