Quantifying the RAG Advantage: A Multi-Metric Benchmark for LLM-based Code Generation
Abstract
Recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities in solving programming challenges. Despite this proficiency, however, LLMs often hallucinate and perform poorly on unfamiliar or complex tasks. Retrieval-Augmented Generation (RAG) has emerged as a promising way to mitigate these limitations by supplementing prompts with relevant external information. In this paper, we propose a benchmark to assess the efficacy of RAG in solving algorithmic problems, built on a curated database of 120 LeetCode problems, each paired with a corresponding solution and explanation. An Information Retrieval (IR) system is used to construct enhanced prompts for solving novel problems.
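The retrieval-augmented prompting described above can be illustrated with a minimal sketch. This is not the paper's implementation: the bag-of-words cosine retriever, the `build_prompt` helper, and the two-entry database standing in for the 120 LeetCode problems are all simplifying assumptions for illustration; the paper's IR system may use a different representation and ranking.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token lists (bag-of-words counts)."""
    ca, cb = Counter(a), Counter(b)
    num = sum(ca[t] * cb[t] for t in set(ca) & set(cb))
    den = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return num / den if den else 0.0


def build_prompt(query, database, k=1):
    """Retrieve the k most similar solved problems and prepend them as context."""
    q = tokenize(query)
    ranked = sorted(
        database, key=lambda e: cosine(q, tokenize(e["problem"])), reverse=True
    )
    context = "\n\n".join(
        f"Problem: {e['problem']}\nSolution: {e['solution']}" for e in ranked[:k]
    )
    return f"{context}\n\nNew problem: {query}\nWrite a solution."


# Hypothetical two-entry database standing in for the curated problem set.
db = [
    {"problem": "two sum: find indices of two numbers adding to a target",
     "solution": "single-pass hash map lookup"},
    {"problem": "reverse a linked list in place",
     "solution": "iterative pointer reversal"},
]
prompt = build_prompt("given an array, find two numbers that sum to a target", db, k=1)
```

The resulting `prompt` pairs the most relevant solved problem with the new task, so the generating LLM can ground its answer in a worked example rather than relying on parametric memory alone.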
