Accelerating RAG Systems: A Performance-Oriented Systematic Mapping

  • João Gabriel J. da Silva (UFG)
  • Sávio S. T. de Oliveira (UFG)
  • Arlindo R. Galvão Filho (UFG)

Abstract


Optimizing latency and inference efficiency has become critical for the deployment of Retrieval-Augmented Generation (RAG) systems in production environments. While recent methods have explored GPU acceleration, key-value (KV) caching, and hierarchical indexing, their impact remains fragmented across studies. This paper presents a performance-oriented systematic mapping of optimization techniques targeting time-to-first-token (TTFT), latency, and caching efficiency metrics in RAG pipelines. Following the Kitchenham protocol and operationalized through the Parsifal platform, 34 studies were selected from an initial pool of 147. The analysis reveals growing research focus on prefix-aware KV reuse, asynchronous retrieval, and GPU-accelerated decoding, but also highlights gaps in unified evaluation and multilingual scalability. The findings provide a consolidated view of current strategies and identify research directions for scalable, latency-sensitive RAG systems.
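To make the abstract's central metrics concrete, the sketch below (not from the paper) shows how TTFT and end-to-end latency are typically measured per request against a streaming generation endpoint. `generate_stream` is a hypothetical stand-in for any token-streaming LLM serving API; in a real RAG pipeline, retrieval plus prompt prefill dominates the time before the first token, which is exactly what TTFT captures.

```python
import time

def generate_stream(prompt):
    # Hypothetical stand-in for a streaming LLM endpoint: yields tokens
    # one at a time. In a real RAG system, retrieval and prefill would
    # delay the first yield; decoding governs the remaining tokens.
    for token in ["Retrieval", "-augmented", " answer", "."]:
        time.sleep(0.01)  # simulated per-token decode cost
        yield token

def measure_request(prompt):
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in generate_stream(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # time-to-first-token
        tokens.append(token)
    total = time.perf_counter() - start  # end-to-end latency
    return ttft, total, "".join(tokens)

ttft, total, text = measure_request("What is RAG?")
print(f"TTFT: {ttft * 1000:.1f} ms, total: {total * 1000:.1f} ms")
```

TTFT is always at most the end-to-end latency; optimizations surveyed here (KV reuse, asynchronous retrieval) mostly attack the gap between request arrival and the first token.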

References

Fan, W., Ding, Y., Ning, L., Wang, S., Li, H., Yin, D., Chua, T.-S., and Li, Q. (2024). A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’24, pages 6491–6501. Association for Computing Machinery.

Gao, S., Chen, Y., and Shu, J. (2025). Fast State Restoration in LLM Serving with HCache. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, pages 128–143, New York, NY, USA. Association for Computing Machinery.

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., and Wang, H. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997.

Gim, I., Chen, G., Lee, S.-s., Sarda, N., Khandelwal, A., and Zhong, L. (2024). Prompt Cache: Modular Attention Reuse for Low-Latency Inference. In Gibbons, P., Pekhimenko, G., and De Sa, C., editors, Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference.

Gu, J. (2025). A Research of Challenges and Solutions in Retrieval Augmented Generation (RAG) Systems. Highlights in Science, Engineering and Technology, 124:132–138.

Haddaway, N. R., Page, M. J., Pritchard, C. C., and McGuinness, L. A. (2022). PRISMA2020: An R Package and Shiny App for Producing PRISMA 2020-Compliant Flow Diagrams. Campbell Systematic Reviews, 18:e1230. DOI: 10.1002/cl2.1230.

Hofstätter, S., Chen, J., Raman, K., and Zamani, H. (2023). FiD-Light: Efficient and Effective Retrieval-Augmented Text Generation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, pages 1437–1447. Association for Computing Machinery.

Hu, J., Huang, W., Wang, H., Wang, W., Hu, T., Zhang, Q., Feng, H., Chen, X., Shan, Y., and Xie, T. (2024). EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models. arXiv preprint arXiv:2410.15332.

Jiang, W., Subramanian, S., Graves, C., Alonso, G., Yazdanbakhsh, A., and Dadu, V. (2025a). RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving. In Proceedings of the 52nd Annual International Symposium on Computer Architecture, ISCA ’25.

Jiang, W., Zeller, M., Waleffe, R., Hoefler, T., and Alonso, G. (2023). Chameleon: A Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models. Proceedings of the VLDB Endowment, 18(1):42–52.

Jiang, W., Zhang, S., Han, B., Wang, J., Wang, B., and Kraska, T. (2025b). PipeRAG: Fast Retrieval-Augmented Generation via Adaptive Pipeline Parallelism. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1, KDD ’25, pages 589–600. Association for Computing Machinery.

Jin, C., Zhang, Z., Jiang, X., Liu, F., Liu, X., Liu, X., and Jin, X. (2024). RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation. arXiv preprint arXiv:2404.12457.

Kitchenham, B. and Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering. Technical Report EBSE-2007-01, Keele University and Durham University.

Kuratomi, G., Pirozelli, P., Cozman, F., and Peres, S. (2024). A RAG-Based Institutional Assistant. In Anais do XXI Encontro Nacional de Inteligência Artificial e Computacional, pages 755–766, Porto Alegre, RS, Brasil. SBC.

Lavie, A. and Agarwal, A. (2007). METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. In Callison-Burch, C., Koehn, P., Fordyce, C. S., and Monz, C., editors, Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231. Association for Computational Linguistics.

Lee, K.-H., Park, E., Han, D., and Na, S.-H. (2025). CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation. arXiv preprint arXiv:2502.11101.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20. Curran Associates Inc.

Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81. Association for Computational Linguistics.

Liu, A., Liu, J., Pan, Z., He, Y., Haffari, G., and Zhuang, B. (2024a). MiniCache: KV Cache Compression in Depth Dimension for Large Language Models. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C., editors, Advances in Neural Information Processing Systems, volume 37, pages 139997–140031. Curran Associates, Inc.

Liu, Y., Li, H., Cheng, Y., Ray, S., Huang, Y., Zhang, Q., Du, K., Yao, J., Lu, S., Ananthanarayanan, G., Maire, M., Hoffmann, H., Holtzman, A., and Jiang, J. (2024b). CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. In Proceedings of the ACM SIGCOMM 2024 Conference, pages 38–56, New York, NY, USA. ACM.

NVIDIA Corporation (2024). Benchmarking Metrics for Large Language Models. Accessed: 2025-04-17.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, USA. Association for Computational Linguistics.

Parsifal Team (2016). Parsifal - A Platform for Systematic Literature Reviews. Parsifal.

Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., and Manning, C. D. (2024). RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. In Proceedings of the ICLR 2024 Conference.

Yao, J., Li, H., Liu, Y., Ray, S., Cheng, Y., Zhang, Q., Du, K., Lu, S., and Jiang, J. (2025). CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. In Proceedings of the Twentieth European Conference on Computer Systems, pages 94–109, New York, NY, USA. ACM.

Yu, H., Gan, A., Zhang, K., Tong, S., Liu, Q., and Liu, Z. (2024). Evaluation of Retrieval-Augmented Generation: A Survey. In 12th CCF Conference, BigData 2024, Qingdao, China, August 9–11, 2024, Proceedings.

Zhao, S., Hu, J., Huang, R., Zheng, J., and Chen, G. (2023). MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving. arXiv preprint arXiv:2405.03085.

Zhu, J., Wu, H., Wang, H., Li, Y., Hou, B., Li, R., and Zhai, J. (2025). FastCache: Optimizing Multimodal LLM Serving through Lightweight KV-Cache Compression Framework.
Published
29/09/2025
SILVA, João Gabriel J. da; OLIVEIRA, Sávio S. T. de; GALVÃO FILHO, Arlindo R. Accelerating RAG Systems: A Performance-Oriented Systematic Mapping. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 237-248. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.12304.
