KV-RAPTOR: Scalable Tree-Structured Retrieval with KV Cache Compression for Question-Answering Systems

  • João Gabriel J. da Silva, Universidade Federal de Goiás (UFG), http://orcid.org/0009-0009-7801-7463
  • Sávio S. T. de Oliveira, Universidade Federal de Goiás (UFG)
  • Lucas Alexandria Alves, Universidade Federal de Goiás (UFG)
  • Nicolás Eiris, Panoplai
  • Arlindo R. Galvão Filho, Universidade Federal de Goiás (UFG)

Abstract

This paper introduces KV-RAPTOR, a latency-optimized variant of the RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) pipeline for Retrieval-Augmented Generation (RAG). By integrating CacheGen, a compressed key-value (KV) cache reuse mechanism, into RAPTOR's tree-based index, we show that generation latency can be reduced without sacrificing answer quality. We evaluate the method on English and Portuguese datasets, observing consistent reductions in time-to-first-token and end-to-end latency while preserving answer quality across diverse linguistic and retrieval contexts.

Keywords: Retrieval-Augmented Generation, Latency Optimization, Key-Value Cache, Question-Answering Systems, Large Language Models
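To make the mechanism concrete, the sketch below illustrates the idea the abstract describes: each node of a RAPTOR-style tree stores, alongside its text and embedding, a precomputed KV cache for that text, so that at query time the caches of the retrieved nodes can be decompressed and reused instead of re-running prefill over the retrieved context, which is where the time-to-first-token savings would come from. This is a minimal sketch under stated assumptions, not the authors' implementation: every name in it (TreeNode, precompute_kv, retrieve, answer, encode_kv, generate, past_kv) is illustrative, and zlib-plus-pickle merely stands in for CacheGen's learned, loss-tolerant KV codec.

from dataclasses import dataclass, field
import pickle
import zlib

import numpy as np


@dataclass
class TreeNode:
    # A RAPTOR-style node: leaves hold source chunks, internal nodes hold
    # cluster summaries; both carry an embedding for retrieval.
    text: str
    embedding: np.ndarray
    children: list["TreeNode"] = field(default_factory=list)
    kv_blob: bytes | None = None  # compressed KV cache for `text` (CacheGen-like role)


def precompute_kv(node: TreeNode, encode_kv) -> None:
    # Offline step: run prefill once over the node's text and store the
    # resulting key/value tensors in compressed form. `encode_kv` is a
    # hypothetical, model-specific prefill that returns those tensors;
    # zlib+pickle plays the role of CacheGen's codec here.
    kv_tensors = encode_kv(node.text)
    node.kv_blob = zlib.compress(pickle.dumps(kv_tensors))


def retrieve(query_emb: np.ndarray, nodes: list[TreeNode], k: int = 3) -> list[TreeNode]:
    # Collapsed-tree retrieval as in RAPTOR: rank leaves and summaries
    # together by cosine similarity to the query embedding.
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return sorted(nodes, key=lambda n: cos(query_emb, n.embedding), reverse=True)[:k]


def answer(query: str, query_emb: np.ndarray, nodes: list[TreeNode], generate) -> str:
    # Online step: decompress the retrieved nodes' KV caches and hand them
    # to generation, skipping prefill over the retrieved context. `generate`
    # is a placeholder for an inference call that accepts reused past
    # key/values (analogous to `past_key_values` in HF transformers).
    hits = retrieve(query_emb, nodes)
    kv_caches = [pickle.loads(zlib.decompress(n.kv_blob)) for n in hits if n.kv_blob]
    return generate(query, past_kv=kv_caches)

The trade-off this sketch captures is the one the abstract evaluates: prefill over retrieved context is paid once offline per node rather than once per query, at the cost of storing compressed caches in the index, which is the intended source of the reported time-to-first-token reductions.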

References

Aquino, I., dos Santos, M. M., Dorneles, C., and Carvalho, J. T. (2024). Extracting information from Brazilian legal documents with retrieval-augmented generation. In Anais Estendidos do XXXIX Simpósio Brasileiro de Bancos de Dados, pages 280–287, Porto Alegre, RS, Brazil. SBC.

Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Truitt, S., Metropolitansky, D., Ness, R. O., and Larson, J. (2025). From local to global: A Graph RAG approach to query-focused summarization.

Jiang, C., Gao, L., Zarch, H. E., and Annavaram, M. (2024). Efficient LLM inference with I/O-aware partial KV cache recomputation.

Jimenez Gutierrez, B., Shu, Y., Gu, Y., Yasunaga, M., and Su, Y. (2024). HippoRAG: Neurobiologically inspired long-term memory for large language models. Advances in Neural Information Processing Systems, 37:59532–59569.

Kočiský, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G., and Grefenstette, E. (2018). The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.

Lavie, A. and Agarwal, A. (2007). METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT ’07, page 228–231, USA. Association for Computational Linguistics.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA. Curran Associates Inc.

Li, B., Jiang, Y., Gadepally, V., and Tiwari, D. (2024). LLM inference serving: Survey of recent advances and opportunities. In 2024 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–8.

Li, H., Li, Y., Tian, A., Tang, T., Xu, Z., Chen, X., Hu, N., Dong, W., Li, Q., and Chen, L. (2025). A survey on large language model acceleration based on KV cache management.

Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Liu, A., Liu, J., Pan, Z., He, Y., Haffari, G., and Zhuang, B. (2024a). MiniCache: KV cache compression in depth dimension for large language models. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C., editors, Advances in Neural Information Processing Systems, volume 37, pages 139997–140031. Curran Associates, Inc.

Liu, Y., Li, H., Cheng, Y., Ray, S., Huang, Y., Zhang, Q., Du, K., Yao, J., Lu, S., Ananthanarayanan, G., Maire, M., Hoffmann, H., Holtzman, A., and Jiang, J. (2024b). CacheGen: KV cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM 2024 Conference, ACM SIGCOMM ’24, page 38–56, New York, NY, USA. Association for Computing Machinery.

Nolet, C. J., Lafargue, V., Raff, E., Nanditale, T., Oates, T., Zedlewski, J., and Patterson, J. (2021). Bringing UMAP closer to the speed of light with GPU acceleration.

NVIDIA Corporation (2024). Benchmarking metrics for large language models. [link]. Accessed: 2025-04-17.

Oliveira, V. P. L. (2024). MemoryGraph: uma proposta de memória para agentes conversacionais utilizando grafo de conhecimento. PhD thesis (Computer Science), Universidade Federal de Goiás, Goiânia.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, page 311–318, USA. Association for Computational Linguistics.

Paschoal, A. F. A., Pirozelli, P., Freire, V., Delgado, K. V., Peres, S. M., José, M. M., Nakasato, F., Oliveira, A. S., Brandão, A. A. F., Costa, A. H. R., and Cozman, F. G. (2021). Pirá: A bilingual Portuguese-English dataset for question-answering about the ocean. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM ’21, page 4544–4553, New York, NY, USA. Association for Computing Machinery.

Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Inui, K., Jiang, J., Ng, V., and Wan, X., editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

RunPod (2025). RunPod – cloud compute for AI, ML, and more. Accessed: April 28, 2025.

Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., and Manning, C. D. (2024). RAPTOR: Recursive abstractive processing for tree-organized retrieval. In The Twelfth International Conference on Learning Representations.

Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: pretrained BERT models for Brazilian Portuguese. In 9th Brazilian Conference on Intelligent Systems, BRACIS, Rio Grande do Sul, Brazil, October 20-23.

Taschetto, L. and Fileto, R. (2024). Using retrieval-augmented generation to improve performance of large language models on the Brazilian university admission exam. In Anais do XXXIX Simpósio Brasileiro de Bancos de Dados, pages 799–805, Porto Alegre, RS, Brazil. SBC.

Yao, J., Li, H., Liu, Y., Ray, S., Cheng, Y., Zhang, Q., Du, K., Lu, S., and Jiang, J. (2025). CacheBlend: Fast large language model serving for RAG with cached knowledge fusion. In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys ’25, page 94–109, New York, NY, USA. Association for Computing Machinery.

Yu, H., Gan, A., Zhang, K., Tong, S., Liu, Q., and Liu, Z. (2024). Evaluation of retrieval-augmented generation: A survey.
Published
September 29, 2025
J. DA SILVA, João Gabriel; DE OLIVEIRA, Sávio S. T.; ALVES, Lucas Alexandria; EIRIS, Nicolás; GALVÃO FILHO, Arlindo R. KV-RAPTOR: Scalable Tree-Structured Retrieval with KV Cache Compression for Question-Answering Systems. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 316-329. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2025.247245.