CAGE: An Evaluation Framework for Cache-Augmented Generation Models

Lucas Mariano do Carmo; Wladmir Cardoso Brandão; Henrique Cota de Freitas

doi:10.5753/wperformance.2026.23603

Lucas Mariano do Carmo, PUC Minas
Wladmir Cardoso Brandão, PUC Minas
Henrique Cota de Freitas, PUC Minas

Abstract

Cache-Augmented Generation (CAG) is an emerging design that reduces the cost of repeated prompt processing by reusing previously processed context, yet it lacks a standard evaluation approach. We present CAGE, a framework that combines serving metrics and semantic quality analysis to compare baselines in cache-aware AI systems. CAGE integrates features from vLLM's native prefix caching and evaluates latency, time to first token (TTFT), throughput, and semantic metrics. In our results, native prefix caching reduced latency by 37.4% and TTFT by 65.7% with no loss in faithfulness, whereas Retrieval-Augmented Generation (RAG) increased latency by 70.4% and reduced faithfulness by 11.6%. These results support the usefulness of CAGE as an approach for evaluating cache-aware LLM systems.
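
To make the serving metrics named in the abstract concrete, the following is a minimal sketch of how latency, TTFT, throughput, and a percentage change against a baseline could be computed from per-request timestamps. It is not CAGE's implementation: the `RequestTrace` type, its field names, the helper functions, and the sample numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds for one request (hypothetical field names)."""
    sent_at: float          # request submitted
    first_token_at: float   # first output token received
    done_at: float          # last output token received
    output_tokens: int      # number of generated tokens


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between submission and the first output token."""
    return trace.first_token_at - trace.sent_at


def latency(trace: RequestTrace) -> float:
    """End-to-end latency of the request."""
    return trace.done_at - trace.sent_at


def throughput(trace: RequestTrace) -> float:
    """Output tokens per second over the whole request."""
    return trace.output_tokens / latency(trace)


def percent_change(baseline: float, variant: float) -> float:
    """Signed relative change versus the baseline; negative means a reduction."""
    return (variant - baseline) / baseline * 100.0


# Made-up numbers for illustration only; they are not results from the paper.
no_cache = RequestTrace(sent_at=0.0, first_token_at=0.8, done_at=4.0, output_tokens=128)
with_cache = RequestTrace(sent_at=0.0, first_token_at=0.2, done_at=3.0, output_tokens=128)

ttft_change = percent_change(ttft(no_cache), ttft(with_cache))
latency_change = percent_change(latency(no_cache), latency(with_cache))
```

Under this convention, a figure such as "reduced TTFT by 65.7%" corresponds to a `percent_change` of about -65.7 relative to the chosen baseline. In practice such values would be aggregated over many requests rather than taken from a single pair of traces.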
