Evaluating Semantic Caching in Practice: A Study of an LLM-Driven Distributed Application at a Brazilian EdTech
Abstract
Large Language Models (LLMs) support various business functions, such as Alura's use of GPT-4 to assess students' answers. However, high computational and financial costs, along with high response latency, limit scalability. While caching offers an alternative, traditional caches match queries exactly rather than semantically, leading to low hit rates. This work evaluates semantic caching on an Alura dataset of 94,913 answers from 20,639 students. Results show that 45.1% of LLM requests could be served from the cache, significantly reducing costs and improving response times by 4–12× for cache hits.
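To make the contrast with exact-match caching concrete, the sketch below shows the core lookup step of a semantic cache: queries are embedded as vectors, and a cached response is reused when the cosine similarity between the new query's embedding and a stored one exceeds a threshold. This is a minimal illustration and not the system evaluated in this work; the externally supplied query embeddings, the 0.9 threshold, and the linear scan are assumptions made for the example.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    """Toy semantic cache: reuse a stored response when a new query's
    embedding is close enough to a previously seen query's embedding."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold   # minimum similarity for a cache hit
        self.entries = []            # list of (embedding, cached_response)

    def lookup(self, query_embedding):
        # Linear scan for the most similar cached query; a real system
        # would use an approximate nearest-neighbor index instead.
        best_score, best_response = -1.0, None
        for embedding, response in self.entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def insert(self, query_embedding, response):
        self.entries.append((query_embedding, response))

In practice, the linear scan is replaced by an approximate nearest-neighbor index such as HNSW so lookups stay fast as the cache grows, and the similarity threshold trades hit rate against the risk of serving a semantically mismatched cached response.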
