Evaluating Semantic Caching in Practice: A Study of an LLM-Driven Distributed Application at a Brazilian EdTech
Abstract
Large Language Models (LLMs) support various business functions, such as Alura's use of GPT-4 to assess students' answers. However, high computational and financial costs, along with high response latency, limit scalability. While caching offers an alternative, traditional caches match queries exactly rather than semantically, leading to low hit rates. This work evaluates semantic caching on an Alura dataset of 94,913 answers from 20,639 students. Results show that 45.1% of LLM requests could be served from the cache, significantly reducing costs and improving response times by 4–12× for cache hits.
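To make the contrast with exact-match caching concrete, the sketch below shows the core lookup step of a semantic cache: queries are embedded as vectors, and a cached response is reused when the cosine similarity between the new query's embedding and a stored one exceeds a threshold. This is a minimal illustration and not the system evaluated in this work; the externally supplied query embeddings, the 0.9 threshold, and the linear scan are assumptions made for the example.

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class SemanticCache:
    """Toy semantic cache: reuse a stored response when a new query's
    embedding is close enough to a previously seen query's embedding."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold   # minimum similarity for a cache hit
        self.entries = []            # list of (embedding, cached_response)

    def lookup(self, query_embedding):
        # Linear scan for the most similar cached query; a real system
        # would use an approximate nearest-neighbor index instead.
        best_score, best_response = -1.0, None
        for embedding, response in self.entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def insert(self, query_embedding, response):
        self.entries.append((query_embedding, response))

In practice, the linear scan is replaced by an approximate nearest-neighbor index such as HNSW so lookups stay fast as the cache grows, and the similarity threshold trades hit rate against the risk of serving a semantically mismatched cached response.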
