Performance analysis of LLMs using RAG techniques in resource-constrained hardware scenarios

  • Gabriela Malveira UFAM
  • Kaike Maciel UFAM
  • João Alfredo Bessa UFAM
  • Ricardo Miranda Filho UFAM
  • Rosiane de Freitas UFAM

Abstract


This paper analyzes the performance of embedded Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) techniques in hardware-constrained scenarios. Metrics such as response time, memory usage, and token throughput were evaluated on resource-limited devices. The experiments indicate that smaller, quantized models offer the best balance between latency and throughput, while RAG implementations require optimizations such as pre-indexing to be effective in edge computing. Practical limitations, including token limits and memory bottlenecks, are also explored; the RAG pipeline was automated with a proprietary platform, which improves the scalability and reproducibility of tests on limited hardware. The results demonstrate the feasibility of running LLMs with RAG on constrained devices, contributing to the still scarce literature on the subject.
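The pre-indexing optimization mentioned above can be illustrated with a minimal sketch: document embeddings are computed once, before any query arrives, so each query only pays for one embedding plus similarity lookups. This is an assumption-laden toy (bag-of-words vectors and cosine similarity stand in for a real embedding model and vector store such as FAISS), not the paper's actual pipeline.

```python
# Toy pre-indexed retrieval sketch; illustrative only, not the authors' pipeline.
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; an edge deployment would use a
    # small sentence-embedding model here instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class PreIndexedStore:
    def __init__(self, docs):
        # Pre-indexing: embeddings are built once at startup, so queries
        # on constrained hardware avoid re-embedding the corpus.
        self.docs = docs
        self.index = [embed(d) for d in docs]

    def retrieve(self, query, k=1):
        q = embed(query)
        ranked = sorted(range(len(self.docs)),
                        key=lambda i: cosine(q, self.index[i]),
                        reverse=True)
        return [self.docs[i] for i in ranked[:k]]

store = PreIndexedStore([
    "quantized models reduce memory usage on edge devices",
    "retrieval augmented generation grounds answers in documents",
])
print(store.retrieve("memory usage of quantized models", k=1)[0])
```

The retrieved passages would then be prepended to the LLM prompt; on limited hardware the index build is the expensive step, which is why it pays to do it offline.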

Keywords: Large Language Models, RAG, Limited Hardware

References

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “GPTQ: Accurate post-training quantization for generative pre-trained transformers,” arXiv preprint arXiv:2210.17323, 2022.

P. Lewis, E. Perez et al., “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in NeurIPS, 2020.

A. Vaswani, N. Shazeer, N. Parmar et al., “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.

G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.

H. Zhao, J. Liu, K. Nguyen et al., “TinyLlama: Efficient transformer models for edge deployment,” arXiv preprint arXiv:2305.16420, 2023.

N. Thakur, J. Lin et al., “BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models,” arXiv preprint arXiv:2104.08663, 2021.

J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” arXiv preprint arXiv:1702.08734, 2019.

Y. Chen, C. Wu, R. Sui, and J. Zhang, “Feasibility study of edge computing empowered by artificial intelligence—a quantitative analysis based on large models,” Big Data and Cognitive Computing, vol. 8, no. 8, 2024. [Online]. Available: [link]

K. Feng, L. Luo, Y. Xia, B. Luo, X. He, K. Li, Z. Zha, B. Xu, and K. Peng, “Optimizing microservice deployment in edge computing with large language models: Integrating retrieval augmented generation and chain of thought techniques,” Symmetry, vol. 16, no. 11, 2024. [Online]. Available: [link]

B. Jin, J. Yoon, J. Han, and S. O. Arik, “Long-context LLMs meet RAG: Overcoming challenges for long inputs in RAG,” arXiv preprint arXiv:2410.05983, 2024.
Published
2025-11-24

MALVEIRA, Gabriela; MACIEL, Kaike; BESSA, João Alfredo; MIRANDA FILHO, Ricardo; FREITAS, Rosiane de. Performance analysis of LLMs using RAG techniques in resource-constrained hardware scenarios. In: FULL PAPERS - BRAZILIAN SYMPOSIUM ON COMPUTING SYSTEMS ENGINEERING (SBESC), 15., 2025, Campinas/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1-6. ISSN 2763-9002. DOI: https://doi.org/10.5753/sbesc_estendido.2025.15675.