Inference Optimization for LLMs on CPUs: Analysis of the Current Landscape

  • Pedro Cattai (Unesp)
  • Alexandro Baldassin (Unesp)
  • Allberson Dantas (UNILAB)

Abstract


Advances in Artificial Intelligence (AI), particularly in Large Language Models (LLMs), have highlighted the challenge of efficient inference in resource-constrained environments. While GPUs are the preferred hardware for LLM inference, their limited accessibility motivates the exploration of CPU-based alternatives. This work presents a study of optimizations for LLM inference on CPUs, focusing on memory manipulation techniques, since memory access is a primary bottleneck for inference on this class of hardware. By addressing this bottleneck, we aim to improve inference efficiency without requiring high-end hardware. This paper outlines the current progress of the research, and the proposed methodology lays the groundwork for future experiments, with the potential to broaden the accessibility of LLMs.
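
The abstract does not detail which memory manipulation techniques are under study, but the sketch below illustrates one widely used memory-oriented optimization for CPU inference: per-row int8 weight quantization, which cuts the bytes moved per matrix-vector product by roughly 4x. This is an illustrative example only, written in Python with NumPy; the helper names quantize_rows and qmatvec are ours for exposition and do not correspond to the paper's implementation.

    import numpy as np

    def quantize_rows(w):
        # Per-row int8 quantization: store 1 byte per weight plus one
        # float32 scale per row, roughly a 4x cut in weight memory traffic.
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        scale[scale == 0] = 1.0  # guard against all-zero rows
        q = np.round(w / scale).astype(np.int8)
        return q, scale.astype(np.float32)

    def qmatvec(q, scale, x):
        # Dequantize on the fly: y = (q * scale) @ x. When the mat-vec is
        # memory-bound, reading int8 weights instead of float32 is the win.
        return (q.astype(np.float32) @ x) * scale.ravel()

    # Tiny self-check: the quantized mat-vec stays close to the float32 result.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((256, 256)).astype(np.float32)
    x = rng.standard_normal(256).astype(np.float32)
    q, s = quantize_rows(w)
    print(np.max(np.abs(qmatvec(q, s, x) - w @ x)))

In practice, CPU inference runtimes such as llama.cpp combine low-bit weight formats like this with cache-friendly memory layouts, so that memory bandwidth, rather than arithmetic throughput, stops being the limiting factor per generated token.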

Published: 2025-05-28

CATTAI, Pedro; BALDASSIN, Alexandro; DANTAS, Allberson. Inference Optimization for LLMs on CPUs: Analysis of the Current Landscape. In: REGIONAL SCHOOL OF HIGH PERFORMANCE COMPUTING FROM SÃO PAULO (ERAD-SP), 16., 2025, São José do Rio Preto/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 78-81. DOI: https://doi.org/10.5753/eradsp.2025.9731.