Inference Optimization for LLMs on CPUs: Analysis of the Current Landscape

  • Pedro Cattai (Unesp)
  • Alexandro Baldassin (Unesp)
  • Allberson Dantas (UNILAB)

Abstract


Advances in Artificial Intelligence (AI), particularly in Large Language Models (LLMs), have highlighted the challenge of efficient inference in resource-constrained environments. While GPUs are the preferred hardware for LLM inference, their limited accessibility motivates the exploration of CPU-based alternatives. This work presents a study of optimizations for LLM inference on CPUs, focusing on memory manipulation techniques, since memory access is a primary bottleneck for inference on this class of hardware. By addressing this bottleneck, we aim to improve inference efficiency without requiring high-end hardware. This paper outlines the current progress of the research, and the proposed methodology lays the groundwork for future experiments, with the potential to broaden the accessibility of LLMs.
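
The abstract does not detail which memory manipulation techniques are under study, but the sketch below illustrates one widely used memory-oriented optimization for CPU inference: per-row int8 weight quantization, which cuts the bytes moved per matrix-vector product by roughly 4x. This is an illustrative example only, written in Python with NumPy; the helper names quantize_rows and qmatvec are ours for exposition and do not correspond to the paper's implementation.

    import numpy as np

    def quantize_rows(w):
        # Per-row int8 quantization: store 1 byte per weight plus one
        # float32 scale per row, roughly a 4x cut in weight memory traffic.
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        scale[scale == 0] = 1.0  # guard against all-zero rows
        q = np.round(w / scale).astype(np.int8)
        return q, scale.astype(np.float32)

    def qmatvec(q, scale, x):
        # Dequantize on the fly: y = (q * scale) @ x. When the mat-vec is
        # memory-bound, reading int8 weights instead of float32 is the win.
        return (q.astype(np.float32) @ x) * scale.ravel()

    # Tiny self-check: the quantized mat-vec stays close to the float32 result.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((256, 256)).astype(np.float32)
    x = rng.standard_normal(256).astype(np.float32)
    q, s = quantize_rows(w)
    print(np.max(np.abs(qmatvec(q, s, x) - w @ x)))

In practice, CPU inference runtimes such as llama.cpp combine low-bit weight formats like this with cache-friendly memory layouts, so that memory bandwidth, rather than arithmetic throughput, stops being the limiting factor per generated token.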

Published: 2025-05-28

CATTAI, Pedro; BALDASSIN, Alexandro; DANTAS, Allberson. Inference Optimization for LLMs on CPUs: Analysis of the Current Landscape. In: REGIONAL SCHOOL OF HIGH PERFORMANCE COMPUTING FROM SÃO PAULO (ERAD-SP), 16., 2025, São José do Rio Preto/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 78-81. DOI: https://doi.org/10.5753/eradsp.2025.9731.