Improving Energy Efficiency in LLM Execution via SLA-Driven GPU Power Control

Alex F. R. Trajano; Crislane Costa; Francisco V. J. Nobre; Rafael L. Gomes

doi:10.5753/sbrc.2026.19707

Alex F. R. Trajano (Instituto Atlântico)
Crislane Costa (Instituto Atlântico)
Francisco V. J. Nobre (UECE)
Rafael L. Gomes (UECE)

Abstract

LLM inference services consume large amounts of energy on GPUs and operate under demanding SLAs. This work proposes an adaptive, reactive power-control algorithm for GPUs that acts directly on power-capping mechanisms, dynamically adjusting the power limit according to SLA compliance. The approach has low overhead, operates at runtime, and does not depend on offline profiles or predictive models. An experimental evaluation in a real environment with multiple LLM instances across multiple GPUs shows that fixed power limits can reduce energy consumption but cause severe performance degradation under higher load, whereas dynamic control consistently reduces consumption relative to the baseline and mitigates abrupt performance degradation.
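
The abstract does not specify the control rule, so the following is only a minimal sketch of what a reactive, SLA-driven power-cap loop could look like, not the authors' algorithm. It assumes a simple step-based policy: raise the cap when the observed latency violates the SLA target, lower it when there is comfortable slack, and hold it otherwise. The class name `PowerCapController`, the `slack` threshold, and all wattage values are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class PowerCapController:
    """Reactive power-cap controller (illustrative; not the paper's algorithm)."""
    min_watts: float      # lowest cap the GPU accepts (hypothetical value)
    max_watts: float      # default / maximum cap (hypothetical value)
    step_watts: float     # how far the cap moves per control interval
    slack: float = 0.8    # lower the cap only when latency < slack * SLA target

    def next_cap(self, current_cap: float, latency_s: float, sla_s: float) -> float:
        """Return the power cap to apply for the next control interval."""
        if latency_s > sla_s:
            # SLA violated: give the GPU more power headroom.
            proposed = current_cap + self.step_watts
        elif latency_s < self.slack * sla_s:
            # Comfortable margin below the target: try to save energy.
            proposed = current_cap - self.step_watts
        else:
            # Close to the target: hold the cap to avoid oscillation.
            proposed = current_cap
        # Never leave the range the hardware supports.
        return max(self.min_watts, min(self.max_watts, proposed))


def simulate(controller, start_cap, latencies_s, sla_s):
    """Feed a sequence of observed latencies and collect the resulting caps."""
    caps = []
    cap = start_cap
    for latency in latencies_s:
        cap = controller.next_cap(cap, latency, sla_s)
        caps.append(cap)
    return caps


if __name__ == "__main__":
    controller = PowerCapController(min_watts=150, max_watts=300, step_watts=25)
    # Synthetic latencies in seconds against a 1.0 s SLA target.
    print(simulate(controller, 300, [0.5, 0.5, 1.2, 0.9, 0.5], sla_s=1.0))
```

In a real deployment the returned cap would still have to be applied to the device, for example with `nvidia-smi -pl <watts>` or NVML's `nvmlDeviceSetPowerManagementLimit` (which takes milliwatts). Both typically require administrative privileges and are deliberately left out here so that the sketch runs without a GPU.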
