Improving Energy Efficiency in LLM Execution via SLA-Oriented GPU Power Control

  • Alex F. R. Trajano (Instituto Atlântico)
  • Crislane Costa (Instituto Atlântico)
  • Francisco V. J. Nobre (UECE)
  • Rafael L. Gomes (UECE)

Abstract


LLM inference services incur high energy consumption on GPUs while operating under stringent SLAs. This paper proposes an adaptive and reactive GPU power-control algorithm that directly leverages hardware power capping mechanisms, dynamically adjusting power limits based on SLA compliance. The approach is lightweight, operates at runtime, and does not rely on offline profiling or predictive models. An experimental evaluation in a real environment with multiple LLM instances deployed across multiple GPUs shows that fixed power caps can reduce energy consumption but cause severe performance degradation under higher load, whereas the dynamic controller consistently lowers energy usage relative to the baseline and mitigates abrupt performance degradation.
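The reactive control loop summarized above can be sketched as a simple feedback rule: lower the GPU power cap while the SLA is met with margin, and raise it as soon as the SLA is violated. The sketch below is a minimal illustration only, not the paper's exact algorithm; the function name `next_power_cap`, the 0.9 margin factor, and the step and clamp values are assumptions. In a real deployment the returned cap would be applied through a hardware interface such as NVML's `nvmlDeviceSetPowerManagementLimit`.

```python
def next_power_cap(current_cap_w, p99_latency_s, sla_latency_s,
                   min_cap_w=150, max_cap_w=300, step_w=10):
    """Return the next GPU power cap (watts) for one control interval.

    Reactive rule: no offline profiling or prediction, only the
    latest observed tail latency versus the SLA target.
    """
    if p99_latency_s > sla_latency_s:
        # SLA violated: raise the cap immediately to recover performance.
        return min(max_cap_w, current_cap_w + step_w)
    if p99_latency_s < 0.9 * sla_latency_s:
        # Comfortable SLA margin: lower the cap to save energy.
        return max(min_cap_w, current_cap_w - step_w)
    # Within the margin band: hold the current cap.
    return current_cap_w
```

Run once per monitoring interval, this converges toward the lowest cap that still satisfies the SLA, and backs off quickly when load rises — the behavior the evaluation contrasts with fixed power caps.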

Published
2026-05-25

TRAJANO, Alex F. R.; COSTA, Crislane; NOBRE, Francisco V. J.; GOMES, Rafael L. Improving Energy Efficiency in LLM Execution via SLA-Oriented GPU Power Control. In: BRAZILIAN SYMPOSIUM ON COMPUTER NETWORKS AND DISTRIBUTED SYSTEMS (SBRC), 44., 2026, Praia do Forte/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 968-981. ISSN 2177-9384. DOI: https://doi.org/10.5753/sbrc.2026.19707.
