Improving Energy Efficiency in LLM Execution via SLA-Oriented GPU Power Control

  • Alex F. R. Trajano (Instituto Atlântico)
  • Crislane Costa (Instituto Atlântico)
  • Francisco V. J. Nobre (UECE)
  • Rafael L. Gomes (UECE)

Abstract


LLM inference services incur high energy consumption on GPUs while operating under stringent SLAs. This paper proposes an adaptive and reactive GPU power-control algorithm that directly leverages hardware power capping mechanisms, dynamically adjusting power limits based on SLA compliance. The approach is lightweight, operates at runtime, and does not rely on offline profiling or predictive models. An experimental evaluation in a real environment with multiple LLM instances deployed across multiple GPUs shows that fixed power caps can reduce energy consumption but cause severe performance degradation under higher load, whereas the dynamic controller consistently lowers energy usage relative to the baseline and mitigates abrupt performance degradation.
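The reactive control loop summarized above can be sketched as a simple feedback rule: lower the GPU power cap while the SLA is met with margin, and raise it as soon as the SLA is violated. The sketch below is a minimal illustration only, not the paper's exact algorithm; the function name `next_power_cap`, the 0.9 margin factor, and the step and clamp values are assumptions. In a real deployment the returned cap would be applied through a hardware interface such as NVML's `nvmlDeviceSetPowerManagementLimit`.

```python
def next_power_cap(current_cap_w, p99_latency_s, sla_latency_s,
                   min_cap_w=150, max_cap_w=300, step_w=10):
    """Return the next GPU power cap (watts) for one control interval.

    Reactive rule: no offline profiling or prediction, only the
    latest observed tail latency versus the SLA target.
    """
    if p99_latency_s > sla_latency_s:
        # SLA violated: raise the cap immediately to recover performance.
        return min(max_cap_w, current_cap_w + step_w)
    if p99_latency_s < 0.9 * sla_latency_s:
        # Comfortable SLA margin: lower the cap to save energy.
        return max(min_cap_w, current_cap_w - step_w)
    # Within the margin band: hold the current cap.
    return current_cap_w
```

Run once per monitoring interval, this converges toward the lowest cap that still satisfies the SLA, and backs off quickly when load rises — the behavior the evaluation contrasts with fixed power caps.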

Published
2026-05-25

TRAJANO, Alex F. R.; COSTA, Crislane; NOBRE, Francisco V. J.; GOMES, Rafael L. Improving Energy Efficiency in LLM Execution via SLA-Oriented GPU Power Control. In: BRAZILIAN SYMPOSIUM ON COMPUTER NETWORKS AND DISTRIBUTED SYSTEMS (SBRC), 44., 2026, Praia do Forte/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 968-981. ISSN 2177-9384. DOI: https://doi.org/10.5753/sbrc.2026.19707.
