Analyzing HPC Monitoring Data With a View Towards Efficient Resource Utilization

  • Samuel Maloney, Jülich Supercomputing Centre
  • Estela Suarez, Jülich Supercomputing Centre / University of Bonn
  • Norbert Eicker, Jülich Supercomputing Centre / University of Wuppertal
  • Filipe Guimarães, Jülich Supercomputing Centre
  • Wolfgang Frings, Jülich Supercomputing Centre

Abstract


Compute nodes in modern HPC systems are growing in size and their hardware has become ever more diverse. Still, many HPC centers allocate the resources of full nodes exclusively to avoid contention, despite the associated risk of underutilization. This paper describes a thorough resource utilization study of CPU and GPU compute and memory capacity, and interconnect bandwidth on JUWELS, a mature leadership-class modular supercomputer, with the aim of identifying opportunities for improving utilization through advanced scheduling and node sharing. Separate analysis of CPU-only and GPU-accelerated nodes finds that CPU compute usage is already close to optimal for the CPU-only nodes, whereas there is plenty of scope for co-scheduling CPU-based jobs on GPU-accelerated nodes. Memory capacity and node-level interconnect bandwidth are sufficient to provision co-scheduled jobs. We analyze multiple one-month datasets to validate robustness of conclusions over time and compare with previous studies on other systems to establish generalizability of results.
Keywords: Runtime, High performance computing, Graphics processing units, Bandwidth, Computer architecture, Supercomputers, Robustness, Hardware, Resource management, Monitoring, High performance computing (HPC), Dynamic/Adaptive scheduling, Predictive analytics
Published
13 November 2024
MALONEY, Samuel; SUAREZ, Estela; EICKER, Norbert; GUIMARÃES, Filipe; FRINGS, Wolfgang. Analyzing HPC Monitoring Data With a View Towards Efficient Resource Utilization. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 36., 2024, Hilo/Hawaii. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 170-181.