Characterizing the Impact of Job Execution on the Occurrence of Memory Failures on a Petascale HPC System

  • Scott Levy Center for Computing Research Sandia National Laboratories
  • Josh Hemmert Center for Computing Research Sandia National Laboratories
  • Kurt Ferreira Center for Computing Research Sandia National Laboratories
  • Kevin Pedretti Center for Computing Research Sandia National Laboratories

Abstract


Characterizing the reliability of current and recent high-performance computing (HPC) systems is critical for forecasting how future systems may behave and for informing the design of fault tolerance mechanisms. Although memory failures have been studied before, few studies consider their occurrence in the broader context of system power, temperature, and the execution of user jobs. In this paper, we combine job data with existing power, temperature, and memory failure data collected on a petascale HPC system. By focusing on periods when jobs were running on the system, we identified trends that were not evident in earlier studies of the same data. The inclusion of job data also demonstrated how user behavior can affect the occurrence of memory failures. In conjunction with this paper, we have publicly released the job data used in our analysis to complement existing publicly available data from Astra regarding power, temperature, and the occurrence of correctable memory failures.
Keywords: Temperature distribution, High performance computing, Random access memory, Graphics processing units, Focusing, Reliability engineering, Market research, Hardware, Object recognition, Forecasting, high-performance computing, fault tolerance, memory
Published
13/11/2024
LEVY, Scott; HEMMERT, Josh; FERREIRA, Kurt; PEDRETTI, Kevin. Characterizing the Impact of Job Execution on the Occurrence of Memory Failures on a Petascale HPC System. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 36., 2024, Hilo/Hawaii. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 116-126.