Characterizing the Impact of Job Execution on the Occurrence of Memory Failures on a Petascale HPC System
Abstract
Characterizing the reliability of current and recent high-performance computing (HPC) systems is critical for forecasting how future systems may behave and for informing the design of fault tolerance mechanisms. Although memory failures have been studied before, few studies consider their occurrence in the broader context of system power, temperature, and the execution of user jobs. In this paper, we combine job data with existing power, temperature, and memory failure data collected on Astra, a petascale HPC system. By focusing on periods when jobs were running on the system, we identified trends that were not evident in earlier studies of the same data. The inclusion of job data also demonstrated how user behavior can affect the occurrence of memory failures. In conjunction with this paper, we have publicly released the job data used here to complement existing publicly available data from Astra on power, temperature, and the occurrence of correctable memory failures.
Keywords:
Temperature distribution, High performance computing, Random access memory, Graphics processing units, Focusing, Reliability engineering, Market research, Hardware, Object recognition, Forecasting, high-performance computing, fault tolerance, memory
Published
13/11/2024
How to Cite
LEVY, Scott; HEMMERT, Josh; FERREIRA, Kurt; PEDRETTI, Kevin. Characterizing the Impact of Job Execution on the Occurrence of Memory Failures on a Petascale HPC System. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 36., 2024, Hilo/Hawaii. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 116-126.