Characterizing the Impact of Job Execution on the Occurrence of Memory Failures on a Petascale HPC System

Scott Levy; Josh Hemmert; Kurt Ferreira; Kevin Pedretti

Scott Levy Center for Computing Research Sandia National Laboratories
Josh Hemmert Center for Computing Research Sandia National Laboratories
Kurt Ferreira Center for Computing Research Sandia National Laboratories
Kevin Pedretti Center for Computing Research Sandia National Laboratories

Abstract

Characterizing the reliability of current and recent high-performance computing (HPC) systems is critical for forecasting how future systems may behave and for informing the design of fault tolerance mechanisms. Although memory failures have been studied before, few studies consider their occurrence in the broader context of system power, temperature, and the execution of user jobs. In this paper, we combine job data with existing power, temperature, and memory failure data collected on a petascale HPC system. By focusing on periods when jobs were running on the system, we identified trends that were not evident in earlier studies of the same data. The inclusion of job data also demonstrated how user behavior can affect the occurrence of memory failures. Alongside this paper, we have publicly released the job data used in our analysis to complement the existing publicly available data from Astra on power, temperature, and the occurrence of correctable memory failures.

Keywords: Temperature distribution, High performance computing, Random access memory, Graphics processing units, Focusing, Reliability engineering, Market research, Hardware, Object recognition, Forecasting, high-performance computing, fault tolerance, memory