Predicting the Reliability Behavior of HPC Applications
Abstract
Current High Performance Computing (HPC) systems already experience error rates on the order of one error every few dozen hours. Understanding the reliability behavior of HPC applications will be required for the next generation of supercomputers. Knowing an application's reliability behavior, one can select efficient mitigation techniques for it and fine-tune parameters such as checkpoint frequency. In this paper, we investigate the use of a machine learning model to predict the reliability behavior of HPC applications. We inject faults into more than 30 HPC applications running on the Intel Xeon Phi Knights Landing (KNL) and use profiling information to build a predictive model with Support Vector Machines (SVM). We show that the model can predict the Program Vulnerability Factor (PVF) with an average relative error of 7% for certain classes of algorithms, such as linear algebra and sorting. The average relative error across all algorithm classes is 22%. Such a fast and straightforward prediction model can serve as an effective filter for selecting the most unreliable applications for in-depth analysis.
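As a rough illustration of the two quantities the abstract relies on, the sketch below computes an average relative error between measured and predicted PVF values and then uses the predictions as a filter to shortlist applications. It is not the authors' implementation: the function names, the application names, and all numeric values are hypothetical, and it assumes every measured PVF is strictly positive.

```python
from statistics import mean


def average_relative_error(measured, predicted):
    """Mean of |predicted - measured| / measured over all applications.

    Assumes every measured value is strictly positive.
    """
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(p - m) / m for m, p in zip(measured, predicted))


def most_vulnerable(predicted_pvf, k):
    """Return the k application names with the highest predicted PVF."""
    ranked = sorted(predicted_pvf.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical values, for illustration only (not results from the paper).
measured = {"app_a": 0.20, "app_b": 0.50, "app_c": 0.40}
predicted = {"app_a": 0.22, "app_b": 0.45, "app_c": 0.40}

names = sorted(measured)
error = average_relative_error(
    [measured[n] for n in names], [predicted[n] for n in names]
)
shortlist = most_vulnerable(predicted, k=2)
```

Here `shortlist` holds the applications a follow-up fault-injection campaign would examine first, while `error` summarizes how far the predictions deviate from the measured values on average.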
Keywords:
Reliability, Benchmark testing, Circuit faults, Predictive models, Hardware, Fluid dynamics, Error analysis
Published
24/09/2018
How to Cite
OLIVEIRA, Daniel; MOREIRA, Francis Birck; RECH, Paolo; NAVAUX, Philippe. Predicting the Reliability Behavior of HPC Applications. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 30., 2018, Lyon/FR. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 124-131.
