To Protect or Not To Protect: Probability-Aware Selective Protection for Sparse Iterative Solvers
Abstract
With the increasing scale of high-performance computing (HPC) systems, transient bit-flip errors are now more likely than ever, posing a threat to long-running scientific applications. A substantial portion of these applications involve simulation of partial differential equations (PDEs), modeling physical processes over discretized spatial and temporal domains, with some requiring solving sparse linear systems of equations. While these applications are often paired with system-level application-agnostic resilience techniques, such as checkpointing and replication, using these techniques imposes significant overhead. In this work, we present a probability-aware framework that produces low-overhead selective protection schemes for the widely used Preconditioned Conjugate Gradient (PCG) method, whose performance can heavily degrade due to error propagation through the sparse matrix-vector multiplication (SpMV) operation. Through the use of a straightforward mathematical model and an optimized machine learning model, our selective protection schemes incorporate error probability to protect only certain crucial operations. An experimental evaluation using 15 matrices from the SuiteSparse Matrix Collection demonstrates that our protection schemes effectively reduce resilience overheads, outperforming two baseline and two existing protection schemes across all error probabilities.
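To make the idea of protecting only some SpMV operations concrete, the following is a minimal sketch and not the authors' framework. It assumes a Jacobi-preconditioned CG on a CSR matrix, a standard column-checksum test on the SpMV result (recomputing on mismatch), and a caller-supplied predicate `protect(it)` that decides which iterations are checked. The names `pcg`, `protect`, `faults`, `flip_bit`, and `laplacian_1d` are hypothetical, and the predicate stands in for the probability-aware decision that the paper derives from its mathematical and machine learning models, which this abstract does not describe.

```python
import struct

def flip_bit(value, bit):
    """Return `value` with one bit of its IEEE 754 double encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped

def spmv(indptr, indices, data, x):
    """y = A x for a matrix A stored in CSR form."""
    return [sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i + 1]))
            for i in range(len(indptr) - 1)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def column_sums(indices, data, n, absolute=False):
    """1^T A (or 1^T |A|), used as a checksum vector."""
    c = [0.0] * n
    for j, v in zip(indices, data):
        c[j] += abs(v) if absolute else v
    return c

def pcg(A, b, protect, faults=None, tol=1e-10, max_iter=500, check_tol=1e-8):
    """Jacobi-preconditioned CG with an optional checksum test on each SpMV.

    protect(it) -> bool selects the iterations whose SpMV is checked
    (hypothetical stand-in for a probability-aware policy).
    faults maps iteration -> (row, bit) to inject one bit flip into A p.
    """
    indptr, indices, data = A
    n = len(b)
    faults = faults or {}
    diag = [0.0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            if indices[k] == i:
                diag[i] = data[k]
    c = column_sums(indices, data, n)
    c_abs = column_sums(indices, data, n, absolute=True)
    x = [0.0] * n
    r = list(b)
    z = [ri / di for ri, di in zip(r, diag)]
    p = list(z)
    rz = dot(r, z)
    bb = dot(b, b)
    detected = 0
    for it in range(max_iter):
        q = spmv(indptr, indices, data, p)
        if it in faults:                       # simulated transient error
            row, bit = faults[it]
            q[row] = flip_bit(q[row], bit)
        if protect(it):
            # In exact arithmetic sum(A p) == (1^T A) p; allow rounding slack.
            bound = check_tol * (1.0 + dot(c_abs, [abs(v) for v in p]))
            if not abs(sum(q) - dot(c, p)) <= bound:   # also true for NaN
                detected += 1
                q = spmv(indptr, indices, data, p)     # recompute
        alpha = rz / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if dot(r, r) <= tol * tol * bb:
            return x, it + 1, detected
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter, detected

def laplacian_1d(n):
    """CSR arrays of the n x n tridiagonal matrix with stencil (-1, 2, -1)."""
    indptr, indices, data = [0], [], []
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data

A = laplacian_1d(20)
b = [1.0] * 20
# Check every SpMV; one exponent-bit flip is injected in iteration 0.
x, iters, detected = pcg(A, b, protect=lambda it: True, faults={0: (0, 55)})
```

Passing `protect=lambda it: False` gives the unprotected solver, and a predicate that returns `True` only for selected iterations gives a selective scheme. The trade-off the abstract points to is that every check adds cost, so how often to check should depend on how likely an error is.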
Keywords:
Linear systems, Error probability, High performance computing, Partial differential equations, Machine learning, Mathematical models, Sparse matrices, Protection, Transient analysis, Resilience, Fault tolerance, Soft errors, Selective protection, Iterative solvers, Preconditioned conjugate gradient
Published
November 13, 2024
How to Cite
JOHNSON, Daniel Ryley; SUN, Hongyang; BOOTH, Joshua Dennis; RAGHAVAN, Padma. To Protect or Not To Protect: Probability-Aware Selective Protection for Sparse Iterative Solvers. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 36., 2024, Hilo/Hawaii. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 229-238.