Impacts of Three Soft-Fault Models on Hybrid Parallel Asynchronous Iterative Methods

  • Evan Coleman Old Dominion University
  • Erik J. Jensen Old Dominion University
  • Masha Sosonkina Old Dominion University

Abstract


This study examines the soft-error vulnerability of asynchronous iterative methods, focusing on stationary iterative solvers such as Jacobi. The implementations use hybrid parallelism: the computational work is distributed over multiple nodes using MPI and parallelized on each node using OpenMP. A series of experiments measures the impact of an undetected soft fault on an asynchronous iterative method and compares several techniques for simulating a fault and then recovering from its effects. The data show that the two numerical soft-fault models tested here produce behavior bad enough to test a variety of recovery strategies, such as those based on partial checkpointing, more consistently than a “bit-flip” model does.
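As a rough illustration of the "bit-flip" idea mentioned in the abstract, the following minimal sketch flips one bit of one solution component during a Jacobi solve. It is not the authors' implementation: it is synchronous, single-process Python rather than asynchronous MPI/OpenMP code, and the names `flip_bit`, `jacobi`, and `residual_norm`, as well as the `fault` tuple layout, are hypothetical choices made for this example. The two numerical soft-fault models are not specified in the abstract and are therefore not sketched.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant, 63 = sign) of an IEEE 754 double."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def jacobi(A, b, iterations, fault=None):
    """Run synchronous Jacobi sweeps starting from the zero vector.

    fault is None or a hypothetical (iteration, index, bit) tuple: after the
    given sweep, the chosen bit of component `index` is flipped once.
    """
    n = len(b)
    x = [0.0] * n
    for k in range(iterations):
        # Standard Jacobi update: every component uses only the previous iterate.
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if fault is not None and fault[0] == k:
            _, index, bit = fault
            x_new[index] = flip_bit(x_new[index], bit)  # simulated soft fault
        x = x_new
    return x


def residual_norm(A, b, x):
    """Infinity norm of the residual b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))


# Small diagonally dominant test system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]

clean = jacobi(A, b, 50)
faulty = jacobi(A, b, 50, fault=(5, 0, 63))  # flip the sign bit after sweep 5
print(residual_norm(A, b, clean), residual_norm(A, b, faulty))
```

In this toy setting the effect of a flip depends strongly on which bit is hit: a low mantissa bit changes the value only slightly, a sign-bit flip is damped by later sweeps, and a high exponent bit can produce a huge or non-finite value.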
Keywords: Computational modeling, Numerical models, Instruction sets, Iterative methods, Jacobian matrices, Data models, Fault tolerance, Fault modeling, hybrid parallelism, asynchronous iterative methods
Published
24/09/2018
COLEMAN, Evan; JENSEN, Erik J.; SOSONKINA, Masha. Impacts of Three Soft-Fault Models on Hybrid Parallel Asynchronous Iterative Methods. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 30., 2018, Lyon/FR. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 458-465.