An Algorithm-Based Fault Tolerance Strategy for the Bitonic Sort Parallel Algorithm
ResumoHigh Performance Computing (HPC) systems are employed to solve hard problems and rely on parallel algorithms which present very long execution times - up to several days. These systems are expensive in terms of the computational resources required, including energy consumption. Thus, after failures occur it is highly desirable to loose as little of the work that has already been done as possible. In this work we present an Algorithm-Based Fault Tolerance (ABFT) strategy that can be applied to make a robust version of any hypercube-based parallel algorithm. Note that we do not assume a physical hypercube: after nodes crash, fault-free nodes autonomously adapt themselves according to a logical topology called VCube, preserving several logarithmic properties. The proposed strategy guarantees that the algorithm does not halt even after up to (N - 1) nodes crash, in a system of N nodes. We use parallel sorting as a case study, describing how to make a fault-tolerant version of the Bitonic Sort parallel algorithm. The algorithm was implemented in MPI using ULMF to handle faults. Experimental results are presented showing the performance and robustness of the proposed solution.
Palavras-chave: Fault tolerance, Runtime, Fault tolerant systems, Hypercubes, Computer crashes, Robustness, Topology
CAMARGO, Edson T.; DUARTE, Elias P.. An Algorithm-Based Fault Tolerance Strategy for the Bitonic Sort Parallel Algorithm. In: LATIN-AMERICAN SYMPOSIUM ON DEPENDABLE COMPUTING (LADC), 10. , 2021, Florianópolis. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021 .