Improved Failure Detection and Propagation Mechanisms for MPI

  • Pedro Henrique Di Francia Rosso UFABC
  • Emilio Francesquini UFABC

Abstract


The Message Passing Interface (MPI) standard is widely used in High-Performance Computing (HPC) systems. Because such systems employ a large number of computing nodes, and more nodes lead to more frequent failures, Fault Tolerance (FT) is a concern. Two essential components of FT are Failure Detection (FD) and Failure Propagation (FP). This paper proposes improvements to existing FD and FP mechanisms that aim for greater portability and scalability with low overhead. Results show that the proposed methods achieve results better than, or at least similar to, those of existing methods while remaining portable to any standard-compliant MPI distribution.

Keywords: Languages, Compilers, and Tools for Parallel and Distributed Computing, Low Level Software for Parallel and Distributed Computing, Fault Tolerance

Published
2021-05-06
ROSSO, Pedro Henrique Di Francia; FRANCESQUINI, Emilio. Improved Failure Detection and Propagation Mechanisms for MPI. In: REGIONAL SCHOOL OF HIGH PERFORMANCE COMPUTING FROM SÃO PAULO (ERAD-SP), 12., 2021, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 45-48. DOI: https://doi.org/10.5753/eradsp.2021.16702.
