Fault-tolerance in filter-labeled-stream applications
Abstract
Fault tolerance is a desirable feature in distributed high-performance systems, since applications tend to run for long periods of time and faults become more likely as the number of nodes in the system increases. However, most distributed environments lack fault-tolerance features, since these tend to be hard to implement and use and often hurt performance dramatically. In this paper we discuss how we successfully added fault tolerance to the Anthill distributed programming environment using an application-level checkpoint/rollback solution. The programming model offers an abstraction with which the programmer can easily identify points during execution where the communication pattern is well defined. These points form a consistent cut at which checkpoints can be saved without extra communication, avoiding any domino effect during recovery from faults. We present the new abstractions for fault tolerance, describe how the solution was implemented, and present performance results that show the efficiency of the solution with both regular and irregular applications.
Keywords:
Fault tolerance, Fault tolerant systems, Programming profession, Programming environments, Application software, Availability, Computer architecture, High performance computing, Computer science, Fault diagnosis
Published
24/10/2007
How to Cite
COUTINHO, Bruno; GUEDES, Dorgival; MEIRA JR., Wagner; FERREIRA, Renato A. Fault-tolerance in filter-labeled-stream applications. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 19., 2007, Gramado/RS. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2007. p. 229-236.
