Distributed Checkpointing in Dataflow with Static Scheduling

  • Tiago A. O. Alves UERJ


The Dataflow model, in which instructions or tasks fire as soon as their input data is ready, has proven to be a good fit for parallel and distributed computation. Previous work presented DFER (Dataflow Error Recovery Model), which enables recovery from transient errors in dataflow by adding special tasks and edges to the dataflow graph itself. However, DFER does not address permanent faults or faults that cause a processing element (PE) to become unresponsive. Those cases require a checkpointing method. Since the whole purpose of Dataflow is to achieve high levels of parallelism and exploit the potential asynchronicity between PEs, the checkpointing method adopted must be uncoordinated and distributed. Current algorithms for distributed checkpointing rely solely on guaranteeing that causality between checkpoints can be tracked. In the context of Dataflow with static scheduling, i.e., when the dataflow graph is partitioned among the available PEs at compile time, the ability to track causality is not sufficient, as we will show. Because static scheduling of dataflow graphs is important in various scenarios, a new distributed checkpointing algorithm is needed for the execution of statically scheduled dataflow graphs. In this paper we describe why the ability to track causality is not enough for statically scheduled dataflow and introduce a new distributed checkpointing algorithm specifically tailored to this execution model.
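As a rough illustration of the execution model described above (not of the checkpointing algorithm itself), the following sketch simulates the dataflow firing rule on a toy graph whose tasks are assigned to PEs before execution starts. The names `GRAPH`, `PLACEMENT`, and `run` are hypothetical and chosen only for this example, and the simulation is sequential, whereas real PEs would run asynchronously.

```python
from collections import deque

# Hypothetical toy graph: task name -> (function, list of producer tasks).
GRAPH = {
    "a": (lambda: 2, []),
    "b": (lambda: 3, []),
    "mul": (lambda x, y: x * y, ["a", "b"]),
    "inc": (lambda x: x + 1, ["mul"]),
}

# Static scheduling: the task-to-PE mapping is fixed before execution starts.
PLACEMENT = {"a": 0, "b": 1, "mul": 0, "inc": 1}


def run(graph, placement):
    """Fire each task once all of its input operands are available."""
    results = {}  # task -> produced value
    trace = []    # (pe, task) pairs in firing order
    ready = deque(t for t, (_, deps) in graph.items() if not deps)
    while ready:
        task = ready.popleft()
        fn, deps = graph[task]
        results[task] = fn(*(results[d] for d in deps))
        trace.append((placement[task], task))
        # A consumer becomes ready once every one of its producers has fired.
        for other, (_, odeps) in graph.items():
            pending = other not in results and other not in ready
            if pending and all(d in results for d in odeps):
                ready.append(other)
    return results, trace


results, trace = run(GRAPH, PLACEMENT)
```

In this sketch the edges `b -> mul` and `mul -> inc` cross PE boundaries, so their operands would travel as messages between PEs in a distributed setting. Because the placement is fixed, a task assigned to a failed PE cannot simply be fired elsewhere, which is the setting the abstract is concerned with.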
Keywords: dataflow, distributed checkpointing, fault tolerance, parallel programming
ALVES, Tiago A. O. Distributed Checkpointing in Dataflow with Static Scheduling. In: WORKSHOP ON APPLICATIONS FOR MULTI-CORE ARCHITECTURES - INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 35., 2023, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 77-82.