Achieving Enhanced Performance Combining Checkpointing and Dynamic State Partitioning

  • Henrique S. Goulart (UFSC)
  • João Trombeta (UFSC)
  • Álvaro Franco (UFSC)
  • Odorico M. Mendizabal (UFSC)


Fault-tolerant systems rely on recovery techniques to enhance system resilience. To this end, checkpointing procedures periodically take snapshots of the system state during failure-free operation, enabling recovery processes to resume from a previously saved, consistent state. Saving checkpoints, however, is costly, because snapshots must be synchronized with the processing of incoming requests to avoid inconsistency. One way to speed up checkpointing is to partition the service state, allowing a parallel checkpoint procedure to operate independently on each partition. State partitioning can also improve throughput by increasing parallelism in request processing. However, variations in the data access pattern over time can leave partitions unbalanced, which makes optimal performance hard to achieve. In this paper, aiming to improve both checkpointing and overall system performance, we combine parallel checkpointing with a dynamic graph-based repartitioning algorithm. This work formalizes the optimization problem and presents a detailed performance assessment of the proposed approach. The experimental evaluation highlights the benefits of parallel checkpointing and emphasizes the performance gains achieved with repartitioning under realistic workloads. Comparing a cost-effective round-robin partitioning approach with our dynamic method, we examine the degree of execution parallelism achieved by checkpointing threads and the influence of repartitioning strategies on checkpoint performance. Although rebalancing state partitions incurs a cost, our technique hides that cost by performing the rebalancing during the processing idleness that occurs while snapshots are taken.
Keywords: fault tolerance, checkpoint/restore, state partitioning, graph partitioning algorithms
GOULART, Henrique S.; TROMBETA, João; FRANCO, Álvaro; MENDIZABAL, Odorico M. Achieving Enhanced Performance Combining Checkpointing and Dynamic State Partitioning. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 35., 2023, Porto Alegre/RS. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 149-159.