A Time-Phased Partitioned Checkpoint Approach to Reduce State Snapshot Overhead

  • Everaldo Gomes Junior (USP)
  • Eduardo Alchieri (UnB)
  • Fernando Dotti (UFRGS)
  • Odorico Mendizabal (UFSC)

Abstract


Replication and recovery are essential techniques in developing fault-tolerant systems. Replication enhances availability by ensuring the system remains operational even in the presence of faults, while recovery improves resilience by replacing failed replicas or adding new ones at runtime. To support recovery, replicas must implement durability strategies such as logging, checkpointing, and state transfer. While these strategies enhance overall availability and resilience, they also degrade system performance. Among them, checkpointing is especially expensive because of the synchronization needed to create a consistent snapshot of the replica’s state and the overhead of persistently storing it, which leads to reduced throughput, increased latency, and even momentary service interruptions. To mitigate the performance degradation caused by checkpointing during normal execution, this work proposes a new checkpoint strategy that divides the replica’s state into partitions and snapshots only a few partitions at a time. During checkpointing, incoming requests are delayed only if they access a partition being saved, while replicas continue executing requests directed to other partitions without interruption. Our approach allows different partitions to be checkpointed at different moments while maintaining strong consistency. Applying this approach to Parallel State Machine Replication, we observe a reduction in snapshot duration proportional to the number of partitions, as well as lower client-observed latency during checkpointing. Furthermore, the approach speeds up system recovery by implementing a collaborative state transfer.
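The partition-at-a-time idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the class `PartitionedStore`, its methods, the integer keys, the modulo partitioning, and the in-memory `snapshots` list standing in for stable storage are all assumptions made for illustration. The sketch also leaves out the request ordering and logging that the paper relies on to keep checkpoints taken at different moments strongly consistent.

```python
import copy
import threading


class PartitionedStore:
    """Hypothetical key-value state split into independently locked partitions."""

    def __init__(self, num_partitions):
        self.partitions = [{} for _ in range(num_partitions)]
        self.locks = [threading.Lock() for _ in range(num_partitions)]
        # Stand-in for stable storage: one saved copy per partition.
        self.snapshots = [None] * num_partitions
        self.next_to_checkpoint = 0

    def _index(self, key):
        # Assumption: integer keys, mapped to partitions by modulo.
        return key % len(self.partitions)

    def put(self, key, value):
        i = self._index(key)
        # Blocks only if partition i is currently being snapshotted.
        with self.locks[i]:
            self.partitions[i][key] = value

    def get(self, key):
        i = self._index(key)
        with self.locks[i]:
            return self.partitions[i].get(key)

    def checkpoint_next(self):
        """Snapshot a single partition, then advance round-robin."""
        i = self.next_to_checkpoint
        with self.locks[i]:
            self.snapshots[i] = copy.deepcopy(self.partitions[i])
        self.next_to_checkpoint = (i + 1) % len(self.partitions)
        return i
```

Calling `checkpoint_next()` once per phase walks through all partitions in turn, so each snapshot holds only one partition's lock, and requests whose keys map to other partitions are never delayed by it.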
Keywords: state machine replication, fault tolerance, checkpoint, recovery
Published
16/10/2023
GOMES JUNIOR, Everaldo; ALCHIERI, Eduardo; DOTTI, Fernando; MENDIZABAL, Odorico. A Time-Phased Partitioned Checkpoint Approach to Reduce State Snapshot Overhead. In: LATIN-AMERICAN SYMPOSIUM ON DEPENDABLE COMPUTING (LADC), 12., 2023, La Paz/Bolivia. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 100–109.