Checkpointing Techniques in Distributed Systems: A Synopsis of Diverse Strategies Over the Last Decades

Henrique Goulart; Álvaro Franco; Odorico Mendizabal

doi:10.5753/wtf.2023.785

Henrique Goulart UFSC
Álvaro Franco UFSC
Odorico Mendizabal UFSC

DOI: https://doi.org/10.5753/wtf.2023.785

Resumo

This paper concisely reviews checkpointing techniques in distributed systems, focusing on various aspects such as coordinated and uncoordinated checkpointing, incremental checkpoints, fuzzy checkpoints, adaptive checkpoint intervals, and kernel-based and user-space checkpoints. The review highlights interesting points, outlines how each checkpoint approach works, and discusses their advantages and drawbacks. It also provides a brief overview of the adoption of checkpoints in different contexts in distributed computing, including Database Management Systems (DBMS), State Machine Replication (SMR), and High-Performance Computing (HPC) environments. Additionally, the paper briefly explores the application of checkpointing strategies in modern cloud and container environments, discussing their role in live migration and application state management. The review offers valuable insights into their adoption and application across various distributed computing contexts by summarizing the historical development, advances, and challenges in checkpointing techniques.

Referências

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. (2016). Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI’16, page 265–283.

Bessani, A., Santos, M., Felix, J., Neves, N., and Correia, M. (2013). On the Efficiency of durable state machine replication. In 2013 USENIX Annual Technical Conference (USENIX ATC 13), pages 169–180, San Jose, CA. USENIX Association.

Castro, M. and Liskov, B. (1999). Practical byzantine fault tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation, OSDI ’99, page 173–186, USA. USENIX Association.

Chandy, K. M. and Lamport, L. (1985). Distributed snapshots: Determining global states of distributed systems. ACM Transactions on Computer Systems (TOCS), 3(1):63–75.

Chandy, K. M. and Ramamoorthy, C. V. (1972). Rollback and recovery strategies for computer programs. IEEE Transactions on computers, 100(6):546–556.

Chen, Y. (2015). Checkpoint and restore of micro-service in docker containers. In 2015 3rd International Conference on Mechatronics and Industrial Informatics (ICMII 2015), pages 915–918. Atlantis Press.

Cully, B., Lefebvre, G., Meyer, D., Feeley, M., Hutchinson, N., and Warfield, A. (2008). Remus: High availability via asynchronous virtual machine replication. In Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, NSDI’08, page 161–174, USA. USENIX Association.

Duell, J. (2005). The design and implementation of berkeley lab’s linux checkpoint/restart. Technical report, Lawrence Berkeley National Laboratory.

Egwutuoha, I. P., Levy, D., Selic, B., and Chen, S. (2013). A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. The Journal of Supercomputing, 65:1302–1326.

Elmore, A. J., Das, S., Agrawal, D., and El Abbadi, A. (2011). Zephyr: live migration in shared nothing databases for elastic cloud platforms. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 301–312.

Elnozahy, E. N., Alvisi, L., Wang, Y.-M., and Johnson, D. B. (2002). A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR), 34(3):375–408.

Elnozahy, E. N., Johnson, D. B., and Zwaenepoel, W. (1992). Measured performance of consistent checkpointing. In Proceedings of the Eleventh Symposium on Reliable Distributed Systems, number CONF.

Frank, A., Baumgartner, M., Salkhordeh, R., and Brinkmann, A. (2021). Improving checkpointing intervals by considering individual job failure probabilities. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 299–309.

Gerofi, B. and Ishikawa, Y. (2011). Workload adaptive checkpoint scheduling of virtual machine replication. In 2011 IEEE 17th Pacific Rim International Symposium on Dependable Computing, pages 204–213.

Janakiraman, G. and Tamir, Y. (1994). Coordinated checkpointing-rollback error recovery for distributed shared memory multicomputers. In Proceedings of IEEE 13th Symposium on Reliable Distributed Systems, pages 42–51. IEEE.

Junior, E. d. A. G. (2020). Redução do custo da durabilidade em replicação máquina de estados através de checkpoints particionados. Master’s thesis, Universidade Federal do Rio Grande.

Kapritsos, M., Wang, Y., Quema, V., Clement, A., Alvisi, L., and Dahlin, M. (2012). All about eve: Execute-verify replication for multi-core servers. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, page 237–250, USA. USENIX Association.

Kotla, R. and Dahlin, M. (2004). High throughput byzantine fault tolerance. In Proceedings of the 2004 International Conference on Dependable Systems and Networks, DSN ’04, page 575, USA. IEEE Computer Society.

Lamport, L. (2019). Time, clocks, and the ordering of events in a distributed system. In Concurrency: the Works of Leslie Lamport, pages 179–196.

Leu, P.-J. and Bhargava, B. (1988). Concurrent robust checkpointing and recovery in distributed systems. In Fourth International Conference on Data Engineering, pages 154–155. IEEE Computer Society.

Marandi, P. J., Bezerra, C. E., and Pedone, F. (2014). Rethinking state-machine replication for parallelism. In 2014 IEEE 34th International Conference on Distributed Computing Systems, pages 368–377. IEEE.

McGee, W. C. (1977). The informatiom management system ims/vs: Part ii: Data base facilities. IBM Syst. J., 16(2):96–122.

Mendizabal, O. M., Dotti, F. L., and Pedone, F. (2016). Analysis of checkpointing overhead in parallel state machine replication. In Proceedings of the 31st Annual ACM Symposium on Applied Computing, pages 534–537.

Mendizabal, O. M., Dotti, F. L., and Pedone, F. (2017). High performance recovery for parallel state machine replication. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 34–44.

Mendizabal, O. M., Jalili Marandi, P., Dotti, F. L., and Pedone, F. (2014). Checkpointing in parallel state-machine replication. In Principles of Distributed Systems: 18th International Conference, OPODIS 2014, Cortina d’Ampezzo, Italy, December 16-19, 2014. Proceedings 18, pages 123–138. Springer.

Mostefaoui, A. and Raynal, M. (1996). Efficient message logging for uncoordinated checkpointing protocols. In Dependable Computing—EDCC-2: Second European Dependable Computing Conference Taormina, Italy, October 2–4, 1996 Proceedings 2, pages 353–364. Springer.

Müller, R. H., Meinhardt, C., and Mendizabal, O. M. (2022). An architecture proposal for checkpoint/restore on stateful containers. In Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, pages 267–270.

Munhoz, V., Castro, M., and Mendizabal, O. (2022). Strategies for fault-tolerant tightly-coupled hpc workloads running on low-budget spot cloud infrastructures. In 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pages 263–272.

Nicolae, B., Li, J., Wozniak, J. M., Bosilca, G., Dorier, M., and Cappello, F. (2020). Deep-freeze: Towards scalable asynchronous checkpointing of deep learning models. In 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID), pages 172–181.

Oh, S. and Kim, J. (2018). Stateful container migration employing checkpoint-based restoration for orchestrated container clusters. In 2018 International Conference on Information and Communication Technology Convergence (ICTC), pages 25–30.

Oppenheimer, G. and Clancy, K. (1968). Considerations for software protection and recovery from hardware failures in a multiaccess, multiprogramming, single processor system. In Proceedings of the December 9-11, 1968, fall joint computer conference, part I, pages 29–37.

Plank, J. S., Beck, M., Kingsley, G., and Li, K. (1995). Libckpt: Transparent checkpointing under UNIX. In USENIX 1995 Technical Conference. USENIX Association.

Schneider, F. B. (1990). Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys (CSUR), 22(4):299–319.

Sousa, P., Bessani, A. N., Correia, M., Neves, N. F., and Verissimo, P. (2009). Highly available intrusion-tolerant services with proactive-reactive recovery. IEEE Transactions on Parallel and Distributed Systems, 21(4):452–465.

Stoyanov, R. and Kollingbaum, M. J. (2018). Efficient live migration of linux containers. In High Performance Computing: ISC High Performance 2018 International Workshops, Frankfurt/Main, Germany, June 28, 2018, Revised Selected Papers 33, pages 184–193. Springer.

Tamir, Y. and Sequin, C. H. (1984). Error recovery in multicomputers using global checkpoints. In 13th International Conference on Parallel Processing, pages 32–41.

Tiwari, D., Gupta, S., and Vazhkudai, S. S. (2014). Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems. In 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 25–36. IEEE.

Xing, E. P., Ho, Q., Dai, W., Kim, J. K., Wei, J., Lee, S., Zheng, X., Xie, P., Kumar, A., and Yu, Y. (2015). Petuum: A new platform for distributed machine learning on big data. IEEE Transactions on Big Data, 1(2):49–67.

Xu, B., Wu, S., Xiao, J., Jin, H., Zhang, Y., Shi, G., Lin, T., Rao, J., Yi, L., and Jiang, J. (2020). Sledge: Towards Efficient Live Migration of Docker Containers. IEEE International Conference on Cloud Computing, CLOUD, 2020-Octob:321–328.

Yan, Y., Gao, Y., Chen, Y., Guo, Z., Chen, B., and Moscibroda, T. (2016). Tr-spark: Transient computing for big data analytics. In Proceedings of the Seventh ACM Symposium on Cloud Computing, SoCC ’16, page 484–496, New York, NY, USA. Association for Computing Machinery.

Zhao, J., Xiang, Y., Lan, T., Huang, H. H., and Subramaniam, S. (2017). Elastic reliability optimization through peer-to-peer checkpointing in cloud computing. IEEE Transactions on Parallel and Distributed Systems, 28(2):491–502.

Zheng, W., Tu, S., Kohler, E., and Liskov, B. (2014). Fast databases with fast durability and recovery through multicore parallelism. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, page 465–477, USA. USENIX Association.

Zhong, H. and Nieh, J. (2001). Crak: Linux checkpoint/restart as a kernel module. Technical Report CUCS-014-01, Department of Computer Science, Columbia University.

Zhou, A., Sun, Q., and Li, J. (2017). Enhancing reliability via checkpointing in cloud computing systems. China Communications, 14(7):1–10.