Uso de Broadcast na Sincronização de Checkpoints em Protocolos Minimais (Use of Broadcast for Checkpoint Synchronization in Minimal Protocols)

  • Tiemi C. Sakata UNICAMP
  • Islene C. Garcia UNICAMP
  • Luiz E. Buzato UNICAMP

Abstract


In synchronous checkpointing, an application can be easily recovered after a failure because the processes can roll back to their last checkpoints on stable storage. This article examines minimal protocols, in which only a minimal number of processes take checkpoints to construct a consistent global checkpoint. Cao and Singhal proposed a new approach to developing a minimal protocol: it uses a broadcast to block all processes and centralizes in a single process the task of determining which processes should take checkpoints during the construction of the consistent global checkpoint. This article proves that the protocol proposed by Cao and Singhal is not minimal and proposes a correction that guarantees minimality.
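To make the idea of a centralized decision concrete, the following is a minimal sketch of how a single process could compute which processes need to take checkpoints once every process is blocked. It is not the protocol of Cao and Singhal, nor the correction proposed in this article. It assumes a simplified model in which the deciding process already holds, for each process, the set of processes from which it has received messages since its last checkpoint. The function name `processes_to_checkpoint` and the `received_from` input format are hypothetical.

```python
from collections import deque


def processes_to_checkpoint(initiator, received_from):
    """Return the processes that must checkpoint together with `initiator`.

    received_from[p] is the set of processes from which p has received a
    message since p's last checkpoint (hypothetical input format).
    If p checkpoints, every sender in received_from[p] must also checkpoint,
    so that no recorded receive is left without its recorded send.
    """
    selected = {initiator}
    pending = deque([initiator])
    while pending:
        p = pending.popleft()
        for sender in received_from.get(p, ()):
            if sender not in selected:
                selected.add(sender)
                pending.append(sender)
    return selected


# Illustrative data: p1 received from p2, p2 received from p3,
# and p4 received from p1 (which does not force p4 to checkpoint).
example_dependencies = {
    "p1": {"p2"},
    "p2": {"p3"},
    "p3": set(),
    "p4": {"p1"},
}
example_result = processes_to_checkpoint("p1", example_dependencies)
```

In this example the closure starting at `p1` selects `p1`, `p2`, and `p3`, while `p4` is left out because no selected process has received a message from it. Whether a set computed this way is truly minimal depends on which dependencies are tracked, which is the question the article addresses.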

References

G. Cao and M. Singhal. On Coordinated Checkpointing in Distributed Systems. IEEE Trans. on Parallel and Distributed Systems, 9(12):1213–1225, Dec. 1998.

G. Cao and M. Singhal. Mutable Checkpoints: A New Checkpointing Approach for Mobile Computing Systems. IEEE Transactions on Parallel and Distributed Systems, 12(2):157–172, 2001.

G. Cao and M. Singhal. Checkpointing with Mutable Checkpoints. Theoretical Computer Science, 290(2):1127–1148, Jan. 2003.

K. M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, 3(1):63–75, Feb. 1985.

Ö. Babaoğlu and K. Marzullo. Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms. In S. Mullender, editor, Distributed Systems, pages 55–96. Addison-Wesley, 1993.

I. C. Garcia and L. E. Buzato. Progressive Construction of Consistent Global Checkpoints. In 19th IEEE International Conference on Distributed Computing Systems, Austin, Texas, USA, June 1999.

E. Gendelman, L. Bic, and M. B. Dillencourt. An Efficient Checkpointing Algorithm for Distributed Systems Implementing Reliable Communication Channels. In Symposium on Reliable Distributed Systems, pages 290–291, 1999.

R. Koo and S. Toueg. Checkpointing and Rollback-Recovery for Distributed Systems. IEEE Transactions on Software Engineering, 13:23–31, Jan. 1987.

L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM, 21(7):558–565, July 1978.

P. J. Leu and B. Bhargava. Concurrent Robust Checkpointing and Recovery in Distributed Systems. In 4th IEEE Int. Conference on Data Engineering, pages 154–163, 1988.

R. H. B. Netzer and J. Xu. Necessary and Sufficient Conditions for Consistent Global Snapshots. IEEE Transactions on Parallel and Distributed Systems, 6(2):165–169, 1995.

R. Prakash and M. Singhal. Minimal Global Snapshot and Failure Recovery using Infection. Technical Report OSU-CISRC-12/93-TR42, Department of Computer Science, The Ohio State University, 1993.

R. Prakash and M. Singhal. Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems. IEEE Transactions on Parallel and Distributed Systems, 7(10):1035–1048, Oct. 1996.

B. Randell. System Structure for Software Fault Tolerance. IEEE Transactions on Software Engineering, 1(2):220–232, June 1975.

T. C. Sakata, I. C. Garcia, and L. E. Buzato. Checkpointing Síncrono Bloqueante Minimal com Iniciadores Concorrentes. In Simpósio Brasileiro de Redes de Computadores, pages 681–696, Natal, Rio Grande do Norte, May 2003.

R. Schwarz and F. Mattern. Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail. Distributed Computing, 7(3):149–174, Mar. 1994.

A. S. Tanenbaum and M. van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2002.
Published
2004-05-10
SAKATA, Tiemi C.; GARCIA, Islene C.; BUZATO, Luiz E. Uso de Broadcast na Sincronização de Checkpoints em Protocolos Minimais. In: FAULT TOLERANCE WORKSHOP (WTF), 5., 2004, Gramado/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2004. p. 145-156. ISSN 2595-2684. DOI: https://doi.org/10.5753/wtf.2004.23387.