Using Broadcast to Synchronize Checkpoints in Minimal Protocols
Abstract
In synchronous checkpointing protocols, an application can be easily restored after a failure, because it suffices for all processes to roll back to their last saved checkpoint. In this paper, we explore the class of minimal synchronous protocols, in which a minimal number of processes save checkpoints at each invocation of the protocol in order to build a consistent global checkpoint. Cao and Singhal proposed a new approach to designing a minimal protocol that uses broadcast to block all processes and centralizes in a single process the task of determining which processes must save checkpoints during the construction of the consistent global checkpoint. In this text, we show that the protocol of Cao and Singhal is not minimal and propose a correction that makes it minimal.
References
G. Cao and M. Singhal. On Coordinated Checkpointing in Distributed Systems. IEEE Trans. on Parallel and Distributed Systems, 9(12):1213–1225, Dec. 1998.
G. Cao and M. Singhal. Mutable Checkpoints: A New Checkpointing Approach for Mobile Computing Systems. IEEE Transactions on Parallel and Distributed Systems, 12(2):157–172, 2001.
G. Cao and M. Singhal. Checkpointing with Mutable Checkpoints. Theoretical Computer Science, 290(2):1127–1148, Jan. 2003.
K. M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, 3(1):63–75, Feb. 1985.
Ö. Babaoğlu and K. Marzullo. Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms. In S. Mullender, editor, Distributed Systems, pages 55–96. Addison-Wesley, 1993.
I. C. Garcia and L. E. Buzato. Progressive Construction of Consistent Global Checkpoints. In 19th IEEE International Conference on Distributed Computing Systems, Austin, Texas, USA, June 1999.
E. Gendelman, L. Bic, and M. B. Dillencourt. An Efficient Checkpointing Algorithm for Distributed Systems Implementing Reliable Communication Channels. In Symposium on Reliable Distributed Systems, pages 290–291, 1999.
R. Koo and S. Toueg. Checkpointing and Rollback-Recovery for Distributed Systems. IEEE Transactions on Software Engineering, 13:23–31, Jan. 1987.
L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM, 21(7):558–565, July 1978.
P. J. Leu and B. Bhargava. Concurrent Robust Checkpointing and Recovery in Distributed Systems. In 4th IEEE Int. Conference on Data Engineering, pages 154–163, 1988.
R. H. B. Netzer and J. Xu. Necessary and Sufficient Conditions for Consistent Global Snapshots. IEEE Transactions on Parallel and Distributed Systems, 6(2):165–169, 1995.
R. Prakash and M. Singhal. Minimal Global Snapshot and Failure Recovery using Infection. Technical Report OSU-CISRC-12/93-TR42, Department of Computer Science, The Ohio State University, 1993.
R. Prakash and M. Singhal. Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems. IEEE Transactions on Parallel and Distributed Systems, 7(10):1035–1048, Oct. 1996.
B. Randell. System Structure for Software Fault Tolerance. IEEE Transactions on Software Engineering, 1(2):220–232, June 1975.
T. C. Sakata, I. C. Garcia, and L. E. Buzato. Checkpointing Síncrono Bloqueante Minimal com Iniciadores Concorrentes. In Simpósio Brasileiro de Redes de Computadores, pages 681–696, Natal, Rio Grande do Norte, May 2003.
R. Schwarz and F. Mattern. Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail. Distributed Computing, 7(3):149–174, Mar. 1994.
A. S. Tanenbaum and M. van Steen. Distributed Systems: Principles and Paradigms. Prentice Hall, 2002.
Published
10 May 2004
How to Cite
SAKATA, Tiemi C.; GARCIA, Islene C.; BUZATO, Luiz E. Uso de Broadcast na Sincronização de Checkpoints em Protocolos Minimais. In: WORKSHOP DE TESTES E TOLERÂNCIA A FALHAS (WTF), 5., 2004, Gramado/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2004. p. 145-156. ISSN 2595-2684. DOI: https://doi.org/10.5753/wtf.2004.23387.