A Hierarchical Failure Detection Service with Perfect Semantics

  • Francisco V. Brasileiro UFPB
  • Jorge C. A. de Figueiredo UFPB
  • Lívia M. R. Sampaio UFPB

Abstract


A failure detector is an important abstraction that supports the implementation of higher-level fault-tolerant protocols in asynchronous distributed systems. In this paper we show, via a counterexample, that using the best possible failure detector of a given class is not always the key to achieving the best performance for a specific higher-level consensus protocol. We argue that this behaviour is due to structural limitations of the consensus protocol that are unlikely to be circumvented unless stronger abstractions are provided. Thus, we advocate that the designer of a generic failure detection service should concentrate her efforts on implementing the strongest failure detector possible, even if it is not the best within its class, instead of trying to implement the best failure detector of a weaker class. Following this philosophy, we present the basis for the design of a hierarchical failure detection service with the strongest semantics known, namely that of a perfect failure detector.

Published
21/05/2002
BRASILEIRO, Francisco V.; FIGUEIREDO, Jorge C. A. de; SAMPAIO, Lívia M. R. A Hierarchical Failure Detection Service with Perfect Semantics. In: WORKSHOP DE TESTES E TOLERÂNCIA A FALHAS (WTF), 3., 2002, Búzios/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2002. p. 25-32. ISSN 2595-2684. DOI: https://doi.org/10.5753/wtf.2002.23399.