A Distributed Failure Detection Service Based on Epidemic Dissemination

  • Leandro P. de Sousa UFPR
  • Elias P. Duarte Jr. UFPR

Abstract


Failure detectors are abstractions that can be used to solve consensus in asynchronous systems. This work presents a failure detection service based on a gossip strategy. The service was implemented on the JXTA platform. A simulator was also implemented so that the detector could be evaluated with a larger number of processes. Experimental results report CPU and memory usage, fault and recovery detection times, the mistake rate, and the detector's performance when used in a simple election algorithm. The results indicate that the service scales well as the number of processes grows.

Published
2010-05-28
SOUSA, Leandro P. de; DUARTE JR., Elias P. Um Serviço Distribuído de Detecção de Falhas Baseado em Disseminação Epidêmica. In: FAULT TOLERANCE WORKSHOP (WTF), 11., 2010, Gramado/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2010. p. 31-44. ISSN 2595-2684. DOI: https://doi.org/10.5753/wtf.2010.23094.