Engineering a Failure Detection Service for Widely Distributed Systems

Bruno G. Catão; Francisco V. Brasileiro; Ana Cristina A. Oliveira

doi:10.5753/wtf.2005.23369

Bruno G. Catão UFCG
Francisco V. Brasileiro UFCG
Ana Cristina A. Oliveira UFCG


Abstract

Unreliable failure detectors are recognized as important building blocks for implementing fault-tolerant distributed systems. Further, there has been considerable discussion on how to equip them with sophisticated features that allow for adaptation, flexible use, scalability, and quality-of-service enforcement. Despite that, we are not aware of any real distributed system that uses a sophisticated failure detection service. In fact, most deployed systems rely on the trivial failure detection scheme provided by the underlying communication technologies (e.g., TCP/IP timeouts). We believe that this state of affairs has two main causes: i) there is no widely supported failure detection service API that incorporates these advanced features in a suitable way; and ii) the benefits of using a sophisticated failure detection service are not clearly understood. This paper targets the first issue by proposing a failure detection service that addresses the main needs of widely distributed systems and implements state-of-the-art failure detection mechanisms. Moreover, to improve the usability of the service, we took special care in designing its programming interface.
