Failure Detection in Asynchronous Distributed Systems
Abstract
The ability to detect failures is central to the design of fault-tolerant distributed systems. However, the actual behaviour of a system limits how well such a mechanism can be provided. At one extreme of the spectrum, synchronous systems (i.e., systems with bounded message transmission delays and processing times) allow perfect failure detection to be built simply from local timeouts. At the other extreme, accurate failure detection cannot be achieved in asynchronous systems (i.e., systems with no bounds on message transmission delays and processing times) unless some extra properties can be guaranteed, such as the ones specified in a seminal article by Chandra and Toueg [1]. This paper discusses the requirements and describes the implementation of failure detectors for two important fault-tolerant mechanisms intended for asynchronous environments: process group membership and distributed consensus based on the ◇S failure detector [1]. These implementations are based on a mechanism called the Time Connectivity Indicator, introduced in this paper.
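As an illustration of the timeout-based detection mentioned above, the following is a minimal sketch, not the Time Connectivity Indicator itself. The class name `TimeoutFailureDetector`, its methods, and the explicit `now` timestamps are assumptions made for this example only. Each monitored process is assumed to send periodic heartbeats, and a process is suspected once no heartbeat has been recorded for longer than `timeout`.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TimeoutFailureDetector:
    """Hypothetical heartbeat-based detector with a fixed local timeout."""

    timeout: float  # assumed bound: heartbeat period + maximum message delay
    last_heard: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, process: str, now: float) -> None:
        # Record the local time at which a heartbeat from `process` arrived.
        self.last_heard[process] = now

    def suspects(self, now: float) -> Set[str]:
        # A process is suspected when its last heartbeat is older than `timeout`.
        return {p for p, t in self.last_heard.items() if now - t > self.timeout}


# Usage: heartbeats assumed every 1.0 time units, delay assumed at most 0.5.
fd = TimeoutFailureDetector(timeout=1.5)
fd.heartbeat("p", now=0.0)
fd.heartbeat("q", now=0.0)
fd.heartbeat("p", now=1.0)
print(fd.suspects(now=2.0))  # q has been silent for 2.0 > 1.5 time units
fd.heartbeat("q", now=2.1)   # a late heartbeat removes the suspicion
print(fd.suspects(now=2.2))
```

If the delay and processing bounds really hold, as in a synchronous system, a suspicion raised this way is never mistaken. In an asynchronous system no such `timeout` can be chosen safely, so a slow process such as `q` above may be suspected wrongly and the suspicion revoked later.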
References
Chandra T., Hadzilacos V. and Toueg S., The Weakest Failure Detector for Solving Consensus. Journal of the ACM, 43(4):685--722, July 1996.
Fischer M.J., Lynch N. and Paterson M.S., Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM, 32(2):374--382, April 1985.
Amir, Y., Dolev, D., Kramer, S., Malki, D. Transis: A Communication Subsystem for High Availability. In Proc. of the 22nd Int. Symp. on Fault-Tolerant Comp. pp. 76-84, Boston, July, 1992.
K. Birman. The Process Group Approach to Reliable Distributed Computing. Communications of the ACM, Vol. 36, No. 12, pp. 36-53, December 1993.
P. Ezhilchelvan, R. Macêdo, S. Shrivastava. Newtop: A Fault-Tolerant Group Communication Protocol. In Proc. of the IEEE 15th Int. Conf. on Dist. Comp. Syst. Vancouver, pp. 296-306, 1995.
M. Kaashoek, A. Tanenbaum. Group Communication in the Amoeba Dist. Op. System. In Proc. of the Int. Workshop on Parallel and Distributed Systems, Vol. 5, No. 5, pp. 459-473, May 1994.
M. Shivakant, L. Peterson, R. Schlichting. A Membership Protocol based on Partial Order. In Proc. of the IEEE Int. Working Conf. on Dep. Comp. for Critical Applications, pp 137-145, February, 1991.
M. P. Melliar-Smith, L. E. Moser, V. Agarwala. Processor Membership in Asynchronous Distributed Systems. IEEE Trans. on Parallel and Distributed Systems, 5(5):459-473, May 1994.
R. Renesse, K. Birman, R. Cooper, B. Glade, P. Stephenson. The Horus System. In K. Birman and R. Renesse, editors, Reliable Distributed Computing with the Isis Toolkit, pp. 133-147. IEEE Computer Society Press, Los Alamitos, CA, 1993.
Hurfin, M., Macêdo, R., Raynal, M., Tronel, F. A General Framework to Solve Agreement Problems. Proc. of the IEEE Int. Symp. on Reliable Distributed Systems, SRDS'99, Lausanne. 1999.
Badache, N., Hurfin, M., Macêdo, R. Solving the Consensus Problem in a Mobile Environment. Proc. of the IEEE International Performance, Computing, and Communications Conference, IPCCC'99, Phoenix/Scottsdale, USA: IEEE Press, 1999. pp. 29-35.
Greve, F., Hurfin, M., Macêdo, R., Raynal, M. Consensus Based on Strong Failure Detectors: A Time and Message Efficient Protocol. Lecture Notes in Computer Science, v. 1800, pp. 1258-1267, May 2000.
Rampath, S., Dahbura, A. A Distributed System-Level Diagnosis Algorithm for Arbitrary Network Topologies. IEEE Trans. on Computers, vol. 44, No 4, Feb 1995.
Duarte Jr, L., Nanya, T. A Hierarchical Adaptive Dist. System-level Diagnosis Algorithm. IEEE Trans. on Computers, vol. 47, No 1, Jan 1998.
Lamport, L., Shostak, R., Pease, M. The Byzantine Generals Problem. ACM Trans. Program. Lang. Syst., 4(3):382-401, July 1982.
Cristian, F. Reaching Agreement on Processor-group Membership in Synchronous Distributed Systems. Distributed Comp. 4, 175-187, 1991.
Schiper, A., Early Consensus in an Asynchronous System with a Weak Failure Detector. Distributed Computing, 10:149-157. 1997.
Batalha, M., Macêdo, R. Arquitetura Orientada a Objetos para um Serviço Distribuído de Diagnóstico de Falhas sobre CORBA. Technical Report RT002/2000, Laboratório de Sistemas Distribuídos – LaSiD, UFBA, May/2000.
Macêdo, R. Implementing Failure Detection through the Use of a Self-Tuned Time Connectivity Indicator. Technical Report RT008/98, Laboratório de Sistemas Distribuídos, UFBA, August 1998.