Detectores Perfeitos em Sistemas Distribuídos Não Síncronos
Abstract
In this paper we show that is possible to implement a perfect failure detector P (one that suspect all faulty processes if and only if those processes failed) in a system weaker than the synchronous system (contradicting a common believe). To realize that, we introduce the partitioned synchronous system (Spa) that is strictly weaker than the conventional synchronous system (it is not always possible to implement global synchronous computations in Spa such as internal clock synchronization). From some properties we introduce (such as strong partitioned synchrony), we show how to implement P over Spa. Moreover, we show that even if strong partitioned synchrony cannot be sustained, we still are able to take advantage of the existing synchronous partitions to improve the robustness of applications, by introducing a partially perfect detector named xP. The necessary properties and algorithms to implement P and xP are presented in the paper, as well as the related correctness proofs.
References
Chandra, T. D. and Toueg, S. (1996). Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267.
Charron-Bost, B., Guerraoui, R., and Schiper, A. (2000). Synchronous system and perfect failure detector: solvability and efficiency issues. In Proceedings of the International Conference on Dependable System and Networks, pages 523–532.
Chen, W., Toueg, S., and Aguilera, M. K. (2000). On the quality of service of failure detectors. In Proceedings of the International Conference on Dependable Systems and Networks (ICDSN/FTCS-30), pages 561–580.
Cristian, F. and Fetzer, C. (1999). The timed asynchronous distributed system model. IEEE Transactions on Parallel and Distributed Systems, 10(6):642–657.
Dwork, C., Lynch, N., and Stockmeyer, L. (1988). Consensus in the presence of partial synchrony. Journal of the ACM, 35(2):288–323.
Falai, L. and Bondavalli, A. (2005). Experimental evaluation of the qos of failure detectors on wide area network. In Proceedings of the International Conference on Dependable Systems and Networks, DSN, pages 624–633.
Fetzer, C. and Cristian, F. (1996). Fail-awareness in timed asynchronous systems. In Proceedings of the 15th Symposium on Principles of Distributed Computing, pages 314–321.
Fisher, M. J., Lynch, N., and Paterson, M. S. (1985). Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382.
Gorender, S. (2005). Um modelo híbrido e adaptativo para sistemas distribuídos tolerantes a falhas. Tese de Doutorado aprovada pelo CIN/UFPE.
Gorender, S., Macêdo, R., and Raynal, M. (2007]). An adaptive programming model for fault-tolerant distributed computing. IEEE Transactions on Dependable and Secure Computing, 4(1):18–31.
Gorender, S. and Macêdo, R. J. d. A. (2002). Um modelo para tolerância a falhas em sistemas distribuídos com qos. In Anais do Simpósio Brasileiro de Redes de Computadores, SBRC 2002, pages 277–292.
Lamport, L., Shostak, R., and Pease, M. (1982). The byzantine generals problem. ACM Trans. Program. Lang. Syst., 4(3):382–401.
Lima, F. R. L. and Macêdo, R. J. d. A. (2005). Adapting failure detectors to communication network load fluctuations using snmp and artificial neural nets. In Second Latin-American Symposium on Dependable Computing (LADC2005), Lecture Notes in Computer Science, pages 191 – 205.
Lynch, N. A. (1996). Distributed Algorithms. Morgan Kaufmann Publishers, Inc.
Macêdo, R. (2000). Failure detection in asynchronous distributed systems. In Proc. of II Workshop on Tests and Fault-Tolerance (II WFT 2000), pages 76–81.
Macêdo, R., Gorender, S., and Cunha, P. (2005). The implementation of a qos-based adaptive distributed system model for fault tolerance. In Anais do Simposio Brasileiro de Redes de Computadores (SBRC 2005), pages 827–840.
Mullender, S. (1993). Distributed Systems. Addison-Wesley Pub Co.
Nunes, R. C. and Jansch-Pôrto, I. (2004). Qos of time-out based self-tuned failure detector: the effects of its communication delay predictor and its safety margin. In IEEE Int. Conf. on Dependable System and Network, track on Performance and Dependability Symposium (DSN-PDS), pages 753–761.
Schiper, A. (1997). Early consensus in an asynchronous systems with a weak failure detector. Distributed Computing, 10(3):149–157.
Veríssimo, P. and Casimiro, A. (2002). The timely computing base model and architecture. IEEE Transactions on Computers, 51(8):916–930.
