The Missing Link: A Distributed Diagnosis Model for the Implementation of Unreliable Failure Detectors

Elias P. Duarte Jr.; Luiz A. Rodrigues; Edson T. Camargo; Rogério Turchetti

doi:10.5753/wtf.2022.223441

Elias P. Duarte Jr. UFPR
Luiz A. Rodrigues UNIOESTE
Edson T. Camargo UTFPR
Rogério Turchetti UFSM

Abstract

Monitoring computer systems to identify failures is essential for building dependable systems. System-level diagnosis was first proposed in the 1960s as a test-based approach for monitoring and identifying faulty components. Over the past decades, several diagnosis models and strategies have been proposed, based on different fault models and applied to a wide variety of computer systems. In the 1990s, unreliable failure detectors emerged as an abstraction that, by monitoring process failures, allows consensus to be executed in asynchronous systems subject to crash failures. Building on the original model, failure detectors have become the de facto standard for monitoring distributed systems. This work aims to close a conceptual gap by presenting a distributed diagnosis model consistent with unreliable failure detectors. Results are presented on the bounds on the number of monitoring messages and the latency for event detection, as well as on the model's completeness and accuracy.
