Using vCube for Failure Detection in Asynchronous Systems

Gabriela Stein; Luiz Antonio Rodrigues; Elias Procópio Duarte Jr.

doi:10.5753/wtf.2023.800

Gabriela Stein UNIOESTE
Luiz Antonio Rodrigues UNIOESTE
Elias Procópio Duarte Jr. UFPR

Abstract

This work presents a solution for failure detection in asynchronous distributed systems. Any pair of processes in the system can test each other, but the test graph is maintained according to the vCube virtual topology. Because there are no bounds on process execution time or on communication delay, false suspicions can be raised. To improve the accuracy of the detector, a process that learns it has been suspected by another process leaves the system. The proposed algorithm was compared with a typical all-to-all solution. The results show that, although the diagnosis latency for failures and false suspicions is higher, the number of messages and the execution time decrease, relative to the all-to-all solution, as the number of processes grows.
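To make the comparison with all-to-all testing concrete, the sketch below builds the vCube clusters as they are usually defined in the vCube literature and counts the tests per round when no process is suspected. It is an illustration under stated assumptions, not the algorithm evaluated in this work: the function names `cluster`, `fault_free_tests`, and `all_to_all_tests` are hypothetical, the number of processes `n` is assumed to be a power of two, and the rule that process `i` tests `i XOR 2^(s-1)` in each cluster `s` is assumed to hold only in the fault-free case.

```python
def cluster(i: int, s: int) -> list[int]:
    """Ordered list c(i, s): the processes in cluster s of process i.

    Assumed recursive definition: the cluster starts with i XOR 2^(s-1),
    followed by that process's own clusters 1 .. s-1.
    """
    head = i ^ (1 << (s - 1))
    members = [head]
    for k in range(1, s):
        members.extend(cluster(head, k))
    return members


def fault_free_tests(n: int) -> list[tuple[int, int]]:
    """(tester, tested) pairs per round when every process is correct.

    Assumption: with no suspicions, process i tests only the first
    process of each of its log2(n) clusters, forming a hypercube.
    """
    dims = n.bit_length() - 1  # log2(n) for n a power of two
    return [(i, i ^ (1 << (s - 1))) for i in range(n) for s in range(1, dims + 1)]


def all_to_all_tests(n: int) -> list[tuple[int, int]]:
    """(tester, tested) pairs per round when every process tests every other."""
    return [(i, j) for i in range(n) for j in range(n) if i != j]


if __name__ == "__main__":
    for n in (8, 64):
        print(n, len(fault_free_tests(n)), len(all_to_all_tests(n)))
```

Under these assumptions the vCube test graph needs n·log2(n) tests per round against n·(n-1) for all-to-all: 24 versus 56 for 8 processes, and 384 versus 4032 for 64 processes. The gap widens as the system grows, which is consistent with the trend in message counts reported above.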
