Synchronous Partitioning and its Effects on Fault Tolerance

Raimundo José de Araújo Macêdo

doi:10.5753/wtf.2025.8803

Raimundo José de Araújo Macêdo UFBA https://orcid.org/0000-0002-7672-2208

DOI: https://doi.org/10.5753/wtf.2025.8803

Resumo

Este artigo investiga a resiliência do modelo Síncrono Particionado (PaS)—um modelo híbrido de sistemas distribuídos no qual subsistemas síncronos se comunicam de forma assíncrona. Para isso, avaliamos dois algoritmos de consenso: um novo protocolo baseado em flooding e novas simulações de um protocolo baseado em líder previamente desenvolvido. Os resultados mostram que algoritmos de consenso no modelo PaS alcançam resiliência comparável à de sistemas totalmente síncronos, ao mesmo tempo em que reduzem os custos de implementação, já que apenas as partições precisam operar sob Suposições estritas de sincronismo. O modelo PaS é particularmente adequado para sistemas ciberfísicos, infraestruturas de edge computing e data centers modulares, nos quais o sincronismo localizado pode ser imposto dentro das partições, enquanto a comunicação global permanece assíncrona.

Palavras-chave: Dependabilidade de Sistemas Distribuídos, Tolerância a Falhas, Modelos de Sistemas Distribuídos, Algoritmos Distribuídos, Algoritmos de Consenso

Referências

Attiya, H. and Welch, J. (1998). Distributed Computing: Fundamentals, Simulations, and Advanced Topics. McGraw-Hill.

Blagojevic, A. and de Araújo Macêdo, R. (2012). Simulação de um algoritmo de consenso distribuído no modelo síncrono particionado utilizando o hddss. In Anais da ERBASE 2012 - WTICG 2012. SBC.

Blagojevic, A. P. (2012). Algoritmos de consenso em sistemas distribuídos: Uma análise de desempenho em modelos parcialmente síncrono e síncrono particionado. Monografia de Conclusão, Bacharelado em Ciência da Computação. Orientador: Raimundo Macêdo. IM/DCC/UFBA, 2012.

Casimiro, A. and Veríssimo, P. (1999). Timing failure detection with a timely computing base. In Proceedings of the Third European Research Seminar on Advances in Distributed Systems (ERSADS’99), Madeira Island, Portugal.

Chandra, T. D. and Toueg, S. (1996). Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267.

Cristian, F. (1991). Reaching agreement on processor group membership in synchronous distributed systems. Distributed Computing, 4(4):175–187.

Cristian, F. and Fetzer, C. (1999). The timed asynchronous distributed system model. IEEE Transactions on Parallel and Distributed Systems, 10(6):642–657.

da Silva, W. L. S., Ramos, M. A. D., and Macedo, R. J. d. A. (2018). Byzantine fault tolerance in the partitioned synchronous system model. In 2018 VIII Brazilian Symposium on Computing Systems Engineering (SBESC), pages 106–113.

de Araújo Macêdo, R. J. and Gorender, S. (2008). Detectores perfeitos em sistemas distribuídos não síncronos. In IX Workshop de Teste e Tolerância a Falhas (WTF 2008), Rio de Janeiro, Brazil.

de Araújo Macêdo, R. J. and Gorender, S. (2009). Perfect failure detection in the partitioned synchronous distributed system model. In Proceedings of the Fourth International Conference on Availability, Reliability and Security (ARES 2009), pages 273–280.

de Araújo Macêdo, R. J. and Gorender, S. (2012). Exploiting partitioned synchrony to implement accurate failure detectors. International Journal of Critical Computer-Based Systems, 3(3):168–186.

Dolev, D. and Strong, H. (1983). Authenticated algorithms for byzantine agreement. SIAM Journal of Computing, 12(4):656–66.

Dwork, C., Lynch, N., and Stockmeyer, L. (1988). Consensus in the presence of partial synchrony. Journal of the ACM, 35(2):288–323.

Fetzer, C. (2003). Perfect failure detection in timed asynchronous systems. IEEE Transactions on Computers, 52(2):99–112.

Fischer, M. J., Lynch, N. A., and Lynch, N. A. (1981). A lower bound for the time to assure interactive consistency. Information Processing Letters, 14:183–186.

Fisher, M. J., Lynch, N., and Paterson, M. S. (1985). Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382.

Freitas, A. E. and de Araújo Macêdo, R. J. (2014). A performance evaluation tool for hybrid and dynamic distributed systems. In ACM SIGOPS - Operating Systems Review, Vol. 4, Issue 1, pp. 11-18.

Gorender, S. and de Araújo Macêdo, R. J. (2002). A dynamically qos adaptable consensus and failure detector. In Proceedings of The IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2002 - Fast Abstract Track, pages B80–B81.

Gorender, S. and de Araújo Macêdo, R. J. (2011). Um algoritmo eficiente de consenso distribuído para o modelo síncrono particionado. In Anais do XXIX Simpósio Brasileiro de Redes de Computadores e Sistemas Distribuídos.

Gorender, S., Macêdo, R., and Raynal, M. (2005). A hybrid and adaptive model for fault-tolerant distributed computing. In Proceedings of the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN2005), pages 412–421.

Lamport, L. (2001). Paxos made simple. Distributed Computing Column of ACM SIGACT News, 32(4):51–58.

Lamport, L., Shostak, R., and Pease, M. (1982). The byzantine generals problem. ACM Trans. Program. Lang. Syst., 4(3):382–401.

Larrea, M. (2002). Understanding perfect failure detectors. In PODC ’02: Proceedings of the twenty-first Annual Symposium on Principles of Distributed Computing, pages 257–257, New York, NY, USA. ACM.

Lynch, N. A. (1996). Distributed Algorithms. Morgan Kaufmann Publishers, Inc.

Macêdo, R. J. A., Gorender, S., and Cunha, P. (2005). The implementation of a qos-based adaptive distributed system model for fault tolerance. In Anais do Simposio Brasileiro de Redes de Computadores (SBRC 2005), pages 827–840.

Ongaro, D. and Ousterhout, J. (2014). In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC’14, page 305–320, USA. USENIX Association.

Ramos, M. A. D., Macêdo, R. J. d. A., and Blagojevic, A. (2012). An efficient mutual exclusion algorithm for redundant resources in distributed operating systems. In 2012 Brazilian Symposium on Computing System Engineering, pages 208–213.

Raynal, M. (2018). Fault-Tolerant Message-Passing Distributed Systems: An Algorithmic Approach. Springer.

Santos, M. C. and Gorender, S. (2021). Um algoritmo de membership para o modelo síncrono particionado (spa) de sistemas distribuídos com particionamento forte. In Anais do XXII Workshop de Testes e Tolerância a Falhas, pages 43–56, Porto Alegre, RS, Brasil. SBC.

Veríssimo, P. E. (2006). Travelling through wormholes: a new look at distributed systems models. SIGACT News, 37(1):66–81.

Veríssimo, P. and Casimiro, A. (2002). The timely computing base model and architecture. IEEE Transactions on Computers, 51(8):916–930.