ABSTRACT
In this work, we introduce ◊P-vCube, a push-based failure detector for asynchronous distributed systems. Processes running ◊P-vCube are hierarchically organized on the vCube virtual topology, which presents several logarithmic properties. As there are no bounds on communication delays and process execution times, false suspicions may occur. To ensure eventual accuracy, a process exits the system when it learns that it has been suspected by another process or suspects every other process. We show that ◊P-vCube guarantees completeness in any situation, and eventual accuracy if all correct processes remain strongly connected among themselves. We implemented the proposed algorithm using simulation and presented comparison results with the traditional all-test-all solution and a ring approach. The results demonstrate the trade-off between the speed of detecting actual failures and the number of messages used, as well as the scalability of ◊P-vCube.
- Edson T. Camargo and Elias P Duarte Jr.2018. Running resilient MPI applications on a dynamic group of recommended processes. Journal of the Brazilian Computer Society 24 (2018), 1–16.Google ScholarCross Ref
- Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable Failure Detectors for Reliable Distributed Systems. J. ACM 43, 2 (March 1996), 225–267. https://doi.org/10.1145/226643.226647Google ScholarDigital Library
- Abhinandan Das, Indranil Gupta, and Ashish Motivala. 2002. SWIM: scalable weakly-consistent infection-style process group membership protocol. In Proceedings International Conference on Dependable Systems and Networks. IEEE, Washington, DC, USA, 303–312. https://doi.org/10.1109/DSN.2002.1028914Google ScholarCross Ref
- João Paulo de Araujo, Luciana Arantes, Elias P. Duarte Jr., Luiz A. Rodrigues, and Pierre Sens. 2017. A Publish/Subscribe System Using Causal Broadcast over Dynamically Built Spanning Trees. In 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, Campinas, Brazil, 161–168. https://doi.org/10.1109/SBAC-PAD.2017.28Google ScholarCross Ref
- João Paulo de Araujo, Luciana Arantes, Elias Procópio Duarte Jr., Luiz A. Rodrigues, and Pierre Sens. 2019. VCube-PS: A causal broadcast topic-based publish/subscribe system. J. Parallel Distributed Comput. 125 (2019), 18–30. https://doi.org/10.1016/j.jpdc.2018.10.011Google ScholarCross Ref
- Elias P Duarte, Thiago Garrett, Luis CE Bona, Renato Carmo, and Alexandre P Züge. 2010. Finding stable cliques of PlanetLab nodes. In 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN). IEEE, 317–322. https://doi.org/10.1109/DSN.2010.5544300Google ScholarCross Ref
- Elias P. Duarte Jr., Luiz C. P. Albini, Alessandro Brawerman, and Andre L. P. Guedes. 2009. A hierarquical distributed fault diagnosis algorithm based on clusters with Detours. In 2009 Latin American Network Operations and Management Symposium. IEEE, Punta del Este, Uruguay, 1–6. https://doi.org/10.1109/LANOMS.2009.5338797Google ScholarCross Ref
- Elias P. Duarte Jr., Luiz C. E. Bona, and Vinicius K. Ruoso. 2014. VCube: A Provably Scalable Distributed Diagnosis Algorithm. In 2014 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems. IEEE, New Orleans, LA, USA, 17–22. https://doi.org/10.1109/ScalA.2014.14Google ScholarDigital Library
- Elias P. Duarte Jr. and Takashi Nanya. 1998. A Hierarchical Adaptive Distributed System-Level Diagnosis Algorithm. IEEE Trans. Computers 47, 1 (1998), 34–45. https://doi.org/10.1109/12.656078Google ScholarDigital Library
- Elias P Duarte Jr, Luiz A Rodrigues, Edson T Camargo, and Rogério C Turchetti. 2023. The missing piece: a distributed system-level diagnosis model for the implementation of unreliable failure detectors. Computing (2023), 1–25. https://doi.org/10.1007/s00607-023-01211-8Google ScholarDigital Library
- E Procopio Duarte Jr and Andréa Weber. 2003. A distributed network connectivity algorithm. In The Sixth International Symposium on Autonomous Decentralized Systems, 2003. ISADS 2003. IEEE, 285–292. https://doi.org/10.1109/ISADS.2003.1193959Google ScholarCross Ref
- Vitor Enes, Carlos Baquero, Tuanir França Rezende, Alexey Gotsman, Matthieu Perrin, and Pierre Sutra. 2020. State-Machine Replication for Planet-Scale Systems. In Proceedings of the Fifteenth European Conference on Computer Systems (Heraklion, Greece) (EuroSys ’20). Association for Computing Machinery, New York, NY, USA, Article 24, 15 pages. https://doi.org/10.1145/3342195.3387543Google ScholarDigital Library
- Christof Fetzer and Flaviu Cristian. 1996. Fail-aware failure detectors. In Proceedings 15th Symposium on Reliable Distributed Systems. IEEE, Niagra-on-the-Lake, ON, Canada, 200–209. https://doi.org/10.1109/RELDIS.1996.559722Google ScholarCross Ref
- Ian Gorton. 2022. Foundations of Scalable Systems. " O’Reilly Media, Inc.".Google Scholar
- Naohiro Hayashibara, Xavier Défago, Rami YaredRami, and Takuya Katayama. 2004. The /spl phi/ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.IEEE, Florianopolis, Brazil, 66–78. https://doi.org/10.1109/RELDIS.2004.1353004Google ScholarCross Ref
- Denis Jeanneau, Luiz A. Rodrigues, Luciana Arantes, and Elias P. Duarte Jr.2017. An autonomic hierarchical reliable broadcast protocol for asynchronous distributed systems with failure detection. J. Braz. Comput. Soc. (JBCS) 23, 1 (2017), 15:1–15:14. https://doi.org/10.1186/s13173-017-0064-9Google ScholarCross Ref
- Márk Jelasity, Alberto Montresor, and Ozalp Babaoglu. 2005. Gossip-based aggregation in large dynamic networks. ACM Transactions on Computer Systems (TOCS) 23, 3 (2005), 219–252.Google ScholarDigital Library
- Mikel Larrea, Sergio Arevalo, and Antonio Fernndez. 1999. Efficient Algorithms to Implement Unreliable Failure Detectors in Partially Synchronous Systems. In Distributed Computing, Prasad Jayanti (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 34–49.Google Scholar
- Mikel Larrea, Antonio Fernandez, and Sergio Arevalo. 2000. Optimal implementation of the weakest failure detector for solving consensus. In Proceedings 19th IEEE Symposium on Reliable Distributed Systems SRDS-2000. IEEE, Nurnberg, Germany, 52–59. https://doi.org/10.1109/RELDI.2000.885392Google ScholarCross Ref
- Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos K. Aguilera, and Michael Walfish. 2011. Detecting Failures in Distributed Systems with the Falcon Spy Network. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (Cascais, Portugal) (SOSP ’11). Association for Computing Machinery, New York, NY, USA, 279–294. https://doi.org/10.1145/2043556.2043583Google ScholarDigital Library
- Luiz A. Rodrigues, Luciana Arantes, and Elias P. Duarte Jr.2014. An Autonomic Implementation of Reliable Broadcast Based on Dynamic Spanning Trees. In 2014 Tenth European Dependable Computing Conference. IEEE, Newcastle, UK, 1–12. https://doi.org/10.1109/EDCC.2014.31Google ScholarDigital Library
- Luiz A. Rodrigues, Luciana Arantes, and Elias P. Duarte Jr.2016. An Autonomic Majority Quorum System. In 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA). IEEE, Crans-Montana, Switzerland, 524–531. https://doi.org/10.1109/AINA.2016.73Google ScholarCross Ref
- Luiz A. Rodrigues, Jaime Cohen, Luciana Arantes, and Elias P. Duarte Jr.2013. A Robust Permission-Based Hierarchical Distributed k-Mutual Exclusion Algorithm. In 2013 IEEE 12th International Symposium on Parallel and Distributed Computing. IEEE, Bucharest, Romania, 151–158. https://doi.org/10.1109/ISPDC.2013.28Google ScholarCross Ref
- Luiz A. Rodrigues, Elias P. Duarte Jr., and Luciana Arantes. 2018. A distributed k-mutual exclusion algorithm based on autonomic spanning trees. J. Parallel and Distrib. Comput. 115 (2018), 41–55. https://doi.org/10.1016/j.jpdc.2018.01.008Google ScholarCross Ref
- Anubis Graciela de Moraes Rossetto, Cláudio F R Geyer, Luciana Arantes, and Pierre Sens. 2018. Impact FD: An Unreliable Failure Detector Based on Process Relevance and Confidence in the System. Comput. J. 61, 10 (04 2018), 1557–1576. https://doi.org/10.1093/comjnl/bxy041Google ScholarCross Ref
- Lucas V. Ruchel, Luiz A. Rodrigues, Rogério C. Turchetti, Luciana Arantes, Elias P. Duarte Jr., and Edson T. Camargo. 2023. A Leaderless Hierarchical Atomic Broadcast Algorithm. In Proceedings of the 11th Latin-American Symposium on Dependable Computing (Fortaleza/CE, Brazil) (LADC ’22). Association for Computing Machinery, New York, NY, USA, 61–66. https://doi.org/10.1145/3569902.3569914Google ScholarDigital Library
- Ilya Shnayderman, André Schiper, and Péter Urbán. 2003. Comparison of Failure Detectors and Group Membership: Performance Study of Two Atomic Broadcast Algorithms. In 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE Computer Society, Los Alamitos, CA, USA, 645. https://doi.org/10.1109/DSN.2003.1209974Google ScholarCross Ref
- Péter Urban, Xavier Defago, and André Schiper. 2000. Contention-aware metrics for distributed algorithms: comparison of atomic broadcast algorithms. In 9th Int’l Conf.on Computer Communications and Networks (ICCCN). IEEE, Las Vegas, NV, USA, 582–589. https://doi.org/10.1109/ICCCN.2000.885548Google ScholarCross Ref
- Péter Urban, Xavier Defago, and André Schiper. 2001. Neko: a single environment to simulate and prototype distributed algorithms. In Proceedings 15th International Conference on Information Networking. IEEE, Beppu, Japan, 503–511. https://doi.org/10.1109/ICOIN.2001.905471Google ScholarCross Ref
- Roverli Pereira Ziwich, E.P. Duarte Jr., and Luiz C. P. Albini. 2005. Distributed integrity checking for systems with replicated data. In 11th International Conference on Parallel and Distributed Systems (ICPADS’05), Vol. 1. IEEE, Fukuoka, Japan, 363–369 Vol. 1. https://doi.org/10.1109/ICPADS.2005.130Google ScholarDigital Library
Index Terms
- Diamond-P-vCube: An Eventually Perfect Hierarchical Failure Detector for Asynchronous Distributed Systems
Recommendations
Eventually consistent failure detectors
The concept of unreliable failure detector was introduced by Chandra and Toueg as a mechanism that provides information about process failures. This mechanism has been used to solve different problems in asynchronous systems, in particular the Consensus ...
Asynchronous failure detectors
PODC '12: Proceedings of the 2012 ACM symposium on Principles of distributed computingFailure detectors - oracles that provide information about process crashes - are an important abstraction for crash tolerance in distributed systems. Although current failure-detector theory provides great generality and expressiveness, it also poses ...
A simple and fast asynchronous consensus protocol based on a weak failure detector
The Consensus problem is a fundamental paradigm for fault-tolerant asynchronous systems. It abstracts a family of problems known as Agreement (or Coordination) problems. Any solution to consensus can serve as a basic building block for solving such ...
Comments