research-article

Diamond-P-vCube: An Eventually Perfect Hierarchical Failure Detector for Asynchronous Distributed Systems

Authors:
Gabriela Stein

Western Parana State University, Brazil

Western Parana State University, Brazil

0009-0007-7294-4544
View Profile

,
Luiz Antonio Rodrigues

Western Parana State University, Brazil

Western Parana State University, Brazil

0000-0002-9516-1282
View Profile

,
Elias ProcóPio Duarte Jr.

Federal University of Parana, Brazil

Federal University of Parana, Brazil

0000-0002-8916-3302
View Profile

,
Luciana Arantes

Sorbonne Université/LIP6/CNRS, France

Sorbonne Université/LIP6/CNRS, France

0000-0002-0938-2004
View Profile

LADC '23: Proceedings of the 12th Latin-American Symposium on Dependable and Secure ComputingOctober 2023Pages 40–49https://doi.org/10.1145/3615366.3615420

Published:17 October 2023Publication History

LADC '23: Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing

Pages 40–49

ABSTRACT

In this work, we introduce ◊P-vCube, a push-based failure detector for asynchronous distributed systems. Processes running ◊P-vCube are hierarchically organized on the vCube virtual topology, which presents several logarithmic properties. As there are no bounds on communication delays and process execution times, false suspicions may occur. To ensure eventual accuracy, a process exits the system when it learns that it has been suspected by another process or suspects every other process. We show that ◊P-vCube guarantees completeness in any situation, and eventual accuracy if all correct processes remain strongly connected among themselves. We implemented the proposed algorithm using simulation and presented comparison results with the traditional all-test-all solution and a ring approach. The results demonstrate the trade-off between the speed of detecting actual failures and the number of messages used, as well as the scalability of ◊P-vCube.

References

Edson T. Camargo and Elias P Duarte Jr.2018. Running resilient MPI applications on a dynamic group of recommended processes. Journal of the Brazilian Computer Society 24 (2018), 1–16.Google ScholarCross Ref
Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable Failure Detectors for Reliable Distributed Systems. J. ACM 43, 2 (March 1996), 225–267. https://doi.org/10.1145/226643.226647Google ScholarDigital Library
Abhinandan Das, Indranil Gupta, and Ashish Motivala. 2002. SWIM: scalable weakly-consistent infection-style process group membership protocol. In Proceedings International Conference on Dependable Systems and Networks. IEEE, Washington, DC, USA, 303–312. https://doi.org/10.1109/DSN.2002.1028914Google ScholarCross Ref
João Paulo de Araujo, Luciana Arantes, Elias P. Duarte Jr., Luiz A. Rodrigues, and Pierre Sens. 2017. A Publish/Subscribe System Using Causal Broadcast over Dynamically Built Spanning Trees. In 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, Campinas, Brazil, 161–168. https://doi.org/10.1109/SBAC-PAD.2017.28Google ScholarCross Ref
João Paulo de Araujo, Luciana Arantes, Elias Procópio Duarte Jr., Luiz A. Rodrigues, and Pierre Sens. 2019. VCube-PS: A causal broadcast topic-based publish/subscribe system. J. Parallel Distributed Comput. 125 (2019), 18–30. https://doi.org/10.1016/j.jpdc.2018.10.011Google ScholarCross Ref
Elias P Duarte, Thiago Garrett, Luis CE Bona, Renato Carmo, and Alexandre P Züge. 2010. Finding stable cliques of PlanetLab nodes. In 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN). IEEE, 317–322. https://doi.org/10.1109/DSN.2010.5544300Google ScholarCross Ref
Elias P. Duarte Jr., Luiz C. P. Albini, Alessandro Brawerman, and Andre L. P. Guedes. 2009. A hierarquical distributed fault diagnosis algorithm based on clusters with Detours. In 2009 Latin American Network Operations and Management Symposium. IEEE, Punta del Este, Uruguay, 1–6. https://doi.org/10.1109/LANOMS.2009.5338797Google ScholarCross Ref
Elias P. Duarte Jr., Luiz C. E. Bona, and Vinicius K. Ruoso. 2014. VCube: A Provably Scalable Distributed Diagnosis Algorithm. In 2014 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems. IEEE, New Orleans, LA, USA, 17–22. https://doi.org/10.1109/ScalA.2014.14Google ScholarDigital Library
Elias P. Duarte Jr. and Takashi Nanya. 1998. A Hierarchical Adaptive Distributed System-Level Diagnosis Algorithm. IEEE Trans. Computers 47, 1 (1998), 34–45. https://doi.org/10.1109/12.656078Google ScholarDigital Library
Elias P Duarte Jr, Luiz A Rodrigues, Edson T Camargo, and Rogério C Turchetti. 2023. The missing piece: a distributed system-level diagnosis model for the implementation of unreliable failure detectors. Computing (2023), 1–25. https://doi.org/10.1007/s00607-023-01211-8Google ScholarDigital Library
E Procopio Duarte Jr and Andréa Weber. 2003. A distributed network connectivity algorithm. In The Sixth International Symposium on Autonomous Decentralized Systems, 2003. ISADS 2003. IEEE, 285–292. https://doi.org/10.1109/ISADS.2003.1193959Google ScholarCross Ref
Vitor Enes, Carlos Baquero, Tuanir França Rezende, Alexey Gotsman, Matthieu Perrin, and Pierre Sutra. 2020. State-Machine Replication for Planet-Scale Systems. In Proceedings of the Fifteenth European Conference on Computer Systems (Heraklion, Greece) (EuroSys ’20). Association for Computing Machinery, New York, NY, USA, Article 24, 15 pages. https://doi.org/10.1145/3342195.3387543Google ScholarDigital Library
Christof Fetzer and Flaviu Cristian. 1996. Fail-aware failure detectors. In Proceedings 15th Symposium on Reliable Distributed Systems. IEEE, Niagra-on-the-Lake, ON, Canada, 200–209. https://doi.org/10.1109/RELDIS.1996.559722Google ScholarCross Ref
Ian Gorton. 2022. Foundations of Scalable Systems. " O’Reilly Media, Inc.".Google Scholar
Naohiro Hayashibara, Xavier Défago, Rami YaredRami, and Takuya Katayama. 2004. The /spl phi/ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.IEEE, Florianopolis, Brazil, 66–78. https://doi.org/10.1109/RELDIS.2004.1353004Google ScholarCross Ref
Denis Jeanneau, Luiz A. Rodrigues, Luciana Arantes, and Elias P. Duarte Jr.2017. An autonomic hierarchical reliable broadcast protocol for asynchronous distributed systems with failure detection. J. Braz. Comput. Soc. (JBCS) 23, 1 (2017), 15:1–15:14. https://doi.org/10.1186/s13173-017-0064-9Google ScholarCross Ref
Márk Jelasity, Alberto Montresor, and Ozalp Babaoglu. 2005. Gossip-based aggregation in large dynamic networks. ACM Transactions on Computer Systems (TOCS) 23, 3 (2005), 219–252.Google ScholarDigital Library
Mikel Larrea, Sergio Arevalo, and Antonio Fernndez. 1999. Efficient Algorithms to Implement Unreliable Failure Detectors in Partially Synchronous Systems. In Distributed Computing, Prasad Jayanti (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 34–49.Google Scholar
Mikel Larrea, Antonio Fernandez, and Sergio Arevalo. 2000. Optimal implementation of the weakest failure detector for solving consensus. In Proceedings 19th IEEE Symposium on Reliable Distributed Systems SRDS-2000. IEEE, Nurnberg, Germany, 52–59. https://doi.org/10.1109/RELDI.2000.885392Google ScholarCross Ref
Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos K. Aguilera, and Michael Walfish. 2011. Detecting Failures in Distributed Systems with the Falcon Spy Network. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (Cascais, Portugal) (SOSP ’11). Association for Computing Machinery, New York, NY, USA, 279–294. https://doi.org/10.1145/2043556.2043583Google ScholarDigital Library
Luiz A. Rodrigues, Luciana Arantes, and Elias P. Duarte Jr.2014. An Autonomic Implementation of Reliable Broadcast Based on Dynamic Spanning Trees. In 2014 Tenth European Dependable Computing Conference. IEEE, Newcastle, UK, 1–12. https://doi.org/10.1109/EDCC.2014.31Google ScholarDigital Library
Luiz A. Rodrigues, Luciana Arantes, and Elias P. Duarte Jr.2016. An Autonomic Majority Quorum System. In 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA). IEEE, Crans-Montana, Switzerland, 524–531. https://doi.org/10.1109/AINA.2016.73Google ScholarCross Ref
Luiz A. Rodrigues, Jaime Cohen, Luciana Arantes, and Elias P. Duarte Jr.2013. A Robust Permission-Based Hierarchical Distributed k-Mutual Exclusion Algorithm. In 2013 IEEE 12th International Symposium on Parallel and Distributed Computing. IEEE, Bucharest, Romania, 151–158. https://doi.org/10.1109/ISPDC.2013.28Google ScholarCross Ref
Luiz A. Rodrigues, Elias P. Duarte Jr., and Luciana Arantes. 2018. A distributed k-mutual exclusion algorithm based on autonomic spanning trees. J. Parallel and Distrib. Comput. 115 (2018), 41–55. https://doi.org/10.1016/j.jpdc.2018.01.008Google ScholarCross Ref
Anubis Graciela de Moraes Rossetto, Cláudio F R Geyer, Luciana Arantes, and Pierre Sens. 2018. Impact FD: An Unreliable Failure Detector Based on Process Relevance and Confidence in the System. Comput. J. 61, 10 (04 2018), 1557–1576. https://doi.org/10.1093/comjnl/bxy041Google ScholarCross Ref
Lucas V. Ruchel, Luiz A. Rodrigues, Rogério C. Turchetti, Luciana Arantes, Elias P. Duarte Jr., and Edson T. Camargo. 2023. A Leaderless Hierarchical Atomic Broadcast Algorithm. In Proceedings of the 11th Latin-American Symposium on Dependable Computing (Fortaleza/CE, Brazil) (LADC ’22). Association for Computing Machinery, New York, NY, USA, 61–66. https://doi.org/10.1145/3569902.3569914Google ScholarDigital Library
Ilya Shnayderman, André Schiper, and Péter Urbán. 2003. Comparison of Failure Detectors and Group Membership: Performance Study of Two Atomic Broadcast Algorithms. In 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE Computer Society, Los Alamitos, CA, USA, 645. https://doi.org/10.1109/DSN.2003.1209974Google ScholarCross Ref
Péter Urban, Xavier Defago, and André Schiper. 2000. Contention-aware metrics for distributed algorithms: comparison of atomic broadcast algorithms. In 9th Int’l Conf.on Computer Communications and Networks (ICCCN). IEEE, Las Vegas, NV, USA, 582–589. https://doi.org/10.1109/ICCCN.2000.885548Google ScholarCross Ref
Péter Urban, Xavier Defago, and André Schiper. 2001. Neko: a single environment to simulate and prototype distributed algorithms. In Proceedings 15th International Conference on Information Networking. IEEE, Beppu, Japan, 503–511. https://doi.org/10.1109/ICOIN.2001.905471Google ScholarCross Ref
Roverli Pereira Ziwich, E.P. Duarte Jr., and Luiz C. P. Albini. 2005. Distributed integrity checking for systems with replicated data. In 11th International Conference on Parallel and Distributed Systems (ICPADS’05), Vol. 1. IEEE, Fukuoka, Japan, 363–369 Vol. 1. https://doi.org/10.1109/ICPADS.2005.130Google ScholarDigital Library

Index Terms

Diamond-P-vCube: An Eventually Perfect Hierarchical Failure Detector for Asynchronous Distributed Systems
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Fault-tolerant network topologies

Recommendations

Eventually consistent failure detectors

The concept of unreliable failure detector was introduced by Chandra and Toueg as a mechanism that provides information about process failures. This mechanism has been used to solve different problems in asynchronous systems, in particular the Consensus ...
Read More
Asynchronous failure detectors
PODC '12: Proceedings of the 2012 ACM symposium on Principles of distributed computing

Failure detectors - oracles that provide information about process crashes - are an important abstraction for crash tolerance in distributed systems. Although current failure-detector theory provides great generality and expressiveness, it also poses ...
Read More
A simple and fast asynchronous consensus protocol based on a weak failure detector

The Consensus problem is a fundamental paradigm for fault-tolerant asynchronous systems. It abstracts a family of problems known as Agreement (or Coordination) problems. Any solution to consensus can serve as a basic building block for solving such ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in

LADC '23: Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing
October 2023
242 pages
ISBN:9798400708442
DOI:10.1145/3615366

Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 17 October 2023
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Autonomic Systems
Dependability
Distributed Algorithms
Fault-Tolerance
Unreliable Failure Detectors
Qualifiers
- research-article
- Research
- Refereed limited
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 14
  Total Downloads
- Downloads (Last 12 months)14
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format

Diamond-P-vCube: An Eventually Perfect Hierarchical Failure Detector for Asynchronous Distributed Systems

LADC '23: Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Eventually consistent failure detectors

Asynchronous failure detectors

A simple and fast asynchronous consensus protocol based on a weak failure detector

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

HTML Format

Caption

Diamond-P-vCube: An Eventually Perfect Hierarchical Failure Detector for Asynchronous Distributed Systems

LADC '23: Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Eventually consistent failure detectors

Asynchronous failure detectors

A simple and fast asynchronous consensus protocol based on a weak failure detector

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

HTML Format

Share this Publication link

Share on Social Media