skip to main content
10.1145/3615366.3615420acmotherconferencesArticle/Chapter ViewAbstractPublication PagesladcConference Proceedingsconference-collections
research-article

Diamond-P-vCube: An Eventually Perfect Hierarchical Failure Detector for Asynchronous Distributed Systems

Published:17 October 2023Publication History

ABSTRACT

In this work, we introduce ◊P-vCube, a push-based failure detector for asynchronous distributed systems. Processes running ◊P-vCube are hierarchically organized on the vCube virtual topology, which presents several logarithmic properties. As there are no bounds on communication delays and process execution times, false suspicions may occur. To ensure eventual accuracy, a process exits the system when it learns that it has been suspected by another process or suspects every other process. We show that ◊P-vCube guarantees completeness in any situation, and eventual accuracy if all correct processes remain strongly connected among themselves. We implemented the proposed algorithm using simulation and presented comparison results with the traditional all-test-all solution and a ring approach. The results demonstrate the trade-off between the speed of detecting actual failures and the number of messages used, as well as the scalability of ◊P-vCube.

References

  1. Edson T. Camargo and Elias P Duarte Jr.2018. Running resilient MPI applications on a dynamic group of recommended processes. Journal of the Brazilian Computer Society 24 (2018), 1–16.Google ScholarGoogle ScholarCross RefCross Ref
  2. Tushar Deepak Chandra and Sam Toueg. 1996. Unreliable Failure Detectors for Reliable Distributed Systems. J. ACM 43, 2 (March 1996), 225–267. https://doi.org/10.1145/226643.226647Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Abhinandan Das, Indranil Gupta, and Ashish Motivala. 2002. SWIM: scalable weakly-consistent infection-style process group membership protocol. In Proceedings International Conference on Dependable Systems and Networks. IEEE, Washington, DC, USA, 303–312. https://doi.org/10.1109/DSN.2002.1028914Google ScholarGoogle ScholarCross RefCross Ref
  4. João Paulo de Araujo, Luciana Arantes, Elias P. Duarte Jr., Luiz A. Rodrigues, and Pierre Sens. 2017. A Publish/Subscribe System Using Causal Broadcast over Dynamically Built Spanning Trees. In 2017 29th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, Campinas, Brazil, 161–168. https://doi.org/10.1109/SBAC-PAD.2017.28Google ScholarGoogle ScholarCross RefCross Ref
  5. João Paulo de Araujo, Luciana Arantes, Elias Procópio Duarte Jr., Luiz A. Rodrigues, and Pierre Sens. 2019. VCube-PS: A causal broadcast topic-based publish/subscribe system. J. Parallel Distributed Comput. 125 (2019), 18–30. https://doi.org/10.1016/j.jpdc.2018.10.011Google ScholarGoogle ScholarCross RefCross Ref
  6. Elias P Duarte, Thiago Garrett, Luis CE Bona, Renato Carmo, and Alexandre P Züge. 2010. Finding stable cliques of PlanetLab nodes. In 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN). IEEE, 317–322. https://doi.org/10.1109/DSN.2010.5544300Google ScholarGoogle ScholarCross RefCross Ref
  7. Elias P. Duarte Jr., Luiz C. P. Albini, Alessandro Brawerman, and Andre L. P. Guedes. 2009. A hierarquical distributed fault diagnosis algorithm based on clusters with Detours. In 2009 Latin American Network Operations and Management Symposium. IEEE, Punta del Este, Uruguay, 1–6. https://doi.org/10.1109/LANOMS.2009.5338797Google ScholarGoogle ScholarCross RefCross Ref
  8. Elias P. Duarte Jr., Luiz C. E. Bona, and Vinicius K. Ruoso. 2014. VCube: A Provably Scalable Distributed Diagnosis Algorithm. In 2014 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems. IEEE, New Orleans, LA, USA, 17–22. https://doi.org/10.1109/ScalA.2014.14Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Elias P. Duarte Jr. and Takashi Nanya. 1998. A Hierarchical Adaptive Distributed System-Level Diagnosis Algorithm. IEEE Trans. Computers 47, 1 (1998), 34–45. https://doi.org/10.1109/12.656078Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Elias P Duarte Jr, Luiz A Rodrigues, Edson T Camargo, and Rogério C Turchetti. 2023. The missing piece: a distributed system-level diagnosis model for the implementation of unreliable failure detectors. Computing (2023), 1–25. https://doi.org/10.1007/s00607-023-01211-8Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. E Procopio Duarte Jr and Andréa Weber. 2003. A distributed network connectivity algorithm. In The Sixth International Symposium on Autonomous Decentralized Systems, 2003. ISADS 2003. IEEE, 285–292. https://doi.org/10.1109/ISADS.2003.1193959Google ScholarGoogle ScholarCross RefCross Ref
  12. Vitor Enes, Carlos Baquero, Tuanir França Rezende, Alexey Gotsman, Matthieu Perrin, and Pierre Sutra. 2020. State-Machine Replication for Planet-Scale Systems. In Proceedings of the Fifteenth European Conference on Computer Systems (Heraklion, Greece) (EuroSys ’20). Association for Computing Machinery, New York, NY, USA, Article 24, 15 pages. https://doi.org/10.1145/3342195.3387543Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Christof Fetzer and Flaviu Cristian. 1996. Fail-aware failure detectors. In Proceedings 15th Symposium on Reliable Distributed Systems. IEEE, Niagra-on-the-Lake, ON, Canada, 200–209. https://doi.org/10.1109/RELDIS.1996.559722Google ScholarGoogle ScholarCross RefCross Ref
  14. Ian Gorton. 2022. Foundations of Scalable Systems. " O’Reilly Media, Inc.".Google ScholarGoogle Scholar
  15. Naohiro Hayashibara, Xavier Défago, Rami YaredRami, and Takuya Katayama. 2004. The /spl phi/ accrual failure detector. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems, 2004.IEEE, Florianopolis, Brazil, 66–78. https://doi.org/10.1109/RELDIS.2004.1353004Google ScholarGoogle ScholarCross RefCross Ref
  16. Denis Jeanneau, Luiz A. Rodrigues, Luciana Arantes, and Elias P. Duarte Jr.2017. An autonomic hierarchical reliable broadcast protocol for asynchronous distributed systems with failure detection. J. Braz. Comput. Soc. (JBCS) 23, 1 (2017), 15:1–15:14. https://doi.org/10.1186/s13173-017-0064-9Google ScholarGoogle ScholarCross RefCross Ref
  17. Márk Jelasity, Alberto Montresor, and Ozalp Babaoglu. 2005. Gossip-based aggregation in large dynamic networks. ACM Transactions on Computer Systems (TOCS) 23, 3 (2005), 219–252.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Mikel Larrea, Sergio Arevalo, and Antonio Fernndez. 1999. Efficient Algorithms to Implement Unreliable Failure Detectors in Partially Synchronous Systems. In Distributed Computing, Prasad Jayanti (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 34–49.Google ScholarGoogle Scholar
  19. Mikel Larrea, Antonio Fernandez, and Sergio Arevalo. 2000. Optimal implementation of the weakest failure detector for solving consensus. In Proceedings 19th IEEE Symposium on Reliable Distributed Systems SRDS-2000. IEEE, Nurnberg, Germany, 52–59. https://doi.org/10.1109/RELDI.2000.885392Google ScholarGoogle ScholarCross RefCross Ref
  20. Joshua B. Leners, Hao Wu, Wei-Lun Hung, Marcos K. Aguilera, and Michael Walfish. 2011. Detecting Failures in Distributed Systems with the Falcon Spy Network. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (Cascais, Portugal) (SOSP ’11). Association for Computing Machinery, New York, NY, USA, 279–294. https://doi.org/10.1145/2043556.2043583Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Luiz A. Rodrigues, Luciana Arantes, and Elias P. Duarte Jr.2014. An Autonomic Implementation of Reliable Broadcast Based on Dynamic Spanning Trees. In 2014 Tenth European Dependable Computing Conference. IEEE, Newcastle, UK, 1–12. https://doi.org/10.1109/EDCC.2014.31Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Luiz A. Rodrigues, Luciana Arantes, and Elias P. Duarte Jr.2016. An Autonomic Majority Quorum System. In 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA). IEEE, Crans-Montana, Switzerland, 524–531. https://doi.org/10.1109/AINA.2016.73Google ScholarGoogle ScholarCross RefCross Ref
  23. Luiz A. Rodrigues, Jaime Cohen, Luciana Arantes, and Elias P. Duarte Jr.2013. A Robust Permission-Based Hierarchical Distributed k-Mutual Exclusion Algorithm. In 2013 IEEE 12th International Symposium on Parallel and Distributed Computing. IEEE, Bucharest, Romania, 151–158. https://doi.org/10.1109/ISPDC.2013.28Google ScholarGoogle ScholarCross RefCross Ref
  24. Luiz A. Rodrigues, Elias P. Duarte Jr., and Luciana Arantes. 2018. A distributed k-mutual exclusion algorithm based on autonomic spanning trees. J. Parallel and Distrib. Comput. 115 (2018), 41–55. https://doi.org/10.1016/j.jpdc.2018.01.008Google ScholarGoogle ScholarCross RefCross Ref
  25. Anubis Graciela de Moraes Rossetto, Cláudio F R Geyer, Luciana Arantes, and Pierre Sens. 2018. Impact FD: An Unreliable Failure Detector Based on Process Relevance and Confidence in the System. Comput. J. 61, 10 (04 2018), 1557–1576. https://doi.org/10.1093/comjnl/bxy041Google ScholarGoogle ScholarCross RefCross Ref
  26. Lucas V. Ruchel, Luiz A. Rodrigues, Rogério C. Turchetti, Luciana Arantes, Elias P. Duarte Jr., and Edson T. Camargo. 2023. A Leaderless Hierarchical Atomic Broadcast Algorithm. In Proceedings of the 11th Latin-American Symposium on Dependable Computing (Fortaleza/CE, Brazil) (LADC ’22). Association for Computing Machinery, New York, NY, USA, 61–66. https://doi.org/10.1145/3569902.3569914Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Ilya Shnayderman, André Schiper, and Péter Urbán. 2003. Comparison of Failure Detectors and Group Membership: Performance Study of Two Atomic Broadcast Algorithms. In 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE Computer Society, Los Alamitos, CA, USA, 645. https://doi.org/10.1109/DSN.2003.1209974Google ScholarGoogle ScholarCross RefCross Ref
  28. Péter Urban, Xavier Defago, and André Schiper. 2000. Contention-aware metrics for distributed algorithms: comparison of atomic broadcast algorithms. In 9th Int’l Conf.on Computer Communications and Networks (ICCCN). IEEE, Las Vegas, NV, USA, 582–589. https://doi.org/10.1109/ICCCN.2000.885548Google ScholarGoogle ScholarCross RefCross Ref
  29. Péter Urban, Xavier Defago, and André Schiper. 2001. Neko: a single environment to simulate and prototype distributed algorithms. In Proceedings 15th International Conference on Information Networking. IEEE, Beppu, Japan, 503–511. https://doi.org/10.1109/ICOIN.2001.905471Google ScholarGoogle ScholarCross RefCross Ref
  30. Roverli Pereira Ziwich, E.P. Duarte Jr., and Luiz C. P. Albini. 2005. Distributed integrity checking for systems with replicated data. In 11th International Conference on Parallel and Distributed Systems (ICPADS’05), Vol. 1. IEEE, Fukuoka, Japan, 363–369 Vol. 1. https://doi.org/10.1109/ICPADS.2005.130Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Diamond-P-vCube: An Eventually Perfect Hierarchical Failure Detector for Asynchronous Distributed Systems

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Other conferences
      LADC '23: Proceedings of the 12th Latin-American Symposium on Dependable and Secure Computing
      October 2023
      242 pages
      ISBN:9798400708442
      DOI:10.1145/3615366

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 17 October 2023

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article
      • Research
      • Refereed limited
    • Article Metrics

      • Downloads (Last 12 months)14
      • Downloads (Last 6 weeks)2

      Other Metrics

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    HTML Format

    View this article in HTML Format .

    View HTML Format