ABSTRACT
Distributed application developers typically use resiliency patterns such as Retry, Circuit Breaker, and Fail Fast to handle remote service failures. However, little research exists on how these patterns affect performance under different operational conditions. This paper presents a controlled experiment assessing the performance of over 100 Retry pattern configurations in Java and C#, using the Resilience4j and Polly libraries, under different workloads and failure rates. Our experimental results indicate that increasing any of the three Retry parameters investigated (i.e., the initial backoff delay, the backoff delay multiplier, and the maximum number of retries) reduces response time but raises execution time, with effects intensifying exponentially as failure rates grow. An analysis using a state-of-the-art model explainer reveals that the impact of the initial backoff delay is twice that of the other parameters at low to moderate failure rates, with more balanced effects at high failure rates. These findings apply to both Resilience4j and Polly, with the impact being slightly higher for Polly due to subtle implementation differences. Our results can benefit both distributed application developers and researchers. Developers can use our findings to tailor the Retry pattern to their applications’ needs. Researchers can build on our work to improve the collective understanding of the impact and implications of resiliency patterns.
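To make the three Retry parameters concrete, the following minimal Python sketch computes the wait before each retry under plain exponential backoff. It is an illustration only: the function names and the example values are assumptions, not taken from the paper, and the delays actually produced by Resilience4j or Polly may differ (for example, when jitter or a delay cap is configured).

```python
def backoff_delays(initial_delay_ms, multiplier, max_retries):
    """Return the wait (in ms) before each retry, assuming plain exponential backoff.

    Hypothetical helper for illustration; no jitter and no upper bound on the delay.
    """
    # The wait before retry k (k = 0, 1, ...) is initial_delay * multiplier^k.
    return [initial_delay_ms * multiplier ** attempt for attempt in range(max_retries)]


def worst_case_wait_ms(initial_delay_ms, multiplier, max_retries):
    """Total time spent waiting if every attempt fails and all retries are used."""
    return sum(backoff_delays(initial_delay_ms, multiplier, max_retries))


if __name__ == "__main__":
    # Illustrative values only: 100 ms initial delay, multiplier 2, 3 retries.
    print(backoff_delays(100, 2, 3))
    print(worst_case_wait_ms(100, 2, 3))
```

Under these assumptions, raising any one of the three parameters lengthens the total time a failing request may spend waiting, which is one way to see why the parameters interact with the failure rate.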