Building the Landing Zone for Site Reliability Engineering: A Multivocal Literature Review

  • Luiz Alexandre Costa UNIRIO
  • Awdren Fontão UFMS
  • Eleni Constantinou UCY
  • Rodrigo Pereira dos Santos UNIRIO
  • Alexander Serebrenik TU/e

Abstract


In the era of software-intensive business (SiB), the interdependence between business and software is increasingly prominent. Organizations at the forefront of the digital revolution rely on complex systems to deliver innovative services and products. One of the major challenges is maintaining 24/7 online applications with consistent performance, a task that falls to both the development (Dev) and operations (Ops) teams. Traditionally, Dev focuses on innovation and new features, while Ops ensures the stability of the applications. This division leads to reliability gaps, especially in systems where availability is critical, such as financial services, e-commerce, or mission-critical applications. In such contexts, Site Reliability Engineering (SRE) offers a structured approach that applies Software Engineering (SE) practices to Ops, promotes reliability management through service-level objectives and error budgets, and reduces toil through automation. Given that several emerging practices in SRE occur outside the scientific literature, we chose a multivocal approach to capture both theoretical advances and practical experiences. Our work aims to investigate what the multivocal literature says about the SRE approach, building the landing zone (a comprehensive foundation) for further studies. Thus, a multivocal literature review (MLR) was performed to identify practices, tools, benefits, and future perspectives from the point of view of researchers and practitioners. We selected 28 studies after applying the review procedures. Based on the results, we categorized SRE practices using the Stacey matrix to guide organizations in prioritizing their adoption. In addition, we propose a Flywheel model to illustrate the iterative and continuous nature of SRE implementation.
Keywords: software-intensive business, site reliability engineering, multivocal literature review, software quality

Published
2025-09-22
COSTA, Luiz Alexandre; FONTÃO, Awdren; CONSTANTINOU, Eleni; SANTOS, Rodrigo Pereira dos; SEREBRENIK, Alexander. Building the Landing Zone for Site Reliability Engineering: A Multivocal Literature Review. In: BRAZILIAN SYMPOSIUM ON SOFTWARE ENGINEERING (SBES), 39., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 93-103. ISSN 2833-0633. DOI: https://doi.org/10.5753/sbes.2025.9772.