Scalable and Decoupled Logging for State Machine Replication
State Machine Replication (SMR) is a widely used approach to making critical services fault tolerant. Support for SMR implementations on shared infrastructures has emerged, allowing wider adoption. However, developers still have to handle non-trivial aspects to build and deploy their dependable services. In this paper, we address the need for recovery to maintain fault-tolerance levels, and propose an approach to: (i) simplify the development of logging; (ii) improve resource sharing in shared infrastructures; and (iii) reduce replication costs in pay-per-use infrastructures. The central idea is to decouple service execution from logging and to offer logging functionality as a service attachable to SMR deployments. Beyond making an SMR simpler to deploy, we show that this approach does not penalize the performance of replicated services, and that a logging service can scale to serve several applications.