Decentralized Validation for Non-malicious Arbitrary Fault Tolerance in Paxos
Abstract
Fault-tolerant distributed systems offer high reliability because they do not exhibit erroneous behavior even when faults occur in their components. Depending on the fault model adopted, however, hardware and software errors that do not cause a process to crash are usually not tolerated. To tolerate these rather common failures, the usual solution is to adopt a stronger fault model, such as the arbitrary or Byzantine fault model. Algorithms designed for this fault model, however, are considerably more complex and require more system resources than those developed for less strict fault models. One approach to reaching a middle ground is the non-malicious arbitrary fault model. In this paper we describe how we extended an implementation of active replication in the non-malicious arbitrary fault model with a basic form of distributed validation, in which a deviation from the expected algorithm behavior causes a process to crash. We experimentally evaluate this implementation using a fault injection framework, showing that it is feasible to extend the concept of non-malicious failures beyond hardware failures.
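To make the idea of validation that ends in a crash concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It assumes a hypothetical `Accepted` message (a Paxos phase 2b message) and a hypothetical `Validator` class, and it checks a single invariant: within one ballot, all accepted messages must carry the same value. The choice of invariant and the choice of which process stops are assumptions made for illustration only.

```python
from dataclasses import dataclass


class ProtocolViolation(Exception):
    """Raised when a message deviates from the expected Paxos behavior."""


@dataclass(frozen=True)
class Accepted:
    # Hypothetical phase 2b message: `sender` accepted `value` in `ballot`.
    sender: int
    ballot: int
    value: str


class Validator:
    """Remembers the value seen in each ballot and flags conflicting messages."""

    def __init__(self):
        self.value_by_ballot = {}

    def check(self, msg):
        # In Paxos a ballot carries a single value, so two different values
        # for the same ballot indicate a faulty sender.
        known = self.value_by_ballot.setdefault(msg.ballot, msg.value)
        if known != msg.value:
            # Assumption: the violation is left uncaught so the process stops,
            # turning an arbitrary fault into a crash.
            raise ProtocolViolation(
                f"process {msg.sender} sent {msg.value!r} for ballot "
                f"{msg.ballot}, expected {known!r}"
            )
```

In this sketch the check does not depend on message arrival order, so it needs no assumption about FIFO channels. A process would call `check` on every incoming message before acting on it.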