Evolving Paxos for the Non-Malicious Arbitrary Model

Enrique S. dos Santos; Rodrigo R. Barbieri; Gustavo M. D. Vieira

doi:10.5753/wtf.2023.720

Enrique S. dos Santos UFSCar
Rodrigo R. Barbieri UFSCar
Gustavo M. D. Vieira UFSCar

DOI: https://doi.org/10.5753/wtf.2023.720

Resumo

The non-malicious arbitrary fault model presents a compromise between the weak fault tolerance of the fail-stop fault model and the complexity of the Byzantine fault model. This model describes processes that are capable of using validation codes to detect and drop invalid messages. In this paper we propose a more robust non-malicious version of the distributed consensus algorithm Paxos that can withstand algorithm-level (distributed) faults, while ensuring the application will only see correct states. Our approach to extend the non-malicious model beyond local validations is to define global invariants that must be respected by the algorithm and track them as the algorithm executes. Experimental work shows our solution works for a representative subset of faults and that the performance cost is minimal compared to a local-only non-malicious implementation based on validation codes.

Referências

Barbieiri, R. R., dos Santos, E. S., and Vieira, G. M. D. (2019). Decentralized validation for non-malicious arbitrary fault tolerance in Paxos. In WTF SBRC ’19: Proc. of the 20th Workshop on Tests and Fault Tolerance, WTF SBRC ’19, Gramado, RS. SBC.

Barbieri, R. R. and Vieira, G. M. D. (2015). Hardened Paxos through consistency validation. In SBESC ’15: Proceedings of the V Brazilian Symposium on Computing Systems Engineering, SBESC ’15, pages 13–18, Foz do Iguaçu, Brazil. IEEE Computer Society.

Behrens, D., Serafini, M., Arnautov, S., Junqueira, F. P., and Fetzer, C. (2015). Scalable error isolation for distributed systems. In Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation, NSDI’15, page 605–620, USA. USENIX Association.

Behrens, D., Weigert, S., and Fetzer, C. (2013). Automatically tolerating arbitrary faults in non-malicious settings. In Dependable Computing (LADC), 2013 Sixth Latin-American Symposium on, pages 114–123.

Bhatotia, P., Wieder, A., Rodrigues, R., Junqueira, F., and Reed, B. (2010). Reliable datacenter scale computations. In Proceedings of the 4th International Workshop on Large Scale Distributed Systems and Middleware, LADIS ’10, pages 1–6, New York, NY, USA. ACM.

Cachin, C., Guerraoui, R., and Rodrigues, L. (2011). Introduction to reliable and secure distributed programming. Springer.

Chandra, T. D., Griesemer, R., and Redstone, J. (2007). Paxos made live: an engineering perspective. In PODC ’07: Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing, pages 398–407, New York, NY, USA. ACM Press.

Correia, M., Ferro, D. G., Junqueira, F. P., and Serafini, M. (2012). Practical hardening of crash-tolerant systems. In USENIX Annual Technical Conference, pages 453–466.

Lamport, L. (1998). The part-time parliament. ACM Trans. Comput. Syst., 16(2):133–169.

Lamport, L. (2006). Fast Paxos. Distrib. Comput., 19(2):79–103.

Vieira, G. M. D. and Buzato, L. E. (2008). Treplica: Ubiquitous replication. In SBRC ’08: Proc. of the 26th Brazilian Symposium on Computer Networks and Distributed Systems, Rio de Janeiro, Brazil.

Vieira, G. M. D. and Buzato, L. E. (2010). Implementation of an object-oriented specification for active replication using consensus. Technical Report IC-10-26, Institute of Computing, University of Campinas.