Reusable TLA+ Communication Primitives for Modeling and Verifying Distributed Systems

Abstract


Challenges associated with developing distributed systems go beyond understanding business requirements, specific protocols or technologies. Modular design, inherent concurrency, and the occurrence of faults are some of the main difficulties when designing a correct and reliable distributed system. Traditional testing and debugging methods are often insufficient for distributed systems due to the complexity of reproducing rare but critical failure scenarios. Alternatively, verification techniques, such as model checking, can address these challenges by enabling the formal verification of system properties. By representing a system model and expressing temporal logic formulas, designers can identify potential safety and liveness violations. This paper introduces a modular and reusable TLA+ library for modeling communication primitives over perfect, fair-loss, and stubborn links, allowing system designers to describe and verify their solutions using these primitives as the foundation of their communication subsystem. In addition, it enables fault injection, facilitating the analysis of system behavior under unreliable communication conditions, including message drop, duplication, and out-of-order delivery. To illustrate its utility, we implemented a simple protocol and demonstrated how the assumptions and guarantees provided by the underlying communication affect correctness, even in fault-prone scenarios.
Keywords: Distributed Algorithms, Model Verification, Fault-Tolerance

References

Baier, C. and Katoen, J.-P. (2008). Principles of Model Checking. The MIT Press.

Braithwaite, S., Buchman, E., Konnov, I. V., Milosevic, Z., Stoilkovska, I., Widder, J., and Zamfir, A. (2020). Formal specification and model checking of the tendermint blockchain synchronization protocol (short paper). In FMBC@CAV.

Cachin, C., Guerraoui, R., and Rodrigues, L. (2011). Introduction to Reliable and Secure Distributed Programming (2nd ed.). Springer.

dmilstein (2019). Channels: TLA+ modules for modeling message-passing with different guarantees. Available at [link], accessed: 2025-01-28.

Dotti, F. L., Mendizabal, O. M., and dos Santos, O. M. (2005). Verifying fault-tolerant distributed systems using object-based graph grammars. In Dependable Computing, pages 80–100. Springer Berlin Heidelberg.

Gärtner, F. C. (1999). Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys, 31(1):1–26.

Grundmann, M. and Hartenstein, H. (2023). Towards a formal verification of the lightning network with TLA+. arXiv:2307.02342 [cs.LO].

House, A. and Tang, P. (2018). A TLA+ module for asynchronous message-passing systems. In SoutheastCon 2018, pages 1–7.

Kolb, J., Yang, J., Katz, R. H., and Culler, D. E. (2020). Quartz: A framework for engineering secure smart contracts. Technical Report UCB/EECS-2020-178, EECS Department, University of California, Berkeley.

Lamport, L. (1994). The temporal logic of actions. ACM Transactions on Programming Languages and Systems, 16(3):872–923.

Lamport, L. (2009). The PlusCal algorithm language. In Leucker, M. and Morgan, C. (Eds.), Theoretical Aspects of Computing - ICTAC 2009, pages 36–60. Springer Berlin Heidelberg.

Lu, T., Merz, S., and Weidenbach, C. (2011). Towards verification of the Pastry protocol using TLA+. In Lecture Notes in Computer Science, volume 6722, pages 244–258. Springer.

Salamah, S., Gates, A., Roach, S., and Mondragon, O. (2005). Verifying pattern-generated LTL formulas: A case study. In Model Checking Software, Proceedings of the 12th International SPIN Workshop, volume 3639 of Lecture Notes in Computer Science, pages 200–220. Springer.

Saltzer, J. H., Reed, D. P., and Clark, D. D. (1984). End-to-end arguments in system design. ACM Transactions on Computer Systems, 2(4):277–288.

Schneider, F. B. (1993). What good are models and what models are good? In Distributed Systems, pages 17–26. ACM Press/Addison-Wesley Publishing Co., USA, 2nd ed.

Yin, J.-Q., Zhu, H.-B., and Fei, Y. (2020). Specification and verification of the ZAB protocol with TLA+. Journal of Computer Science and Technology, 35:1312–1323.
Published
2025-05-19
PEIXOTO, Diogo Canut Freitas; MENDIZABAL, Odorico Machado. Reusable TLA+ Communication Primitives for Modeling and Verifying Distributed Systems. In: FAULT TOLERANCE WORKSHOP (WTF), 25. , 2025, Natal/RN. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025 . p. 113-125. ISSN 2595-2684. DOI: https://doi.org/10.5753/wtf.2025.8866.