An approach to minimize useless checkpoints on distributed optimistic simulations
Abstract
Distributed architectures for modeling and simulation can scale the execution of large and complex models. These architectures frequently use checkpoint strategies to guarantee the execution of synchronous and asynchronous components. However, completely avoiding useless checkpoints is impractical, and such checkpoints can severely degrade simulation performance. In this paper, we present a set of metrics to identify useless checkpoints at run-time. Additionally, we extend a probabilistic decision strategy to use the proposed metrics, so that it creates only checkpoints with a high probability of being loaded by rollback operations. The method identifies inconsistent checkpoints based on the communication patterns and the granularity of the events since the last rollback. The results show that the proposed metrics reduce the number of useless checkpoints without negative impacts on simulation performance, and that the method outperforms traditional probabilistic strategies in terms of rollback time.
