Validation of Policies for Dynamic Establishment of Checkpoints in Apache Spark
Abstract
Apache Spark is a platform for in-memory distributed data processing. For reliable, fault-tolerant persistence, it relies on checkpointing. In Spark, however, checkpoints must be set manually in the application source code, which makes an efficient configuration hard to achieve. This paper presents and validates an architecture for dynamic checkpoint configuration in Spark. The proposed architecture triggers checkpoints automatically, based on policies that monitor the system and the running applications. The evaluation results show that suitable dynamic policies can increase Spark's reliability without compromising its performance.
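To make the idea of policy-driven checkpointing concrete, the following is a minimal sketch in plain Python and not the architecture evaluated in the paper. The names `Metrics`, `every_n_iterations`, `every_t_seconds`, `CheckpointMonitor`, and `simulate` are hypothetical and introduced only for illustration. The sketch assumes a monitor that is notified after each iteration of a job and that fires a checkpoint callback when any policy's threshold is reached. In a real PySpark application, that callback could wrap `rdd.checkpoint()` after a directory has been set with `SparkContext.setCheckpointDir(...)`.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Metrics:
    """Hypothetical snapshot of what a monitor might observe."""
    iterations_since_checkpoint: int
    seconds_since_checkpoint: float


# A policy maps observed metrics to a yes/no checkpoint decision.
Policy = Callable[[Metrics], bool]


def every_n_iterations(n: int) -> Policy:
    """Fire once at least n iterations have run since the last checkpoint."""
    return lambda m: m.iterations_since_checkpoint >= n


def every_t_seconds(t: float) -> Policy:
    """Fire once at least t seconds of work have accumulated."""
    return lambda m: m.seconds_since_checkpoint >= t


class CheckpointMonitor:
    """Illustrative monitor: checkpoints when any policy fires."""

    def __init__(self, policies: List[Policy], do_checkpoint: Callable[[], None]):
        self.policies = policies
        self.do_checkpoint = do_checkpoint  # assumed to wrap rdd.checkpoint()
        self.iterations = 0
        self.elapsed = 0.0

    def observe(self, step_seconds: float) -> bool:
        """Record one finished iteration; return True if a checkpoint was taken."""
        self.iterations += 1
        self.elapsed += step_seconds
        metrics = Metrics(self.iterations, self.elapsed)
        if any(policy(metrics) for policy in self.policies):
            self.do_checkpoint()
            # Reset the counters so the policies measure from this checkpoint.
            self.iterations = 0
            self.elapsed = 0.0
            return True
        return False


def simulate(step_times: List[float], policies: List[Policy]) -> List[int]:
    """Return the indices of the iterations after which a checkpoint fired."""
    monitor = CheckpointMonitor(policies, do_checkpoint=lambda: None)
    return [i for i, t in enumerate(step_times) if monitor.observe(t)]
```

Calling `simulate([1.0] * 6, [every_n_iterations(3)])` reports a checkpoint after every third iteration. Keeping policies as small, composable functions is one possible design choice: thresholds can be changed at run time without touching the application code, which is the property the abstract attributes to dynamic configuration.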
