Validation of Policies for Dynamic Establishment of Checkpoints in Apache Spark

  • Paulo Vinicus Cardoso, Federal University of Santa Maria
  • Rhauani Weber Aita Fazul, Federal University of Santa Maria, http://orcid.org/0000-0003-0705-9833
  • Patrícia Pitthan Barcelos, Federal University of Santa Maria

Abstract


Apache Spark is a platform for in-memory distributed data processing. For reliable, fault-tolerant persistence, it relies on checkpointing. Checkpoints in Spark, however, must be established manually in the application source code, which makes an efficient configuration difficult to achieve. This paper presents and validates an architecture for the dynamic configuration of checkpoints in Spark. The proposed architecture triggers checkpoint procedures automatically, based on monitoring policies that observe both the system and the running applications. The evaluation results show that suitable dynamic policies can increase Spark's reliability without compromising its performance.
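To make the idea of policy-driven checkpointing concrete, the following is a minimal sketch in plain Python, not the architecture validated in the paper. The metric fields, the three policies, and the `CheckpointMonitor` class are all hypothetical names chosen for illustration. The checkpoint action is passed in as a callback; in a PySpark application it could, for example, call `rdd.checkpoint()` after a checkpoint directory has been set with `SparkContext.setCheckpointDir`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Metrics:
    """Snapshot of observed state. All fields are hypothetical examples."""
    seconds_since_checkpoint: float
    lineage_depth: int
    recent_task_failures: int


# A policy maps a metrics snapshot to a yes/no checkpoint decision.
Policy = Callable[[Metrics], bool]


def interval_policy(max_seconds: float) -> Policy:
    """Checkpoint when too much time has passed since the last one."""
    return lambda m: m.seconds_since_checkpoint >= max_seconds


def lineage_policy(max_depth: int) -> Policy:
    """Checkpoint when the recomputation chain has grown too long."""
    return lambda m: m.lineage_depth >= max_depth


def failure_policy(max_failures: int) -> Policy:
    """Checkpoint when the environment looks unstable."""
    return lambda m: m.recent_task_failures >= max_failures


class CheckpointMonitor:
    """Evaluates policies on each observation and triggers the callback."""

    def __init__(self, policies: Iterable[Policy],
                 do_checkpoint: Callable[[], None]) -> None:
        self.policies = list(policies)
        self.do_checkpoint = do_checkpoint  # assumed to wrap the real checkpoint call
        self.triggered = 0

    def observe(self, metrics: Metrics) -> bool:
        """Return True if any policy fired and a checkpoint was requested."""
        if any(policy(metrics) for policy in self.policies):
            self.do_checkpoint()
            self.triggered += 1
            return True
        return False
```

Keeping each policy as an independent predicate means new monitoring criteria can be added without changing the monitor itself, which is one plausible way to decouple checkpoint decisions from application source code.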

Keywords: fault tolerance, checkpoints, monitoring policies, dynamic architecture

Published
2020-12-07
CARDOSO, Paulo Vinicus; FAZUL, Rhauani Weber Aita; BARCELOS, Patrícia Pitthan. Validation of Policies for Dynamic Establishment of Checkpoints in Apache Spark. In: BRAZILIAN SYMPOSIUM ON COMPUTER NETWORKS AND DISTRIBUTED SYSTEMS (SBRC), 38., 2020, Rio de Janeiro. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 29-42. ISSN 2177-9384. DOI: https://doi.org/10.5753/sbrc.2020.12271.