Impact of Salvage Cost and Configuration Attributes on Apache Hadoop Checkpoint
Abstract
The Apache Hadoop framework, which is used to store and process large amounts of data, relies on the Checkpoint and Recovery technique to recover its distributed file system after failures. However, adapting the interval between Hadoop checkpoints efficiently depends on accurate observations of the system. The purpose of this paper is to estimate, from a history of observations, the cost of performing checkpoints and the mean time between system failures. Both factors are observed and analyzed across different framework configurations and benchmarks.
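As an illustration of the kind of estimation described above, the following minimal sketch derives a mean checkpoint cost and a mean time between failures from recorded observations. It is not the procedure used in this paper: the function names, the sample values, and the use of simple arithmetic means are all assumptions made for the example. The last step applies Young's well-known first-order approximation, interval = sqrt(2 × cost × MTBF), only to show how the two estimates could feed an interval calculation.

```python
import math
from statistics import mean


def estimate_checkpoint_cost(durations_s):
    """Mean of the observed checkpoint durations, in seconds (hypothetical helper)."""
    if not durations_s:
        raise ValueError("need at least one checkpoint observation")
    return mean(durations_s)


def estimate_mtbf(failure_times_s):
    """Mean gap between consecutive failure timestamps, in seconds (hypothetical helper)."""
    if len(failure_times_s) < 2:
        raise ValueError("need at least two failure timestamps")
    ordered = sorted(failure_times_s)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return mean(gaps)


def young_interval(cost_s, mtbf_s):
    """Young's first-order approximation of the checkpoint interval."""
    return math.sqrt(2.0 * cost_s * mtbf_s)


# Made-up observation history, for illustration only.
checkpoint_durations = [10.0, 14.0, 12.0]           # seconds per checkpoint
failure_timestamps = [0.0, 500.0, 1300.0, 1800.0]   # seconds since start

cost = estimate_checkpoint_cost(checkpoint_durations)
mtbf = estimate_mtbf(failure_timestamps)
interval = young_interval(cost, mtbf)

print(f"cost={cost:.1f}s mtbf={mtbf:.1f}s interval={interval:.1f}s")
```

With these made-up values the estimated cost is 12 s and the estimated mean time between failures is 600 s, which gives an interval of 120 s. A real history would be noisier, so a windowed or weighted mean might be a better fit than the plain mean used here.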
