Impact of Salvage Cost and Configuration Attributes on Apache Hadoop Checkpoint

  • Paulo V. M. Cardoso, Universidade Federal de Santa Maria
  • Patricia Pitthan Barcelos, Universidade Federal de Santa Maria

Abstract


The Apache Hadoop framework, used to process and store large volumes of data, relies on the Checkpoint and Recovery technique to support failure recovery in its distributed file system. However, efficiently adapting the time between Hadoop checkpoints depends on accurate observations of the system. This paper estimates the cost of performing checkpoints and the mean time between system failures from a history of observations. These factors are observed and analyzed under different framework configurations and benchmarks.
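The two quantities the paper estimates, checkpoint cost and mean time between failures (MTBF), are exactly the inputs of the classic first-order approximation of the optimal checkpoint interval (Young, 1974). As a minimal sketch of how such estimates translate into an interval, assuming a hypothetical checkpoint cost of 30 s and an MTBF of 6 h:

```python
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint
    interval: T_opt = sqrt(2 * C * MTBF), where C is the time taken
    to write one checkpoint and MTBF is the mean time between failures.
    Both inputs and the result are in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Hypothetical numbers for illustration only (not from the paper):
# a 30 s checkpoint cost and a 6 h (21600 s) MTBF.
t_opt = young_interval(30.0, 21600.0)
print(round(t_opt))  # -> 1138 (checkpoint roughly every 19 minutes)
```

Higher-order refinements of this estimate (e.g. Daly, 2006) follow the same pattern: better measurements of the cost and the MTBF yield a better interval, which is why accurate system observation matters.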

Keywords: Distributed Systems, Fault Tolerance, Checkpoints

Published
2019-05-06
CARDOSO, Paulo V. M.; PITTHAN BARCELOS, Patricia. Impact of Salvage Cost and Configuration Attributes on Apache Hadoop Checkpoint. In: BRAZILIAN SYMPOSIUM ON COMPUTER NETWORKS AND DISTRIBUTED SYSTEMS (SBRC), 37., 2019, Gramado. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 529-542. ISSN 2177-9384. DOI: https://doi.org/10.5753/sbrc.2019.7384.