Impact of Saving Cost and Configuration Attributes on Apache Hadoop Checkpointing

Paulo V. M. Cardoso; Patricia Pitthan Barcelos

doi:10.5753/sbrc.2019.7384

Paulo V. M. Cardoso, Universidade Federal de Santa Maria
Patricia Pitthan Barcelos, Universidade Federal de Santa Maria

Abstract

The Apache Hadoop framework, used to process and store large volumes of data, relies on the Checkpoint and Recovery technique to support post-failure recovery of its distributed file system. However, efficiently adapting the interval between Hadoop checkpoints depends on accurate observations of the system. This work aims to estimate the cost of performing checkpoints and the system's mean time between failures from a history of observations. Both factors are observed and analyzed under different configurations of the framework and of the benchmark used.
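To make the two estimated quantities concrete, the following minimal sketch shows one way a checkpoint cost and a mean time between failures (MTBF) could be derived from a history of observations and then combined into a checkpoint interval. It is an illustration under stated assumptions, not the authors' implementation: the function names and the sample history are hypothetical, both estimates are plain arithmetic means, and the interval uses Young's well-known first-order approximation, sqrt(2 × cost × MTBF), which the abstract does not confirm as the method adopted in the paper.

```python
import math
from statistics import mean


def estimate_checkpoint_cost(costs_s):
    """Estimate the checkpoint cost as the mean of observed durations (seconds)."""
    return mean(costs_s)


def estimate_mtbf(failure_times_s):
    """Estimate MTBF as the mean gap between consecutive failure timestamps (seconds)."""
    gaps = [later - earlier
            for earlier, later in zip(failure_times_s, failure_times_s[1:])]
    if not gaps:
        raise ValueError("need at least two failure timestamps")
    return mean(gaps)


def young_interval(cost_s, mtbf_s):
    """Young's first-order approximation of the checkpoint interval (seconds)."""
    return math.sqrt(2.0 * cost_s * mtbf_s)


# Hypothetical observation history, for illustration only.
observed_costs = [4.0, 5.0, 6.0]                   # seconds per checkpoint
observed_failures = [0.0, 3600.0, 7200.0, 10800.0]  # failure timestamps (s)

cost = estimate_checkpoint_cost(observed_costs)
mtbf = estimate_mtbf(observed_failures)
interval = young_interval(cost, mtbf)
print(f"cost={cost:.1f}s mtbf={mtbf:.1f}s interval={interval:.1f}s")
```

In an HDFS deployment, a value like `interval` could plausibly be applied through the `dfs.namenode.checkpoint.period` property (expressed in seconds), although how the paper's mechanism actually adjusts the period is not described in this abstract.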

Keywords: Distributed Systems, Fault Tolerance, Checkpoints
