A Fault Tolerant Scheduling Model for Directed Acyclic Graphs in Cloud
Abstract
Many High Performance Computing (HPC) and resource-intensive applications have been tested and migrated to the Cloud. These applications may have large input data sizes, which often correlate strongly with execution performance and time. Migration to the Cloud demands adaptation of the fault tolerance (FT) and scheduling approaches. Although these topics are closely connected, they are often treated separately. This work proposes a novel integrated scheduling and FT model that takes into account the characteristics of the tasks and the target execution nodes. Preliminary results indicate good potential to improve the system reliability and execution makespan of scientific workflows.
Keywords:
Fault Tolerance, Scheduling, Directed Acyclic Graphs, Cloud Computing, High Performance Computing
Published
2020-08-19
How to Cite
ROSSO, Pedro Henrique Di Francia; FRANCESQUINI, Emilio. A Fault Tolerant Scheduling Model for Directed Acyclic Graphs in Cloud. In: REGIONAL SCHOOL OF HIGH PERFORMANCE COMPUTING FROM SÃO PAULO (ERAD-SP), 11., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 46-49. DOI: https://doi.org/10.5753/eradsp.2020.16883.
