Fault Tolerance for Scientific Workflows Executed in Clouds Using Checkpoints

Leonardo de Jesus (UFF); Daniel de Oliveira (UFF); Lúcia Drummond (UFF)

doi:10.5753/wscad.2016.14263

Abstract

Scientific workflows are models composed of tasks, data, and dependencies whose purpose is to represent simulation-based scientific experiments. These experiments demand substantial computational resources, since they involve processing a large volume of data with several different software tools. Applying High-Performance Computing techniques to the implementation of scientific workflows therefore provides the support needed to run these experiments with better quality of service. Managing this whole process requires Scientific Workflow Management Systems (SWfMS). However, because High-Performance Computing environments involve a large number of heterogeneous resources working in parallel, the probability that one of them fails increases. SWfMS must therefore tolerate such failures. This work seeks to implement fault-tolerance techniques in these systems in order to increase their resilience and, consequently, to reduce total execution time and the associated costs.
