Implementing a Distributed Execution Service for a Grid Broker
Abstract
Grid middleware such as OurGrid offers solutions for executing parallel tasks on a grid system. In such systems, users submit their applications for execution through a client broker. MyGrid is the client broker used by the OurGrid system; it is in charge of managing the task executions that a user has submitted. Although the broker is able to detect task failures and reschedule the failed tasks, MyGrid itself constitutes a single point of failure from the user's perspective. If it fails, all knowledge of task executions is lost. Moreover, MyGrid is also a bottleneck, since hundreds, or even thousands, of executions could potentially be spawned by an application and need to be managed at the same time by a single broker. In this paper we present the design and implementation of a fault-tolerant distributed execution service that allows for load balancing and improves MyGrid performance. A checkpointing mechanism is used to ease the implementation of the service and to further increase system reliability.
References
Brito A. and Brasileiro F. (2004) “Programando um Subsistema Síncrono para Suporte a Mecanismos Eficientes de Tolerância a Falhas”. Workshop de Tolerância a Falhas / Simpósio Brasileiro de Redes de Computadores
Cirne W., Paranhos D., Costa L., Santos-Neto E., Brasileiro F., Sauvé J., Silva F. A. B., Barros C. O. and Silveira C. (2003) “Running Bag-of-Tasks Applications on Computational Grids: The MyGrid Approach”, Proceedings of the ICPP'2003 - International Conference on Parallel Processing
Cirne W., Brasileiro F., Andrade N. A., Costa L., Andrade A., Novaes R. and Mowbray M. (2006) “Labs of the World, Unite!!!”, Journal of Grid Computing
Foster I., Kesselman C., Tuecke S. (2001) “The Anatomy of the Grid: Enabling Scalable Virtual Organizations.” International J. Supercomputer Applications
Foster I., Iamnitchi A. (2003) “On Death, Taxes, and the Convergence of Peer-to-Peer and Grid Computing”, Lecture Notes in Computer Science, Springer
Goux J., Linderoth J., and Yoder M. (1999) “Metacomputing and the Master-Worker Paradigm”
Goux J., Kulkarni S., Linderoth J., and Yoder M. (2000) “An enabling framework for master-worker applications on the computational grid.” Submitted to HPDC 2000 Conference Proceedings
Larrea M., Fernández A. and Arévalo S. (2001) “On the Impossibility of Implementing Perpetual Failure Detectors in Partially Synchronous Systems.” Brief Announcements, 15th Int'l Symp. Distributed Computing
Lawall J. L. and Muller G. (1999) “Efficient Incremental Checkpointing of Java Programs.” Proceedings of the International Conference on Dependable Systems and Networks
OurGrid Team. (2006) “OurGrid Website and Documentation”. http://www.ourgrid.org
Silva H. and Chiao C. M. (2004) “Obtenção de Tolerância a Falhas na Ferramenta de Computação MyGrid”. Escola Regional de Redes de Computadores
Silva F. A., Jansch-Pôrto I. and Lisboa M. L. (2002) “Recuperação com base em checkpointing: Uma abordagem orientada a objetos”. Workshop de Tolerância a Falhas / SBRC
Published
29/05/2006
How to Cite
FIGUEIREDO, Flavio V. D. de; BRASILEIRO, Francisco V.; BRITO, Andrey E. M. Implementing a Distributed Execution Service for a Grid Broker. In: WORKSHOP DE TESTES E TOLERÂNCIA A FALHAS (WTF), 7., 2006, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2006. p. 99-110. ISSN 2595-2684. DOI: https://doi.org/10.5753/wtf.2006.23355.