Implementing a Distributed Execution Service for a Grid Broker

Flavio V. D. de Figueiredo; Francisco V. Brasileiro; Andrey E. M. Brito

doi:10.5753/wtf.2006.23355

Flavio V. D. de Figueiredo UFCG
Francisco V. Brasileiro UFCG
Andrey E. M. Brito UFCG


Abstract

Grid middleware such as OurGrid offers solutions for executing parallel tasks on a grid system. In such systems, users submit their applications for execution through a client broker. MyGrid is the client broker of the OurGrid system; it manages the task executions that a user has submitted. Although the broker can detect failed tasks and reschedule them, MyGrid itself constitutes a single point of failure from the user's perspective: if it fails, all knowledge of task executions is lost. Moreover, MyGrid is also a bottleneck, since an application could potentially spawn hundreds, or even thousands, of executions that must all be managed at the same time by a single broker. In this paper we present the design and implementation of a fault-tolerant distributed execution service that allows for load balancing and improves MyGrid performance. A checkpointing mechanism is used to ease the implementation of the service and to further increase system reliability.
