Improving an MPI Application-Level Migration Approach through Checkpoint File Splitting

  • Mónica Rodríguez University of A Coruna
  • Iván Cores University of A Coruna
  • Patricia González University of A Coruna
  • María J. Martín University of A Coruna

Abstract


Traditionally used for load balancing, process migration has been gaining popularity in the fault tolerance context. Recently, checkpoint-based migration has been proposed to implement failure avoidance in MPI applications through the proactive migration of processes when impending failures are notified. However, the main drawback of checkpoint-based migration in these scenarios is its high I/O cost, which can render the approach unfeasible if the migration operation is not completed before the failure occurs. To overcome this issue, this work proposes splitting the checkpoint files of an application-level migration approach into multiple smaller files so that the different phases of the migration operation can overlap: checkpoint file writing in the terminating process, data transfer through the network, and state file read and restart operations in the newly spawned processes. The proposal has been tested using the MPI NAS Parallel Benchmarks. The experimental results show a significant reduction in the migration time.
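
The overlap described in the abstract can be pictured as a three-stage pipeline. The following is a minimal sketch in Python under stated assumptions, not the authors' implementation: threads and in-memory queues stand in for checkpoint file writing, network transfer, and restart, and every name (`split_state`, `stage`, `migrate`, the chunk size) is hypothetical.

```python
import queue
import threading


def split_state(state, chunk_size):
    """Split one serialized checkpoint into fixed-size chunks (assumed layout)."""
    return [state[i:i + chunk_size] for i in range(0, len(state), chunk_size)]


def stage(process, inbox, outbox):
    """Run one pipeline stage until the None sentinel arrives."""
    while True:
        item = inbox.get()
        if item is None:
            # Forward the sentinel so the next stage also terminates.
            if outbox is not None:
                outbox.put(None)
            return
        result = process(item)
        if outbox is not None:
            outbox.put(result)


def migrate(state, chunk_size):
    """Move `state` chunk by chunk so that write, transfer and restore overlap."""
    to_transfer = queue.Queue()
    to_restore = queue.Queue()
    restored = []

    # Identity function as a stand-in for sending a chunk over the network.
    transfer = threading.Thread(
        target=stage, args=(lambda chunk: chunk, to_transfer, to_restore))
    # Appending to a list as a stand-in for reading a chunk in the new process.
    restore = threading.Thread(
        target=stage, args=(restored.append, to_restore, None))
    transfer.start()
    restore.start()

    # "Writing" stage: each chunk is handed on as soon as it is produced,
    # instead of waiting for the whole checkpoint to be written.
    for chunk in split_state(state, chunk_size):
        to_transfer.put(chunk)
    to_transfer.put(None)

    transfer.join()
    restore.join()
    return b"".join(restored)
```

Because each stage has a single worker and the queues are FIFO, chunks arrive in the order they were written, so the restored state can be reassembled by simple concatenation. With a single large file, each phase would instead have to wait for the previous one to finish completely.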
Keywords: Proposals, Writing, Checkpointing, Computer architecture, Fault tolerance, Fault tolerant systems, Benchmark testing
Published
22/10/2014
RODRÍGUEZ, Mónica; CORES, Iván; GONZÁLEZ, Patricia; MARTÍN, María J. Improving an MPI Application-Level Migration Approach through Checkpoint File Splitting. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 26., 2014, Paris/FR. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2014. p. 33-40.