Strategies for Fault-Tolerant Tightly-Coupled HPC Workloads Running on Low-Budget Spot Cloud Infrastructures

Vanderley Munhoz; Márcio Castro; Odorico Mendizabal

Vanderley Munhoz UFSC
Márcio Castro UFSC
Odorico Mendizabal UFSC

Abstract

Cloud providers can rent out their spare computing capacity at substantial discounts and reclaim it whenever a more profitable, higher-priority request arrives, a business model known as the spot infrastructure market. Users can achieve significant savings on cloud spending with spot machines, but at the cost of increased software complexity, given the fault tolerance requirements of this environment. Improvements in virtualization and network technology, combined with the development of key new software tools, may allow the HPC community to take effective advantage of cheap cloud resources and cut expensive maintenance costs. This study evaluates the viability of budget-constrained cloud environments for tightly-coupled MPI applications, exploring both spot and traditional low-budget infrastructures from real public cloud platforms. We propose and evaluate two fault tolerance strategies tailored for unreliable spot cloud environments: system-level rollback restart with Berkeley Lab Checkpoint/Restart (BLCR) and in-memory rollback restart with User-Level Failure Mitigation (ULFM). We also propose a provider-agnostic empirical method for testing and predicting the execution times of MPI workloads and the associated cloud infrastructure costs. We provide a detailed cost analysis and performance benchmark of a case-study application, with data gathered from experiments on spot machines from AWS and persistent machines from Vultr Cloud.
Our results show that: (i) adequate cluster sizing plays an important role in overall job execution performance and cost-effectiveness, regardless of the instance type selected; (ii) fault tolerance strategies based on BLCR may perform worse than those based on ULFM, but can still be cost-effective when software migration costs are considered; (iii) spot infrastructure does not guarantee cost savings, since the outcome depends on the chosen machine flavors and discounts: in some conditions, experiments with persistent low-budget options attained better cost-effectiveness.

Keywords: High-Performance Computing, Cloud Computing, Spot Instances, Fault Tolerance, MPI, Virtual Clusters