HPC@Cloud: A Provider-Agnostic Software Framework for Enabling HPC in Public Cloud Platforms
The cloud computing paradigm has democratized access to compute infrastructure for millions of resource-constrained organizations, applying economies of scale to massively reduce infrastructure costs. In the High Performance Computing (HPC) context, the benefits of public cloud resources make them an attractive alternative to expensive on-premises clusters; however, their use also poses several challenges and limitations. In this paper, we present HPC@Cloud: a provider-agnostic software framework that comprises a set of key software tools to assist in the migration, testing, and execution of HPC applications in public clouds. HPC@Cloud allows the HPC community to benefit from readily available public cloud resources with minimal effort, and it features an empirical approach for estimating cloud infrastructure costs for HPC workloads. We also provide an experimental analysis of HPC@Cloud on two public clouds: Amazon AWS and Vultr Cloud.
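To make the idea of an empirical cost estimate concrete, the sketch below assumes a deliberately simple model: time a representative test run of the workload on each candidate configuration, then multiply the billed node-hours by that configuration's hourly price. The function names (`measure_runtime_s`, `estimate_cost`, `cheapest`), the per-second billing with a 60-second minimum, and the example prices are hypothetical illustrations, not HPC@Cloud's actual cost model or any provider's published rates.

```python
import math
import time
from typing import Callable, Dict


def measure_runtime_s(run: Callable[[], None]) -> float:
    """Return the wall-clock time, in seconds, of one test execution."""
    start = time.perf_counter()
    run()
    return time.perf_counter() - start


def estimate_cost(runtime_s: float, nodes: int, price_per_node_hour: float,
                  min_billed_s: int = 60) -> float:
    """Hypothetical cost model: nodes x billed hours x hourly price.

    Assumes per-second billing rounded up, with a minimum charge per node.
    This billing rule is an illustrative assumption, not a specific
    provider's policy.
    """
    billed_s = max(math.ceil(runtime_s), min_billed_s)
    return nodes * (billed_s / 3600.0) * price_per_node_hour


def cheapest(runtimes_s: Dict[str, float], prices: Dict[str, float],
             nodes: int) -> str:
    """Pick the configuration with the lowest estimated cost.

    Keys are hypothetical configuration labels; each runtime should come
    from an empirical test run on that configuration.
    """
    return min(runtimes_s,
               key=lambda k: estimate_cost(runtimes_s[k], nodes, prices[k]))


if __name__ == "__main__":
    # Hypothetical measurements: a faster but pricier configuration "b"
    # can still be cheaper overall if it finishes sooner.
    runtimes = {"a": 3600.0, "b": 1800.0}
    prices = {"a": 1.00, "b": 1.50}
    print(cheapest(runtimes, prices, nodes=2))
```

With the hypothetical numbers above, configuration "a" costs 2 × 1 h × 1.00 = 2.00 and "b" costs 2 × 0.5 h × 1.50 = 1.50, so the script prints `b`. The point of timing a real test run, rather than relying on hourly price alone, is that a more expensive instance type can yield a lower total cost when it shortens the execution.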