Cloud-based parallel computing across multiple clusters in Julia

Abstract


IaaS cloud providers have become a de facto alternative to on-premises HPC platforms. They offer a rich catalog of virtual machine instances with high-end processors and accelerators, connected through advanced network technology, which makes it possible to build cluster computing platforms that rival their on-premises counterparts. This paper presents an Infrastructure as Code (IaC) approach for building parallel computing systems in Julia, covering both hardware and software elements, to benefit users of HPC applications written in dynamic programming languages.

Published
30/09/2024
CARVALHO JUNIOR, Francisco H. de; ALENCAR, João Marcelo Uchôa de; SALES, Claro Henrique Silva. Cloud-based parallel computing across multiple clusters in Julia. In: SIMPÓSIO BRASILEIRO DE LINGUAGENS DE PROGRAMAÇÃO (SBLP), 28., 2024, Curitiba/PR. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 44-52.