Hok: Higher-Order GPU kernels in Elixir

André Rauber Du Bois; Tiago Perlin; Frederico Peixoto Antunes; Gerson Cavalheiro

doi:10.5753/sblp.2024.3690

André Rauber Du Bois UFPel
Tiago Perlin UFPel
Frederico Peixoto Antunes UFPel
Gerson Cavalheiro UFPel

Abstract

GPUs (Graphics Processing Units) are usually programmed in low-level languages such as CUDA or OpenCL. Although these languages allow highly optimized software to be written, they are difficult to use because programmers must mix coordination code, i.e., how tasks are created and distributed, with the actual computation code. In this paper we present Hok, an extension to the Elixir functional language that allows the implementation of higher-order GPU kernels, giving programmers the ability to clearly separate coordination from computation. The Hok system provides a DSL (Domain-Specific Language) for writing low-level GPU kernels that can be parameterized with the computation code. Hok allows device functions, including anonymous functions, to be created and referenced in the host code so that they can configure a kernel before it is launched. We demonstrate that Hok can be used to implement high-level abstractions such as algorithmic skeletons and array comprehensions. We also present experiments that demonstrate the usability of the current implementation of Hok and show that high speedups can be obtained in comparison to pure Elixir, especially in computationally intensive programs with large inputs.
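The separation of coordination from computation described above can be illustrated with a minimal sketch. It is written in plain Python and runs on the CPU; it is not Hok syntax and does not use the Hok API. All names here (`map_kernel`, `launch`, `gpu_map`) are hypothetical and chosen only for illustration. The "kernel" holds the coordination logic (index calculation and bounds check), while the computation arrives as a function parameter, which may be anonymous.

```python
from typing import Callable, List


def map_kernel(src: List[float], dst: List[float], n: int,
               f: Callable[[float], float],
               block_idx: int, block_dim: int, thread_idx: int) -> None:
    """Coordination code: one simulated GPU thread handles one element."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < n:                               # guard threads past the end
        dst[i] = f(src[i])                  # computation code is a parameter


def launch(kernel: Callable[..., None], grid_dim: int, block_dim: int,
           *args) -> None:
    """Simulate a kernel launch by running every (block, thread) pair in turn."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(*args, block_idx, block_dim, thread_idx)


def gpu_map(f: Callable[[float], float], xs: List[float],
            block_dim: int = 4) -> List[float]:
    """A map skeleton built on the higher-order kernel (hypothetical helper)."""
    n = len(xs)
    out = [0.0] * n
    grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n
    launch(map_kernel, grid_dim, block_dim, xs, out, n, f)
    return out


# The same kernel is configured with an anonymous function before "launch".
result = gpu_map(lambda x: 2.0 * x + 1.0, [0.0, 1.0, 2.0, 3.0, 4.0])
print(result)
```

Swapping the lambda for another function changes the computation without touching the indexing logic, which is the property the abstract attributes to higher-order kernels.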

Keywords: parallel programming, GPU, actor model, Elixir