Kernel Fusion of Parallel Skeletons for GPU using Metaprogramming

  • João Antonio Soares UFPel
  • André Rauber Du Bois UFPel
  • Gerson Geraldo H. Cavalheiro UFPel

Abstract


Graphics Processing Units (GPUs) are the current generation of accelerators for high-performance computing (HPC). However, developing efficient applications that fully utilize GPU resources is a time-consuming task. Algorithmic skeletons simplify development by providing reusable patterns, but they often result in many independent kernel launches that are bound by memory bandwidth and could otherwise be fused to improve performance. In this work, we developed a fusion module that enables programmers to write composable GPU kernels that are efficiently merged. Our approach is applied to PolyHok, a domain-specific language (DSL) that exposes metaprogramming features, enabling kernel fusion as high-level code transformations. A practical benefit of this solution is that it does not require modifying any existing compiler. We provide an open-source prototype implementation as well as an experimental evaluation of the feasibility of fusing skeletons.
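The idea of fusing skeletons can be illustrated with the well-known map fusion identity, map f (map g xs) = map (f ∘ g) xs. The following is a minimal CPU-side sketch in Python, not PolyHok code and not the authors' fusion module: `map_skeleton`, `compose`, `scale`, and `shift` are hypothetical names chosen for illustration, and each call to `map_skeleton` merely stands in for one GPU kernel launch.

```python
def map_skeleton(f, xs):
    """Stand-in for one kernel launch: read each input once, write each output once."""
    return [f(x) for x in xs]


def compose(f, g):
    """Build the fused per-element function x -> f(g(x))."""
    return lambda x: f(g(x))


def scale(x):
    return 2.0 * x


def shift(x):
    return x + 1.0


data = [0.0, 1.0, 2.0, 3.0]

# Unfused: two passes over the data. The intermediate list plays the role
# of a temporary buffer that would live in GPU global memory.
tmp = map_skeleton(scale, data)
unfused = map_skeleton(shift, tmp)

# Fused: a single pass applies both functions per element, so the
# intermediate buffer is never written or read back.
fused = map_skeleton(compose(shift, scale), data)
```

Both versions compute the same result. In this sketch the fused form avoids one full write and one full read of the intermediate array, which is the kind of memory traffic that dominates in bandwidth-bound kernels.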

Published
06/05/2026
SOARES, João Antonio; DU BOIS, André Rauber; CAVALHEIRO, Gerson Geraldo H. Kernel Fusion of Parallel Skeletons for GPU using Metaprogramming. In: ESCOLA REGIONAL DE ALTO DESEMPENHO DA REGIÃO SUL (ERAD-RS), 26., 2026, Bagé/RS. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 165-168. ISSN 2595-4164. DOI: https://doi.org/10.5753/eradrs.2026.20445.