Overcoming Limits in GPU Multi-Partitioning
Abstract
Multipartitioning is a general-purpose parallel primitive that reorganizes input data into contiguous bins (or buckets), where the function responsible for categorizing each element into a bin is defined by the programmer. Multipartitioning on GPUs, although considered as a parallel primitive in itself, has received little attention in the literature so far. State-of-the-art implementations focus on scenarios with a reduced number of bins (up to 256), also imposing the restriction that this number must be a power of 2. This work aims to present efficient implementations of Multipartitioning on GPUs, exploring different optimized strategies for varying numbers of bins, allowing the algorithm to select, at runtime, the most suitable level of granularity. The algorithm, named mrBin, is capable of processing large numbers of bins (up to 12,288), without imposing restrictions on this number. In other words, more bins, fewer restrictions. The experiments demonstrate that the algorithm surpasses the state of the art for bin counts greater than 64, achieving speedups of up to 2.2x.References
Alcantara, D. A., Sharf, A., Abbasinejad, F., Sengupta, S., Mitzenmacher, M., Owens, J. D., and Amenta, N. (2009). Real-time parallel hashing on the GPU. In ACM SIGGRAPH asia 2009 papers, pages 1–9.
Ashkiani, S., Davidson, A., Meyer, U., and Owens, J. D. (2016). GPU multisplit. SIGPLAN Not., 51(8).
Ashkiani, S., Davidson, A., Meyer, U., and Owens, J. D. (2017). GPU Multisplit: an extended study of a parallel algorithm. ACM Transactions on Parallel Computing.
Choi, B., Komuravelli, R., Lu, V., Sung, H., Bocchino Jr, R. L., Adve, S. V., and Hart, J. C. (2010). Parallel SAH kD-tree construction. In High performance graphics, pages 77–86. Citeseer.
Cordeiro, M. B., Blanco, R. M., and Nunan Zola, W. M. (2025a). Algoritmo paralelo e distribuído para ordenação chave-valor. In Anais da XX Escola Regional de Banco de Dados, pages 133–136, Porto Alegre, RS, Brasil. SBC.
Cordeiro, M. B., Bluhm, G., and Nunan Zola, W. M. (2025b). Multiparticionamento flexível em processadores multicore. In Anais da XXV Escola Regional de Alto Desempenho da Região Sul, pages 169–170, Porto Alegre, RS, Brasil. SBC.
Cordeiro, M. B. and Nunan Zola, W. M. (2023). KNN paralelo em gpu para grandes volumes de dados com agregação de consultas. In Simpósio em Sistemas Computacionais de Alto Desempenho (SSCAD), pages 253–264. SBC.
Cordeiro, M. B. and Nunan Zola, W. M. (2024). Parallel k-means on GPU using warp-centric strategies. In 2024 IEEE 30th International Conference on Parallel and Distributed Systems (ICPADS), pages 720–727.
Cordeiro, M. B. and Nunan Zola, W. M. (2025). Multiparticionamento de dados em gpu. In Anais da XXV Escola Regional de Alto Desempenho da Região Sul, pages 165–166, Foz do Iguaçu, PR, Brasil. SBC.
Ferraz, S., Dias, V., Teixeira, C. H., Parthasarathy, S., Teodoro, G., and Meira, W. (2024). DuMato: An efficient warp-centric subgraph enumeration system for GPU. Journal of Parallel and Distributed Computing, 191:104903.
Gupta, K., Stuart, J. A., and Owens, J. D. (2012). A study of persistent threads style GPU programming for GPGPU workloads. In IEEE 2012 Innovative Parallel Computing (InPar). IEEE.
Jünger, D., Kobus, R., Müller, A., Hundt, C., Xu, K., Liu, W., and Schmidt, B. (2020). WarpCore: A library for fast hash tables on GPUs. In 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC).
Lessley, B. and Childs, H. (2020). Data-parallel hashing techniques for GPU architectures. IEEE Transactions on Parallel and Distributed Systems, 31(1):237–250.
Merrill, D. G. and Grimshaw, A. S. (2010). Revisiting sorting for GPGPU stream architectures. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pages 545–546.
Meyer, B., Pozo, A., and Nunan Zola, W. M. (2021). Warp-centric k-nearest neighbor graphs construction on GPU. In 50th International Conference on Parallel Processing Workshop, ICPP Workshops ’21, New York, NY, USA. Association for Computing Machinery.
Nunan Zola, W. M. and De Bona, L. C. E. (2012). Parallel speculative encryption of multiple AES contexts on GPUs. In 2012 Innovative Parallel Computing (InPar), pages 1–9. IEEE.
Reinders, J. (2007). Intel threading building blocks: outfitting C++ for multi-core processor parallelism. ”O’Reilly Media, Inc.”.
Stehle, E. and Jacobsen, H.-A. (2017). A memory bandwidth-efficient hybrid radix sort on GPUs. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, New York, NY, USA. Association for Computing Machinery.
Ashkiani, S., Davidson, A., Meyer, U., and Owens, J. D. (2016). GPU multisplit. SIGPLAN Not., 51(8).
Ashkiani, S., Davidson, A., Meyer, U., and Owens, J. D. (2017). GPU Multisplit: an extended study of a parallel algorithm. ACM Transactions on Parallel Computing.
Choi, B., Komuravelli, R., Lu, V., Sung, H., Bocchino Jr, R. L., Adve, S. V., and Hart, J. C. (2010). Parallel SAH kD-tree construction. In High performance graphics, pages 77–86. Citeseer.
Cordeiro, M. B., Blanco, R. M., and Nunan Zola, W. M. (2025a). Algoritmo paralelo e distribuído para ordenação chave-valor. In Anais da XX Escola Regional de Banco de Dados, pages 133–136, Porto Alegre, RS, Brasil. SBC.
Cordeiro, M. B., Bluhm, G., and Nunan Zola, W. M. (2025b). Multiparticionamento flexível em processadores multicore. In Anais da XXV Escola Regional de Alto Desempenho da Região Sul, pages 169–170, Porto Alegre, RS, Brasil. SBC.
Cordeiro, M. B. and Nunan Zola, W. M. (2023). KNN paralelo em gpu para grandes volumes de dados com agregação de consultas. In Simpósio em Sistemas Computacionais de Alto Desempenho (SSCAD), pages 253–264. SBC.
Cordeiro, M. B. and Nunan Zola, W. M. (2024). Parallel k-means on GPU using warp-centric strategies. In 2024 IEEE 30th International Conference on Parallel and Distributed Systems (ICPADS), pages 720–727.
Cordeiro, M. B. and Nunan Zola, W. M. (2025). Multiparticionamento de dados em gpu. In Anais da XXV Escola Regional de Alto Desempenho da Região Sul, pages 165–166, Foz do Iguaçu, PR, Brasil. SBC.
Ferraz, S., Dias, V., Teixeira, C. H., Parthasarathy, S., Teodoro, G., and Meira, W. (2024). DuMato: An efficient warp-centric subgraph enumeration system for GPU. Journal of Parallel and Distributed Computing, 191:104903.
Gupta, K., Stuart, J. A., and Owens, J. D. (2012). A study of persistent threads style GPU programming for GPGPU workloads. In IEEE 2012 Innovative Parallel Computing (InPar). IEEE.
Jünger, D., Kobus, R., Müller, A., Hundt, C., Xu, K., Liu, W., and Schmidt, B. (2020). WarpCore: A library for fast hash tables on GPUs. In 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC).
Lessley, B. and Childs, H. (2020). Data-parallel hashing techniques for GPU architectures. IEEE Transactions on Parallel and Distributed Systems, 31(1):237–250.
Merrill, D. G. and Grimshaw, A. S. (2010). Revisiting sorting for GPGPU stream architectures. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, pages 545–546.
Meyer, B., Pozo, A., and Nunan Zola, W. M. (2021). Warp-centric k-nearest neighbor graphs construction on GPU. In 50th International Conference on Parallel Processing Workshop, ICPP Workshops ’21, New York, NY, USA. Association for Computing Machinery.
Nunan Zola, W. M. and De Bona, L. C. E. (2012). Parallel speculative encryption of multiple AES contexts on GPUs. In 2012 Innovative Parallel Computing (InPar), pages 1–9. IEEE.
Reinders, J. (2007). Intel threading building blocks: outfitting C++ for multi-core processor parallelism. ”O’Reilly Media, Inc.”.
Stehle, E. and Jacobsen, H.-A. (2017). A memory bandwidth-efficient hybrid radix sort on GPUs. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, New York, NY, USA. Association for Computing Machinery.
Published
2025-10-28
How to Cite
CORDEIRO, Michel B.; ZOLA, Wagner M. Nunan.
Overcoming Limits in GPU Multi-Partitioning. In: BRAZILIAN SYMPOSIUM ON HIGH PERFORMANCE COMPUTING SYSTEMS (SSCAD), 26. , 2025, Bonito/MS.
Anais [...].
Porto Alegre: Sociedade Brasileira de Computação,
2025
.
p. 494-505.
DOI: https://doi.org/10.5753/sscad.2025.16780.
