Can GPUs help scaling traditional Apache Spark workloads?

  • Moisés Felipe Lehnen UFRGS
  • Lucas Mello Schnorr UFRGS
  • Philippe Olivier Alexandre Navaux UFRGS

Abstract


Apache Spark is an open-source framework for in-memory distributed computing that has proven very efficient for horizontally scaling data and Machine Learning pipelines. Spark outperforms Hadoop MapReduce by relying on host memory, rather than on storage as Hadoop does, to run workload computation. However, distributed systems generally have some drawbacks. A common drawback of Apache Spark is the shuffle operation, which happens when wide transformations are applied to a DataFrame and is a source of numerous performance issues. In this work, we dive into the Apache Spark shuffle operation. We (1) explain the factors that impact shuffle execution; (2) explore different settings to verify their impact on the latency and size of the shuffle data; and (3) evaluate the power of GPUs to help traditional ETL workloads, in our case specifically to optimize the shuffle operation in Spark. In our benchmarks, the RAPIDS Shuffle Manager plugin accelerated the Spark shuffle operation, reducing the shuffle fetch time by 40% for string and 83% for integer data types.
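The shuffle discussed above is triggered by wide transformations (e.g. groupBy or join), whose map side hash-partitions every record by key so that all records with the same key land on the same reducer. As a minimal sketch in plain Python (a deliberate simplification, not Spark's actual implementation, which adds serialization, spilling, and network fetch), the map-side write stage can be illustrated as:

```python
def partition_for(key, num_partitions):
    """Assign a record to a reducer partition by hashing its key,
    mimicking Spark's HashPartitioner (non-negative hash mod count)."""
    return hash(key) % num_partitions  # Python's % is non-negative here


def shuffle_write(records, num_partitions):
    """Map-side shuffle write: bucket (key, value) pairs per partition.
    Each bucket is what a reducer would later fetch over the network."""
    buckets = {p: [] for p in range(num_partitions)}
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return buckets


# Integer keys hash to themselves in CPython, so the layout is deterministic.
records = [(0, "a"), (1, "b"), (4, "c"), (5, "d")]
buckets = shuffle_write(records, num_partitions=4)
```

Every record crosses a partition boundary chosen by its key, which is why shuffle volume and fetch latency (the metrics measured in this work) grow with the data size and the number of partitions.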
Keywords: Scalability, High performance computing, Conferences, Pipelines, Graphics processing units, Distributed databases, Cluster computing, Machine learning, Computer architecture, Sparks, Apache Spark, GPU, Workflow scalability, Shuffle
Published
28/10/2025
LEHNEN, Moisés Felipe; SCHNORR, Lucas Mello; NAVAUX, Philippe Olivier Alexandre. Can GPUs help scaling traditional Apache Spark workloads? In: WORKSHOP ON CLOUD COMPUTING (WCC) - INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 37., 2025, Bonito/MS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 69-76.