Can GPUs help scaling traditional Apache Spark workloads?

  • Moisés Felipe Lehnen UFRGS
  • Lucas Mello Schnorr UFRGS
  • Philippe Olivier Alexandre Navaux UFRGS

Abstract


Apache Spark is an open-source framework for in-memory distributed computing that has proven very efficient for horizontally scaling data and Machine Learning pipelines. Spark outperforms Hadoop MapReduce by relying on host memory, rather than on storage as Hadoop does, to run workload computation. However, distributed systems generally have some drawbacks. A common drawback of Apache Spark is the shuffle operation, which happens when wide transformations are applied to a DataFrame and is a source of numerous performance issues. In this work, we dive into the Apache Spark shuffle operation. We (1) explain the factors that impact shuffle execution; (2) explore different settings to verify their impact on the latency and size of the shuffle data; and (3) evaluate the power of GPUs to help traditional ETL workloads, in our case specifically to optimize the shuffle operation in Spark. In our benchmarks, the RAPIDS Shuffle Manager plugin accelerated the Spark shuffle operation, reducing the shuffle fetch time by 40% for string and 83% for integer data types.
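The shuffle discussed above is triggered by wide transformations (e.g. groupBy or join), whose map side hash-partitions every record by key so that all records with the same key land on the same reducer. As a minimal sketch in plain Python (a deliberate simplification, not Spark's actual implementation, which adds serialization, spilling, and network fetch), the map-side write stage can be illustrated as:

```python
def partition_for(key, num_partitions):
    """Assign a record to a reducer partition by hashing its key,
    mimicking Spark's HashPartitioner (non-negative hash mod count)."""
    return hash(key) % num_partitions  # Python's % is non-negative here


def shuffle_write(records, num_partitions):
    """Map-side shuffle write: bucket (key, value) pairs per partition.
    Each bucket is what a reducer would later fetch over the network."""
    buckets = {p: [] for p in range(num_partitions)}
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return buckets


# Integer keys hash to themselves in CPython, so the layout is deterministic.
records = [(0, "a"), (1, "b"), (4, "c"), (5, "d")]
buckets = shuffle_write(records, num_partitions=4)
```

Every record crosses a partition boundary chosen by its key, which is why shuffle volume and fetch latency (the metrics measured in this work) grow with the data size and the number of partitions.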
Keywords: Scalability, High performance computing, Conferences, Pipelines, Graphics processing units, Distributed databases, Cluster computing, Machine learning, Computer architecture, Sparks, Apache Spark, GPU, Workflow scalability, Shuffle
Published
28/10/2025
LEHNEN, Moisés Felipe; SCHNORR, Lucas Mello; NAVAUX, Philippe Olivier Alexandre. Can GPUs help scaling traditional Apache Spark workloads? In: WORKSHOP ON CLOUD COMPUTING (WCC) - INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 37., 2025, Bonito/MS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 69-76.