DYAD: Locality-aware Data Management for accelerating Deep Learning Training

  • Hariharan Devarajan Lawrence Livermore National Laboratory
  • Ian Lumsden University of Tennessee
  • Chen Wang Lawrence Livermore National Laboratory
  • Konstantia Georgouli Lawrence Livermore National Laboratory
  • Tom Scogland Lawrence Livermore National Laboratory
  • Jae-Seung Yeom Lawrence Livermore National Laboratory
  • Michela Taufer University of Tennessee

Abstract


Deep Learning (DL) is increasingly applied across various fields to solve complex scientific challenges on modern high-performance computing (HPC) systems that are beyond the reach of traditional algorithms. Training DL models for scientific applications involves processing multi-terabyte datasets in each epoch. The data access behavior during DL training exposes opportunities to cache these datasets in near-compute storage accelerators in HPC systems, enhancing I/O throughput. However, current middleware solutions employ near-compute storage accelerators primarily as exclusive caches, which limits the effectiveness of cache access locality. To address this problem, we introduce DYAD, a system designed to maximize sample locality in the cache and thereby significantly increase I/O throughput on HPC systems. DYAD optimizes I/O for DL training through three key features. First, DYAD boosts inter-node access speeds by using a novel streaming RPC with RDMA protocol, achieving a 1.25x performance gain over state-of-the-art solutions. Second, DYAD further enhances inter-node access by coordinating data movement, which mitigates network congestion and increases throughput for inter-node accesses by up to 8.78x. Last, DYAD uses smart metadata caching that outperforms traditional global metadata access methods by several orders of magnitude in lookup throughput. We demonstrate how DYAD accelerates large-scale DL training on a high-end HPC cluster with 512 GPUs, achieving up to 10.82x faster epochs than UnifyFS by performing locality-aware caching on near-compute storage accelerators.
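
The following is a minimal, self-contained Python sketch of the locality-aware caching idea summarized above. It is not DYAD's actual API: every name here (Node, Cluster, locate_sample, read_sample) is hypothetical, and the "network" transfer is modeled as a plain dictionary lookup rather than the streaming-RPC-with-RDMA transport and distributed metadata service the paper describes.

    # Hypothetical sketch of locality-aware sample caching for DL training.
    # Not DYAD's implementation; names and data structures are illustrative.

    class Node:
        """One compute node with a near-compute storage cache and a node-local
        metadata cache mapping sample IDs to the node that owns them."""

        def __init__(self, node_id):
            self.node_id = node_id
            self.local_cache = {}      # sample_id -> bytes held on this node
            self.metadata_cache = {}   # sample_id -> owning Node (cached lookup)

        def produce(self, sample_id, data):
            """First epoch: this node reads the sample from the PFS and owns it."""
            self.local_cache[sample_id] = data

        def read_sample(self, sample_id, cluster):
            """Later epochs: serve from the local cache if possible, otherwise
            locate the owner via the metadata cache and copy the sample over."""
            # Fast path: the sample is already in this node's near-compute cache.
            if sample_id in self.local_cache:
                return self.local_cache[sample_id]

            # Metadata lookup, cached locally so repeated lookups avoid hitting
            # a global metadata service every time.
            owner = self.metadata_cache.get(sample_id)
            if owner is None:
                owner = cluster.locate_sample(sample_id)
                self.metadata_cache[sample_id] = owner

            # Remote fetch (stand-in for an RPC + RDMA transfer), then keep a
            # local replica so the next access to this sample is node-local.
            data = owner.local_cache[sample_id]
            self.local_cache[sample_id] = data
            return data


    class Cluster:
        """Toy global view used only to bootstrap metadata; a real system would
        distribute this state instead of centralizing it."""

        def __init__(self, nodes):
            self.nodes = nodes

        def locate_sample(self, sample_id):
            for node in self.nodes:
                if sample_id in node.local_cache:
                    return node
            raise KeyError(f"sample {sample_id} not cached anywhere")


    if __name__ == "__main__":
        a, b = Node("node-a"), Node("node-b")
        cluster = Cluster([a, b])
        a.produce("sample-42", b"pixels...")
        # First access from node-b crosses the "network"; the second is local.
        assert b.read_sample("sample-42", cluster) == b"pixels..."
        assert "sample-42" in b.local_cache

The design intent mirrored here is that repeated epochs over the same dataset turn remote accesses into node-local hits, and that metadata lookups are served from a local cache rather than a global store; the actual performance numbers in the abstract (1.25x, 8.78x, 10.82x) come from DYAD's real transport and coordination layers, which this sketch does not model.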
Keywords: Training, Deep learning, Protocols, High performance computing, Redundancy, Metadata, Performance gain, Throughput, Resource management, caching, I/O performance, middleware, HPC, sample sharing, RDMA-enabled
Published
13/11/2024
DEVARAJAN, Hariharan; LUMSDEN, Ian; WANG, Chen; GEORGOULI, Konstantia; SCOGLAND, Tom; YEOM, Jae-Seung; TAUFER, Michela. DYAD: Locality-aware Data Management for accelerating Deep Learning Training. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 36., 2024, Hilo/Hawaii. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 13-24.