DYAD: Locality-aware Data Management for accelerating Deep Learning Training

  • Hariharan Devarajan Lawrence Livermore National Laboratory
  • Ian Lumsden University of Tennessee
  • Chen Wang Lawrence Livermore National Laboratory
  • Konstantia Georgouli Lawrence Livermore National Laboratory
  • Tom Scogland Lawrence Livermore National Laboratory
  • Jae-Seung Yeom Lawrence Livermore National Laboratory
  • Michela Taufer University of Tennessee

Abstract


Deep Learning (DL) is increasingly applied across various fields to solve complex scientific challenges on modern high-performance computing (HPC) systems that are beyond the reach of traditional algorithms. Training DL models for scientific applications involves processing multi-terabyte datasets in each epoch. The data access behavior during DL training exposes opportunities to cache these datasets in near-compute storage accelerators in HPC systems, enhancing I/O throughput. However, current middleware solutions employ near-compute storage accelerators primarily as exclusive caches, which limits the effectiveness of cache access locality. To address this problem, we introduce DYAD, a system designed to maximize sample locality in the cache and thereby significantly increase I/O throughput on HPC systems. DYAD optimizes I/O for DL training through three key features. First, DYAD boosts inter-node access speeds by using a novel streaming RPC with RDMA protocol, achieving a 1.25x performance gain over state-of-the-art solutions. Second, DYAD further enhances inter-node access by coordinating data movement, which mitigates network congestion and increases throughput for inter-node accesses by up to 8.78x. Last, DYAD uses smart metadata caching that outperforms traditional global metadata access methods by several orders of magnitude in lookup throughput. We demonstrate how DYAD accelerates large-scale DL training on a high-end HPC cluster with 512 GPUs, achieving up to 10.82x faster epochs than UnifyFS by performing locality-aware caching on near-compute storage accelerators.
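
The following is a minimal, self-contained Python sketch of the locality-aware caching idea summarized above. It is not DYAD's actual API: every name here (Node, Cluster, locate_sample, read_sample) is hypothetical, and the "network" transfer is modeled as a plain dictionary lookup rather than the streaming-RPC-with-RDMA transport and distributed metadata service the paper describes.

    # Hypothetical sketch of locality-aware sample caching for DL training.
    # Not DYAD's implementation; names and data structures are illustrative.

    class Node:
        """One compute node with a near-compute storage cache and a node-local
        metadata cache mapping sample IDs to the node that owns them."""

        def __init__(self, node_id):
            self.node_id = node_id
            self.local_cache = {}      # sample_id -> bytes held on this node
            self.metadata_cache = {}   # sample_id -> owning Node (cached lookup)

        def produce(self, sample_id, data):
            """First epoch: this node reads the sample from the PFS and owns it."""
            self.local_cache[sample_id] = data

        def read_sample(self, sample_id, cluster):
            """Later epochs: serve from the local cache if possible, otherwise
            locate the owner via the metadata cache and copy the sample over."""
            # Fast path: the sample is already in this node's near-compute cache.
            if sample_id in self.local_cache:
                return self.local_cache[sample_id]

            # Metadata lookup, cached locally so repeated lookups avoid hitting
            # a global metadata service every time.
            owner = self.metadata_cache.get(sample_id)
            if owner is None:
                owner = cluster.locate_sample(sample_id)
                self.metadata_cache[sample_id] = owner

            # Remote fetch (stand-in for an RPC + RDMA transfer), then keep a
            # local replica so the next access to this sample is node-local.
            data = owner.local_cache[sample_id]
            self.local_cache[sample_id] = data
            return data


    class Cluster:
        """Toy global view used only to bootstrap metadata; a real system would
        distribute this state instead of centralizing it."""

        def __init__(self, nodes):
            self.nodes = nodes

        def locate_sample(self, sample_id):
            for node in self.nodes:
                if sample_id in node.local_cache:
                    return node
            raise KeyError(f"sample {sample_id} not cached anywhere")


    if __name__ == "__main__":
        a, b = Node("node-a"), Node("node-b")
        cluster = Cluster([a, b])
        a.produce("sample-42", b"pixels...")
        # First access from node-b crosses the "network"; the second is local.
        assert b.read_sample("sample-42", cluster) == b"pixels..."
        assert "sample-42" in b.local_cache

The design intent mirrored here is that repeated epochs over the same dataset turn remote accesses into node-local hits, and that metadata lookups are served from a local cache rather than a global store; the actual performance numbers in the abstract (1.25x, 8.78x, 10.82x) come from DYAD's real transport and coordination layers, which this sketch does not model.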
Keywords: Training, Deep learning, Protocols, High performance computing, Redundancy, Metadata, Performance gain, Throughput, Resource management, caching, I/O performance, middleware, HPC, sample sharing, RDMA-enabled
Published
13/11/2024
DEVARAJAN, Hariharan; LUMSDEN, Ian; WANG, Chen; GEORGOULI, Konstantia; SCOGLAND, Tom; YEOM, Jae-Seung; TAUFER, Michela. DYAD: Locality-aware Data Management for accelerating Deep Learning Training. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 36., 2024, Hilo/Hawaii. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 13-24.