Deep Learning on Large-Scale Multicore Clusters
Abstract
Convolutional neural networks (CNNs) have achieved outstanding accuracy compared with conventional machine learning algorithms. Recent work has shown that large and complex models, which incur significant training cost, are needed to reach higher accuracy. To train such models efficiently on high-performance computers (HPCs), many parallelization techniques for CNNs have been developed. However, most of these techniques target GPUs, and parallelization for CPUs has not been fully investigated. This paper explores CNN training performance on large-scale multicore clusters by optimizing intra-node processing and applying inter-node parallelization techniques originally developed for multiple GPUs. Detailed experiments conducted on state-of-the-art multicore processors using the OpenMP API and the MPI framework demonstrate that Caffe-based CNNs can be accelerated with well-designed multithreaded programs. We achieved up to 1.64 times speedup in convolution operations with our devised lowering strategy compared to conventional lowering, and 772 times speedup with 864 nodes compared to one node.
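The "lowering" mentioned in the abstract refers to the standard im2col transformation, which rewrites a convolution as one large matrix multiplication so that optimized BLAS routines (and multithreading) can be applied. Below is a minimal NumPy sketch of this conventional scheme; the function names and shapes are illustrative assumptions, not the paper's devised variant.

```python
import numpy as np

def im2col(x, kh, kw):
    # x: (C, H, W) input. Each kh*kw patch across all channels becomes
    # one column, so convolution reduces to a single matrix product.
    # This is the conventional lowering the abstract compares against.
    C, H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1          # valid-convolution output size
    cols = np.empty((C * kh * kw, oh * ow))
    row = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                # Shifted view of the input, flattened into one row.
                cols[row] = x[c, i:i + oh, j:j + ow].ravel()
                row += 1
    return cols

def conv2d_lowered(x, w):
    # w: (F, C, kh, kw) filters; returns feature maps of shape (F, oh, ow).
    F, C, kh, kw = w.shape
    cols = im2col(x, kh, kw)
    out = w.reshape(F, -1) @ cols            # the single large GEMM
    oh = x.shape[1] - kh + 1
    ow = x.shape[2] - kw + 1
    return out.reshape(F, oh, ow)
```

The trade-off the paper optimizes is visible here: im2col duplicates each input element up to kh*kw times, trading memory footprint and bandwidth for a GEMM that parallelizes well across cores.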
Keywords:
Parallel processing, Computational modeling, Training, Convolution, Multicore processing, Backpropagation, Data models, Convolutional Neural Networks, OpenMP, MPI, Deep Learning, Caffe, Multicore cluster
Published
24/09/2018
How to Cite
SAKIYAMA, Kazumasa; KATO, Shinpei; ISHIKAWA, Yutaka; HORI, Atsushi; MONRROY, Abraham.
Deep Learning on Large-Scale Multicore Clusters. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 30., 2018, Lyon/FR.
Anais [...].
Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 314-321.
