Performance Evaluation of Distributed Deep Learning Training on Low-Bandwidth Networks
Abstract
The growing complexity of Deep Learning models makes training on a single device difficult, requiring distributed strategies such as Data Parallelism. However, this approach introduces critical synchronization challenges between processing units. This work evaluates TensorFlow parallel strategies on an HPC cluster with a low-bandwidth network to analyze how application factors contribute to reducing training time. The study compares the standard synchronous method against a proposed Custom Training Loop with gradient accumulation, designed to reduce the synchronization bottleneck by synchronizing gradients less frequently. Our results indicate that optimizing synchronization intervals and batch sizes allows for scalable performance gains even on low-bandwidth infrastructure.
References
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE CVPR, pages 248–255.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. CoRR.
Langer, M., He, Z., Rahayu, W., and Xue, Y. (2020). Distributed training of deep learning models: A taxonomic perspective. IEEE TPDS.
Shen, L., Sun, Y., Yu, Z., Ding, L., Tian, X., and Tao, D. (2024). On efficient training of large-scale deep learning models. ACM Computing Surveys, 57:1–36.
Tan, M. and Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. CoRR, abs/1905.11946.
Published
06/05/2026
How to Cite
MATOS, Rayan Raddatz de; GULART, Marcelo Cardoso Oliveira; BRUMATO, Kenichi; SCHNORR, Lucas Mello. Performance Evaluation of Distributed Deep Learning Training on Low-Bandwidth Networks. In: ESCOLA REGIONAL DE ALTO DESEMPENHO DA REGIÃO SUL (ERAD-RS), 26., 2026, Bagé/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 97-100. ISSN 2595-4164. DOI: https://doi.org/10.5753/eradrs.2026.20518.
