Ember: Asynchronous Dynamic Data Serving for PyTorch Distributed Training

  • Patrick O. C. Araújo (UFV)
  • Fábio T. Ramos (UFV)
  • Jhonata M. da Costa (UFV)
  • Mario Drumond (Huawei)
  • José Augusto M. Nacif (UFV)

Abstract

Advances in natural language processing (NLP) and computer vision (CV) have led to substantial growth in data and model sizes, often requiring models and data to be distributed across multiple GPUs and machines to accelerate training. Existing tools such as PyTorch Distributed Data-Parallel (DDP) operate at a low level of abstraction and demand substantial expertise in distributed training. As a result, users must either adapt their workflows to rigid tools or rewrite large portions of code for each model and deployment, which can degrade performance. We introduce Ember, a customizable distributed training framework with new data distribution mechanisms. Ember uses remote procedure calls to decouple data loading logic from model synchronization, ensuring efficient use of computational resources while remaining accessible and customizable. To evaluate Ember's performance, we compare its core training features with Ray, a powerful distributed framework, to verify that they are competitive with state-of-the-art implementations. Experiments on several datasets and models show that Ember achieves its goals: it reduces GPU idle time, optimizes data transfer, and matches the training times of our baseline while its new asynchronous services use up to 36% less memory.
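The decoupling described in the abstract can be pictured with a short sketch: one process serves batches over torch.distributed.rpc while separate trainer processes synchronize gradients with DDP. This is a minimal illustration of the general pattern, not Ember's actual API; the names (DataServer, next_batch, the three-process layout), the synthetic data, and the hyperparameters are hypothetical, and the example runs on CPU with the gloo backend for simplicity.

```python
# Hypothetical sketch (not Ember's API): a dedicated data-serving process is
# reached via torch.distributed.rpc, while two trainer processes form their
# own DDP group for gradient synchronization.
import os

import torch
import torch.distributed as dist
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

WORLD_SIZE = 3          # rank 0: data server, ranks 1-2: trainers
BATCH, FEATURES = 32, 16


class DataServer:
    """Lives on the data-serving rank; produces (x, y) batches on demand."""

    def next_batch(self):
        x = torch.randn(BATCH, FEATURES)      # synthetic features
        y = torch.randint(0, 2, (BATCH,))     # synthetic labels
        return x, y


def trainer(rank, pg_init):
    # Trainers form their own gradient-synchronization group (DDP),
    # separate from the RPC channel used to fetch data.
    dist.init_process_group(
        "gloo", init_method=pg_init, rank=rank - 1, world_size=WORLD_SIZE - 1
    )
    model = DDP(torch.nn.Linear(FEATURES, 2))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    server = rpc.remote("server", DataServer)     # RRef to a remote instance
    future = server.rpc_async().next_batch()      # prefetch the first batch
    for _ in range(5):
        x, y = future.wait()
        future = server.rpc_async().next_batch()  # overlap fetch with compute
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()                           # DDP all-reduces gradients
        opt.step()
    dist.destroy_process_group()


def run(rank, pg_init):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"           # RPC rendezvous
    name = "server" if rank == 0 else f"trainer{rank}"
    rpc.init_rpc(name, rank=rank, world_size=WORLD_SIZE)
    if rank != 0:
        trainer(rank, pg_init)
    rpc.shutdown()    # the server blocks here, serving calls until trainers finish


if __name__ == "__main__":
    pg_init = "tcp://127.0.0.1:29501"             # separate rendezvous for DDP
    mp.spawn(run, args=(pg_init,), nprocs=WORLD_SIZE, join=True)
```

Prefetching the next batch with rpc_async while the current step runs is what keeps the accelerator from idling between steps; this mirrors, in miniature, the idle-time reduction and data-transfer overlap the abstract refers to, though Ember's own services may differ.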

References

Audibert, A., Chen, Y., et al. (2023). tf.data service: A case for disaggregating ML input data processing.

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. Software available from [link].

Ahmed, M. I., Mamun, S. M., and Asif, A. U. Z. (2021). DCNN-based vegetable image classification using transfer learning: A comparative study.

Sergeev, A. and Del Balso, M. (2018). Horovod: Fast and easy distributed deep learning in TensorFlow.

Deng, L. (2012). The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine.

Verbraeken, J., Wolters, M., et al. (2020). A survey on distributed machine learning.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition.

Ko, Y. and Kim, S.-W. (2021). SHAT: A novel asynchronous training algorithm that provides fast model convergence in distributed deep learning. Applied Sciences, 12(1):292.

Aach, M., Inanc, E., Sarma, R., et al. (2023). Large scale performance analysis of distributed deep learning frameworks for convolutional neural networks.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. (2019). MobileNetV2: Inverted residuals and linear bottlenecks.

Tan, M. and Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks.

Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., and Stoica, I. (2018). Ray: A distributed framework for emerging ai applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577. USENIX Association.

Damania, P., Li, S., et al. (2023). PyTorch RPC: Distributed deep learning built on tensor-optimized remote procedure calls.

Li, S., Zhao, Y., et al. (2020). PyTorch Distributed: Experiences on accelerating data parallel training.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2023). Attention is all you need.

Gao, Y., He, Y., Li, X., et al. (2024). An empirical study on low GPU utilization of deep learning jobs.

Published
25/05/2026
ARAÚJO, Patrick O. C.; RAMOS, Fábio T.; COSTA, Jhonata M. da; DRUMOND, Mario; NACIF, José Augusto M. Ember: Asynchronous Dynamic Data Serving for PyTorch Distributed Training. In: SIMPÓSIO BRASILEIRO DE REDES DE COMPUTADORES E SISTEMAS DISTRIBUÍDOS (SBRC), 44., 2026, Praia do Forte/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 701-715. ISSN 2177-9384. DOI: https://doi.org/10.5753/sbrc.2026.19762.