Bottleneck-Aware Network Design for Optimizing Distributed AI Model Training

  • Vitor F. Zanotelli (UFES)
  • Arthur T. Sampaio (UFES)
  • Magnos Martinello (UFES)
  • Jordi Ros-Giralt (Qualcomm Europe, Inc.)
  • Giovanni Comarela (UFES)

Abstract


Distributed training of AI models significantly increases the demand for efficient communication both within and across data centers. Grounded in Bottleneck Structure Theory (BST), this work designs a bottleneck-aware network architecture that minimizes training time while using the least possible bandwidth capacity. We propose a methodology for optimizing network configurations so that no provisioned capacity is wasted. The resulting design serves as a performance baseline before any overprovisioning strategy is adopted.

Published: 2025-10-16
ZANOTELLI, Vitor F.; SAMPAIO, Arthur T.; MARTINELLO, Magnos; ROS-GIRALT, Jordi; COMARELA, Giovanni. Bottleneck-Aware Network Design for Optimizing Distributed AI Model Training. In: REGIONAL SCHOOL OF INFORMATICS OF ESPÍRITO SANTO (ERI-ES), 10., 2025, Espírito Santo/ES. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 158-161. DOI: https://doi.org/10.5753/eries.2025.16023.