Scheduling communicating tasks on DragonFly topologies with Reinforcement Learning

Claudinei Cabral Junior; Guilherme Piêgas Koslovski

doi:10.5753/eradrs.2026.20439

Claudinei Cabral Junior UDESC
Guilherme Piêgas Koslovski UDESC

Abstract

Efficient scheduling of workflows in high-performance computing data centers requires decisions that take the topology of the interconnection network into account. This work presents a Reinforcement Learning scheduler with an Actor-Critic architecture that incorporates awareness of the DragonFly network topology. Results show that incorporating topological information allows the agent to learn policies that favor task locality, reducing communication cost.
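
To make the notion of locality concrete, the following is a minimal sketch of how a topology-aware communication cost could be computed for a task placement. It is not the authors' formulation: the `Node` type, the `hop_distance` and `communication_cost` functions, and the hop values (0 within a router, 1 within a group, 3 across groups as an upper bound under minimal routing) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical compute node location in a DragonFly network."""
    group: int   # DragonFly group index
    router: int  # router index inside the group


def hop_distance(a: Node, b: Node) -> int:
    """Assumed router-to-router hop count between two nodes.

    Same router: 0 hops. Same group: 1 hop (routers in a group are
    assumed fully connected). Different groups: 3 hops, an upper bound
    for minimal routing (local, global, local).
    """
    if a.group == b.group:
        return 0 if a.router == b.router else 1
    return 3


def communication_cost(edges, placement) -> float:
    """Sum of data volume times hop distance over all workflow edges.

    edges: iterable of (source_task, destination_task, volume)
    placement: mapping from task name to Node
    """
    return sum(
        volume * hop_distance(placement[src], placement[dst])
        for src, dst, volume in edges
    )


# Illustrative workflow with two communicating task pairs (made-up volumes).
edges = [("t0", "t1", 10.0), ("t1", "t2", 5.0)]

# Placement that keeps tasks close together.
local = {"t0": Node(0, 0), "t1": Node(0, 0), "t2": Node(0, 1)}

# Placement that spreads tasks across groups.
spread = {"t0": Node(0, 0), "t1": Node(1, 0), "t2": Node(2, 3)}

print(communication_cost(edges, local))   # 5.0
print(communication_cost(edges, spread))  # 45.0
```

Under these assumptions, a placement that keeps communicating tasks in the same router or group yields a lower cost, which is the kind of signal a topology-aware reward could expose to the agent.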
