Auto-Adaptive Multi-Objective Scheduling for Academic HPC Grids: Simulation and Execution with OAR and SimGrid

  • Xavier P. Sebastião (UFPel / UniZambeze)
  • Gerson Geraldo H. Cavalheiro (UFPel)

Abstract


This paper proposes an auto-adaptive scheduling framework that integrates OAR, a production-grade scheduler deployed on Grid’5000, with SimGrid-based simulation to enable continuous learning and multi-objective optimization. Unlike existing approaches that treat prediction, simulation, and scheduling separately, the framework establishes a closed feedback loop: runtime data from simulated Grid’5000 environments informs policy evaluation, and validated policies are designed for deployment back into OAR via its plugin architecture. The framework aims to balance makespan, energy consumption, and fairness while supporting containerized scientific workflows through simulation that models operational conditions.
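One way to picture the multi-objective policy evaluation step is a Pareto filter over simulated outcomes. The sketch below is only an illustration under stated assumptions: the abstract does not say how the three objectives are combined, so the `PolicyResult` record, the policy names, the use of Jain's index as the fairness measure, and all numeric values are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PolicyResult:
    """Simulated outcome of one candidate policy (hypothetical record)."""
    name: str
    makespan_s: float  # total completion time, lower is better
    energy_kwh: float  # energy consumed, lower is better
    fairness: float    # fairness score in (0, 1], higher is better


def jain_index(shares: List[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2)."""
    squares = sum(x * x for x in shares)
    if not shares or squares == 0:
        return 0.0
    total = sum(shares)
    return (total * total) / (len(shares) * squares)


def dominates(a: PolicyResult, b: PolicyResult) -> bool:
    """True if a is no worse than b on every objective and better on one."""
    no_worse = (a.makespan_s <= b.makespan_s
                and a.energy_kwh <= b.energy_kwh
                and a.fairness >= b.fairness)
    better = (a.makespan_s < b.makespan_s
              or a.energy_kwh < b.energy_kwh
              or a.fairness > b.fairness)
    return no_worse and better


def pareto_front(results: List[PolicyResult]) -> List[PolicyResult]:
    """Keep only the policies that no other candidate dominates."""
    return [r for r in results
            if not any(dominates(other, r) for other in results)]


# Made-up values for illustration only (assumption, not measured data).
candidates = [
    PolicyResult("policy_a", 1000.0, 50.0, jain_index([1.0, 1.0, 1.0, 1.0])),
    PolicyResult("policy_b", 1200.0, 40.0, 0.9),
    PolicyResult("policy_c", 900.0, 55.0, 0.8),
    PolicyResult("policy_d", 1300.0, 60.0, 0.7),
]
front_names = [r.name for r in pareto_front(candidates)]
print(front_names)
```

In this toy data, `policy_d` is worse than `policy_a` on all three objectives and is filtered out, while the other three each win on at least one objective and remain as trade-off candidates. A Pareto filter was chosen here because it avoids fixing objective weights, which the abstract does not specify.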

Published
May 6, 2026
SEBASTIÃO, Xavier P.; CAVALHEIRO, Gerson Geraldo H. Auto-Adaptive Multi-Objective Scheduling for Academic HPC Grids: Simulation and Execution with OAR and SimGrid. In: ESCOLA REGIONAL DE ALTO DESEMPENHO DA REGIÃO SUL (ERAD-RS), 26., 2026, Bagé/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 181-184. ISSN 2595-4164. DOI: https://doi.org/10.5753/eradrs.2026.21294.