A job shaping strategy to accommodate workload traces under varying resource management policies

  • João Pedro M. N. dos Santos (LNCC)
  • Antônio Tadeu A. Gomes (LNCC)

Abstract


Supercomputers play a pivotal role in advancing research and development across diverse scientific and engineering domains. However, configuring resource management in these systems for maximum productivity and cost-effectiveness is a challenge. Workload simulation is a crucial tool in this context, offering a way to explore resource management configurations under expected user behavior. This paper addresses a specific requirement of simulation-based optimization applied to tuning such configurations: the need for simulators that are both precise and efficient. To that end, it introduces a job shaping strategy that accommodates real workload traces under varying resource management policies in discrete-event simulations of resource management systems (RMS). Findings from evaluating the proposed strategy on a real-world case study suggest that job shaping effectively captures changes in system behavior, even when some of the real workload traces used as simulation input are incompatible with the simulated policies.
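The abstract does not spell out the shaping rule itself, so the following is only an illustrative sketch of what "shaping" a traced job to fit a simulated policy could look like, not the authors' method. It assumes a hypothetical policy with a per-job node limit and a walltime limit, and assumes perfect linear scaling, so that a job's node-seconds are preserved when its node count is reduced. The names `Job`, `Policy`, `fits`, and `shape_job` are invented for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Job:
    """One traced job (hypothetical, simplified record)."""
    job_id: int
    nodes: int
    runtime_s: float


@dataclass(frozen=True)
class Policy:
    """Per-job limits of a simulated policy (assumed, not from the paper)."""
    max_nodes: int
    max_runtime_s: float


def fits(job: Job, policy: Policy) -> bool:
    """True if the traced job is already compatible with the policy."""
    return job.nodes <= policy.max_nodes and job.runtime_s <= policy.max_runtime_s


def shape_job(job: Job, policy: Policy) -> Job:
    """Reshape an incompatible job so the simulator can accept it.

    Assumption: perfect linear scaling, so nodes * runtime is conserved
    when the node count is clamped. Any runtime still above the limit is
    capped, mimicking a job killed at its walltime limit.
    """
    if fits(job, policy):
        return job
    nodes = min(job.nodes, policy.max_nodes)
    runtime = job.runtime_s * job.nodes / nodes  # conserve node-seconds
    runtime = min(runtime, policy.max_runtime_s)  # cap at the walltime limit
    return replace(job, nodes=nodes, runtime_s=runtime)


policy = Policy(max_nodes=4, max_runtime_s=1000.0)
trace = [Job(1, 8, 100.0), Job(2, 2, 50.0), Job(3, 8, 600.0)]
shaped = [shape_job(j, policy) for j in trace]
```

In this sketch, job 1 is narrowed to 4 nodes and its runtime doubles, job 2 passes through unchanged, and job 3 is narrowed and then capped at the walltime limit. A real strategy would likely need a less idealized speedup model.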

Published
23/10/2024
SANTOS, João Pedro M. N. dos; GOMES, Antônio Tadeu A. A job shaping strategy to accommodate workload traces under varying resource management policies. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 25., 2024, São Carlos/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 73-84. DOI: https://doi.org/10.5753/sscad.2024.244798.