In search of efficient scheduling heuristics from simulations and Machine Learning
Abstract
High Performance Computing (HPC) systems are used to solve complex problems in many fields of knowledge. However, these platforms have been growing rapidly in size and complexity, and managing applications (jobs) efficiently has become a challenge. Typically, this management relies on scheduling heuristics, which consist of functions that order the jobs. In this work, we evaluate the limits of regression methods for creating scheduling heuristics. Our results show that the simplest heuristic led to the most efficient scheduling, while the more complex heuristics showed instabilities due to multicollinearity.