On the limits of Machine Learning techniques for learning scheduling policies

Authors

L. de Sousa Rosa, D. Carastan-Santos, A. Goldman, D. Trystram

DOI:

https://doi.org/10.5753/reic.2023.3419

Keywords:

Scheduling heuristics, High-Performance Computing, Machine Learning, Linear Regression

Abstract

This undergraduate research project explores the emerging relationship between resource management on high-performance computing (HPC) platforms and the use of regression-obtained scheduling heuristics to optimize performance. Recent research has shown that machine learning (ML) techniques can be used to generate scheduling heuristics that are both simple and efficient. This work proposes an alternative approach that uses polynomial functions to generate scheduling heuristics. The simplest polynomial proved to be one of the most efficient heuristics. We also evaluated the resilience of the regression-obtained heuristics over time. We published two papers in peer-reviewed national and international workshops (Qualis B3/B4).
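
As a rough illustration of what a regression-obtained scheduling heuristic can look like, the minimal Python sketch below fits a low-degree polynomial scoring function over synthetic job features (requested processors, runtime estimate, waiting time) and uses it to order a waiting queue. All features, data, and the training target here are hypothetical stand-ins; the papers listed in the references derive their targets from batch-scheduling simulations and workload traces, with their own feature choices.

# Minimal sketch of a regression-obtained scheduling heuristic.
# Features, data, and the training target below are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(42)

# Synthetic job features: requested processors (p), runtime estimate (q, seconds),
# and waiting time so far (r, seconds).
n_jobs = 200
p = rng.integers(1, 65, n_jobs).astype(float)
q = rng.uniform(60.0, 3600.0, n_jobs)
r = rng.uniform(0.0, 7200.0, n_jobs)

# Hypothetical priority target, e.g. produced offline by a scheduling simulator
# (lower value = dispatch earlier).
target = p * q / (r + 1.0) + rng.normal(0.0, 1.0, n_jobs)

# Degree-2 polynomial basis over (p, q, r), fitted by ordinary least squares.
X = np.column_stack([
    np.ones(n_jobs), p, q, r,
    p * q, p * r, q * r,
    p ** 2, q ** 2, r ** 2,
])
coef, *_ = np.linalg.lstsq(X, target, rcond=None)

def score(job):
    """Learned polynomial score: lower means higher scheduling priority."""
    pj, qj, rj = job
    basis = np.array([1.0, pj, qj, rj,
                      pj * qj, pj * rj, qj * rj,
                      pj ** 2, qj ** 2, rj ** 2])
    return float(basis @ coef)

# Order a small waiting queue with the learned heuristic.
queue = [(4, 600.0, 120.0), (32, 3000.0, 30.0), (1, 120.0, 4000.0)]
for job in sorted(queue, key=score):
    print(job, round(score(job), 2))

Once fitted, such a score costs only a handful of multiplications and additions per job, which is consistent with the abstract's observation that even the simplest polynomial was among the most efficient heuristics evaluated.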

References

Alin, A. (2010). Multicollinearity. Wiley Interdisciplinary Reviews: Computational Statistics, 2(3):370–374.

Amvrosiadis, G., Kuchnik, M., Park, J. W., Cranor, C., Ganger, G. R., Moore, E., and DeBardeleben, N. (2018). The Atlas cluster trace repository. USENIX ;login:, 43(4).

Brucker, P. (2007). Scheduling Algorithms. Springer, hardcover edition.

Carastan-Santos, D., Camargo, R. Y. D., Trystram, D., and Zrigui, S. (2019). One can only gain by replacing EASY backfilling: A simple scheduling policies case study. In 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). IEEE.

Carastan-Santos, D. and de Camargo, R. Y. (2017). Obtaining dynamic scheduling policies with simulation and machine learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM.

Casanova, H., Giersch, A., Legrand, A., Quinson, M., and Suter, F. (2014). Versatile, scalable, and accurate simulation of distributed applications and platforms. Journal of Parallel and Distributed Computing, 74(10):2899–2917.

Fan, Y., Lan, Z., Childers, T., Rich, P., Allcock, W., and Papka, M. E. (2021). Deep reinforcement agent for scheduling in HPC. In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 807–816.

Feitelson, D. G. (2001). Metrics for parallel job scheduling and their convergence. In Job Scheduling Strategies for Parallel Processing, pages 188–205. Springer Berlin Heidelberg.

Feitelson, D. G., Rudolph, L., Schwiegelshohn, U., Sevcik, K. C., and Wong, P. (1997). Theory and practice in parallel job scheduling. In Job Scheduling Strategies for Parallel Processing: IPPS '97 Workshop, Geneva, Switzerland, April 5, 1997, Proceedings, pages 1–34. Springer.

Feitelson, D. G., Tsafrir, D., and Krakov, D. (2014). Experience with using the parallel workloads archive. Journal of Parallel and Distributed Computing, 74(10):2967–2982.

Garcia, C. G., Gómez, R. S., and Pérez, J. G. (2022). A review of ridge parameter selection: minimization of the mean squared error vs. mitigation of multicollinearity. Communications in Statistics - Simulation and Computation, pages 1–13.

Legrand, A., Trystram, D., and Zrigui, S. (2019). Adapting batch scheduling to workload characteristics: What can we expect from online learning? In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE.

Li, J., Zhang, X., Han, L., Ji, Z., Dong, X., and Hu, C. (2021). OKCM: improving parallel task scheduling in high-performance computing systems using online learning. The Journal of Supercomputing, 77(6):5960–5983.

Lublin, U. and Feitelson, D. G. (2003). The workload on parallel supercomputers: modeling the characteristics of rigid jobs. Journal of Parallel and Distributed Computing, 63(11):1105–1122.

Mu'alem, A. and Feitelson, D. (2001). Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Transactions on Parallel and Distributed Systems, 12(6):529–543.

Rosa, L., Carastan-Santos, D., and Goldman, A. (2023). An experimental analysis of regression-obtained hpc scheduling heuristics. In Job Scheduling Strategies for Parallel Processing. Springer-Verlag. To be published.

Rosa, L. and Goldman, A. (2022). In search of efficient scheduling heuristics from simulations and machine learning. In Anais Estendidos do XXIII Simpósio em Sistemas Computacionais de Alto Desempenho, pages 17–24, Porto Alegre, RS, Brasil. SBC.

Shalf, J. (2020). The future of computing beyond Moore's law. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 378(2166):20190061.

Tang, W., Lan, Z., Desai, N., and Buettner, D. (2009). Fault-aware, utility-based job scheduling on Blue Gene/P systems. In 2009 IEEE International Conference on Cluster Computing and Workshops. IEEE.

Zhang, D., Dai, D., He, Y., Bao, F. S., and Xie, B. (2020). RLScheduler: An automated HPC batch job scheduler using reinforcement learning. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15.

Zrigui, S., de Camargo, R. Y., Legrand, A., and Trystram, D. (2022). Improving the performance of batch schedulers using online job runtime classification. Journal of Parallel and Distributed Computing, 164:83–95.

Published

2023-08-05

How to Cite

de Sousa Rosa, L., Carastan-Santos, D., Goldman, A., & Trystram, D. (2023). Sobre os limites das técnicas de Machine Learning no aprendizado de políticas de escalonamento. Revista Eletrônica De Iniciação Científica Em Computação, 21(2), 61–70. https://doi.org/10.5753/reic.2023.3419