Predicting Runtime in HPC Environments for an Efficient Use of Computational Resources

Mariza Ferro; Vinicius P. Klôh; Matheus Gritz; Vitor de Sá; Bruno Schulze

doi:10.5753/wscad.2021.18513

Mariza Ferro LNCC
Vinicius P. Klôh LNCC
Matheus Gritz LNCC
Vitor de Sá LNCC
Bruno Schulze LNCC

Abstract

Understanding how scientific applications affect computational architectures, as measured by their runtime, should guide the use of computational resources in high-performance computing (HPC) systems. In this work, we analyze Machine Learning (ML) algorithms as a means of gathering knowledge about the performance of these applications from hardware events and derived performance metrics. We executed nine NAS benchmarks and collected their hardware events. These experimental results were then used to train a Neural Network, a Decision Tree Regressor, and a Linear Regression model to predict the runtime of scientific applications from the performance metrics.
