Improving performance visualization of OpenMP task-based applications
Abstract
OpenMP is becoming an increasingly powerful environment for exploiting task-based parallelism. Recent versions of the specification add support for new task clauses, while the OMPT interface provides a standard API for performance monitoring. In this paper, we present a workflow to improve the performance visualization of OpenMP task-based applications. We rely on open-source solutions, such as the Tikki OMPT tracing tool and the StarVZ performance analysis framework, to create enriched space-time views. We demonstrate this workflow with three applications: Strassen matrix multiplication, SparseLU factorization, and a dense Cholesky factorization. For two of them, our strategy enables a better understanding of the performance impact of the OpenMP task depend and priority clauses and of the taskwait construct.
Published: 23/10/2024
How to Cite
PINTO, Vinícius Garcia; SOUSA FILHO, Christian Einhardt. Improving performance visualization of OpenMP task-based applications. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 25., 2024, São Carlos/SP. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 156-167. DOI: https://doi.org/10.5753/sscad.2024.244795.