OpenMP and StarPU Abreast: the Impact of Runtime in Task-Based Block QR Factorization Performance
Resumo
Directed Acyclic Graph (DAG) is a high-level abstraction to describe the activities of parallel applications. A DAG contains tasks (nodes) and dependencies (edges) in the task-based programming paradigm. Application performance depends on the choices of the runtime system. Our work intends to evaluate and compare the performance of three different runtime systems, GCC/libgomp, LLVM/libomp, and StarPU for a task-based dense block QR factorization. The obtained results show that while GCC/libgomp achieves up to 5.4% better performance in the best case, it has scalability problems for finegrain problems with large DAGs. LLVM/libomp and StarPU are more scalable, and StarPU is much faster in task creation and submission than the other runtimes.
Referências
[Anderson et al. 1999] Anderson, E., Bai, Z., Bischof, C., Blackford, S., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. (1999) LAPACK Users’ guide, volume 9. Siam.
[Augonnet et al. 2011] Augonnet, C., Thibault, S., Namyst, R., and Wacrenier, P.-A. (2011) Starpu: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23(2):187–198.
[Broquedis et al. 2010] Broquedis, F., Furmento, N., Goglin, B., Wacrenier, P.-A., and Namyst, R. (2010). Forestgomp: an efficient openmp environment for numa architectures. International Journal of Parallel Programming, 38(5-6):418–439.
[Broquedis et al. 2012] Broquedis, F., Gautier, T., and Danjean, V. (2012). Libkomp, an efficient openmp runtime system for both fork-join and data flow paradigms. In International Workshop on OpenMP, pages 102–115. Springer.
[Buttari et al. 2008] Buttari, A., Langou, J., Kurzak, J., and Dongarra, J. (2008). Parallel tiled qr factorization for multicore architectures. Concurrency and Computation: Practice and Experience, 20(13):1573–1590.
[Ceballos et al. 2017] Ceballos, G., Grass, T., Hugo, A., and Black-Schaffer, D. (2017) Taskinsight: Understanding task schedules effects on memory and performance. In Intl. Work. on Prog. Models and Applic. for Multi. and Manycores, pages 11–20. ACM.
[Dagum and Menon 1998] Dagum, L. and Menon, R. (1998). Openmp: An industrystandard api for shared-memory programming. Comp. in Sci. & Eng., 5(1):46–55.
[de Oliveira Stein et al. 2010] de Oliveira Stein, B., de Kergommeaux, J. C., and Mounié, G. (2010). Pajé trace file format. Technical report, ID-IMAG, Grenoble, France, 2002.
[Dongarra et al. 2017] Dongarra, J., Tomov, S., Luszczek, P., Kurzak, J., Gates, M., Yamazaki, I., Anzt, H., Haidar, A., and Abdelfattah, A. (2017). With extreme computing, the rules have changed. Computing in Science & Engineering, 19(3):52.
[Eichenberger et al. 2013] Eichenberger, A. E., Mellor-Crummey, J., Schulz, M., Wong, M., Copty, N., Dietrich, R., Liu, X., Loh, E., and Lorenz, D. (2013). Ompt: An openmp tools application programming interface for performance analysis. In International Workshop on OpenMP, pages 171–185. Springer.
[Garcia Pinto et al. 2018] Garcia Pinto, V., Mello Schnorr, L., Stanisic, L., Legrand, A., Thibault, S., and Danjean, V. (2018). A visual performance analysis framework for task-based parallel applications running on hybrid clusters. CCPE, 30(18):e4472.
[Gautier et al. 2013] Gautier, T., Lementec, F., Faucher, V., and Raffin, B. (2013). X-kaapi: a multi paradigm runtime for multicore architectures. In 2013 42nd International Conference on Parallel Processing, pages 728–735. IEEE.
[Givens 1958] Givens, W. (1958). Computation of plain unitary rotations transforming a general matrix to triangular form. J. of the Soc. for Ind. and Appl. Math., 6(1):26–50.
[Golub and Van Loan 2012] Golub, G. H. and Van Loan, C. F. (2012). Matrix computations, volume 3. JHU press.
[Householder 1958] Householder, A. S. (1958). Unitary triangularization of a nonsymmetric matrix. Journal of the ACM (JACM), 5(4):339–342.
[Knüpfer et al. 2012] Knüpfer, A., Rössel, C., an Mey, D., Biersdorff, S., Diethelm, K., Eschweiler, D., Geimer, M., Gerndt, M., Lorenz, D., Malony, A., et al. (2012). Score-p:A joint performance measurement run-time infrastructure for periscope, scalasca, tau, and vampir. In Tools for High Performance Computing 2011, pages 79–91. Springer.
[Lima et al. 2017] Lima, J. V., Raïs, I., Lefèvre, L., and Gautier, T. (2017). Performance and energy analysis of openmp runtime systems with dense linear algebra algorithms.In Intl. Symp. on Comp. Arch. and High Perf. Comp. Workshops, pages 7–12. IEEE.
[LLVM 2015] LLVM, P. (2015). Openmp*: Support for the openmp language.
[Lopez 2015] Lopez, F. (2015). Task-based multifrontal QR solver for heterogeneous architectures. PhD thesis, Université de Toulouse, Université Toulouse III-Paul Sabatier.
[Schmidt 1908] Schmidt, E. (1908). Über die auflösung linearer gleichungen mit unendlich vielen unbekannten. R. del Circolo Matematico di Palermo (1884-1940), 25(1):53–77.
[Terboven et al. 2012] Terboven, C., Schmidl, D., Cramer, T., and an Mey, D. (2012). Assessing openmp tasking implementations on numa architectures. In Intl. Work. on OpenMP, pages 182–195. Springer.
[Willhalm and Popovici 2008] Willhalm, T. and Popovici, N. (2008). Putting intel R threading building blocks to work. In Proceedings of the 1st international workshop on Multicore software engineering, pages 3–4. ACM.