Detection, Evaluation and Mitigation of Resource Affinity and Communication Contention Problems in a Task-Based Runtime over Heterogeneous Clusters

Lucas Leandro Nesi (UFRGS); Lucas Schnorr (UFRGS)

DOI: https://doi.org/10.5753/wscad.2020.14076

Abstract

The complexity of high performance computing (HPC) platforms presents challenges in parallel application development. The Task-Based paradigm is a candidate to reduce some of the programmer's burden. However, because of the platforms' complexity, resource affinity and communication contention might cause performance problems. This work presents a case study of these problems, employing the LU factorization of the Chameleon dense linear algebra solver on top of the Task-Based runtime StarPU over 21 heterogeneous nodes. We present possible configurations to mitigate performance degradation and conduct an extensive analysis of their interaction. The results show a performance improvement of 16% without changing the application source code.
