Mitigating Contention in Execution Units with Instruction-Aware Mapping

Matheus Serpa; Eduardo Cruz; Matthias Diener; Antonio Carlos Beck; Philippe Navaux

doi:10.5753/wscad.2020.14073

Matheus Serpa UFRGS
Eduardo Cruz IFPR
Matthias Diener University of Illinois
Antonio Carlos Beck UFRGS
Philippe Navaux UFRGS

Abstract

Parallel applications running on SMT (Simultaneous Multithreading) processors compete for execution units. The problem gets worse when the threads execute similar instructions, such as floating-point, integer, load, and store instructions. In these cases, the same type of instruction is dispatched for execution, which leads to performance losses due to contention in those units. This work aims to provide a mechanism for mapping multiple parallel applications onto SMT processors. The mechanism focuses on improving performance by mitigating contention in the execution units when running parallel applications. To that end, threads that stress the same execution units are mapped to different cores. The results show average performance gains of 29.1% and 17.4% compared with the Linux operating system scheduler and with a round-robin mapping, respectively.
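To make the mapping idea concrete, the following is a minimal sketch, not the mechanism evaluated in the paper. It assumes each thread has already been profiled into a hypothetical instruction mix (fractions of floating-point, integer, load, and store instructions) and that each core offers two SMT hardware contexts. The names `map_threads`, `overlap`, and `CLASSES`, the dot-product similarity score, and the greedy placement order are all illustrative assumptions.

```python
from typing import Dict, List

# Hypothetical instruction classes, taken from the types named in the abstract.
CLASSES = ("fp", "int", "load", "store")


def overlap(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two instruction mixes.

    A higher value means both threads lean on the same execution units.
    This score is an assumption for illustration only.
    """
    return sum(a[c] * b[c] for c in CLASSES)


def map_threads(
    mixes: Dict[str, Dict[str, float]], n_cores: int, smt_ways: int = 2
) -> List[List[str]]:
    """Greedily place threads so that similar threads land on different cores."""
    if len(mixes) > n_cores * smt_ways:
        raise ValueError("more threads than hardware contexts")

    cores: List[List[str]] = [[] for _ in range(n_cores)]

    # Place the most skewed threads first (largest dominant fraction),
    # breaking ties by name so the result is deterministic.
    order = sorted(mixes, key=lambda t: (-max(mixes[t].values()), t))

    for thread in order:
        free = [i for i in range(n_cores) if len(cores[i]) < smt_ways]
        # Pick the core whose resident threads overlap least with this one;
        # prefer emptier cores, then lower core indices, on ties.
        best = min(
            free,
            key=lambda i: (
                sum(overlap(mixes[thread], mixes[u]) for u in cores[i]),
                len(cores[i]),
                i,
            ),
        )
        cores[best].append(thread)
    return cores


# Illustrative profiles: two floating-point-heavy and two load-heavy threads.
demo_mixes = {
    "fp0": {"fp": 0.7, "int": 0.1, "load": 0.1, "store": 0.1},
    "fp1": {"fp": 0.7, "int": 0.1, "load": 0.1, "store": 0.1},
    "mem0": {"fp": 0.1, "int": 0.1, "load": 0.6, "store": 0.2},
    "mem1": {"fp": 0.1, "int": 0.1, "load": 0.6, "store": 0.2},
}
placement = map_threads(demo_mixes, n_cores=2)
print(placement)
```

In this sketch the two floating-point-heavy threads end up on different cores, each sharing a core with a load-heavy thread. On Linux, a placement like this could then be enforced by pinning each thread to the hardware contexts of its assigned core, for example with `os.sched_setaffinity`.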
