Impact of GPGPU Memory Architecture on the Speed of Stencil Computation

Thiago Nasciutti (ITA); Jairo Panetta (ITA)

DOI: https://doi.org/10.5753/wscad.2016.14251

Abstract

This work presents a performance analysis of 3D stencil computation on GPGPUs (General-Purpose Graphics Processing Units), focusing on proper use of the memory hierarchy. It evaluates implementations that exploit shared memory, the read-only cache, internalization of the Z loop, and register reuse. Each implementation is tested with several stencil sizes and input domain sizes, which makes it possible to observe how these parameters influence final performance. The work concludes that the size of the L2 cache affects performance in some implementations, and that the most suitable implementation combines the read-only cache, internalization of the Z loop, and register reuse.
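The Z-loop internalization and register reuse mentioned above can be illustrated with a minimal CPU-side sketch. This is not the authors' CUDA code: it is a plain Python emulation under stated assumptions, in which each (x, y) pair stands in for one GPU thread, a `deque` stands in for the registers holding that thread's Z neighbors, and the stencil is assumed to be a symmetric star of radius `len(coef) - 1`. The function names `stencil_naive` and `stencil_z_window` are hypothetical. Shared memory and the read-only cache have no counterpart here.

```python
from collections import deque


def stencil_naive(grid, coef):
    """Reference 3D star stencil on grid[z][y][x]; radius is len(coef) - 1."""
    r = len(coef) - 1
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    for z in range(r, nz - r):
        for y in range(r, ny - r):
            for x in range(r, nx - r):
                acc = coef[0] * grid[z][y][x]
                for k in range(1, r + 1):
                    acc += coef[k] * (grid[z - k][y][x] + grid[z + k][y][x]
                                      + grid[z][y - k][x] + grid[z][y + k][x]
                                      + grid[z][y][x - k] + grid[z][y][x + k])
                out[z][y][x] = acc
    return out


def stencil_z_window(grid, coef):
    """Same stencil, marching along Z with a sliding window per (x, y) column.

    The window emulates the registers of one GPU thread: each Z step reads
    only one new value along Z and reuses the other 2*r values.
    """
    r = len(coef) - 1
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    if nz < 2 * r + 1:
        return out  # no interior points along Z
    for y in range(r, ny - r):
        for x in range(r, nx - r):  # one (x, y) pair = one emulated thread
            # Preload planes 0 .. 2r-1; the loop appends plane z + r.
            window = deque((grid[z][y][x] for z in range(2 * r)),
                           maxlen=2 * r + 1)
            for z in range(r, nz - r):
                window.append(grid[z + r][y][x])  # only new Z read per step
                acc = coef[0] * window[r]          # window[r] is grid[z][y][x]
                for k in range(1, r + 1):
                    acc += coef[k] * (window[r - k] + window[r + k]
                                      + grid[z][y - k][x] + grid[z][y + k][x]
                                      + grid[z][y][x - k] + grid[z][y][x + k])
                out[z][y][x] = acc
    return out
```

In this sketch the X and Y neighbors are still read from `grid` at every step; in a GPU implementation those are the accesses that shared memory or the read-only cache would serve, while the Z neighbors stay in registers.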
