Impact of GPGPU Memory Architecture on the Speed of Stencil Computations
Abstract
This work presents a performance analysis of 3D stencil computations on GPGPUs (General-Purpose Graphics Processing Units), focusing on proper use of the memory hierarchy. The implementations evaluated exploit shared memory, the read-only cache, internalization of the Z loop, and register reuse. Each implementation is run with several stencil sizes and input domain sizes, which makes it possible to observe how these parameters influence final performance. The conclusions are that, for some implementations, the size of the L2 cache affects performance, and that the recommended implementation combines the read-only cache, Z-loop internalization, and register reuse.
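To make "internalization of the Z loop" and "register reuse" concrete, the following is a minimal CPU sketch in C++, not the authors' kernel. It assumes a star-shaped stencil with unit weights. All names (`Grid`, `stencil_naive`, `stencil_zloop`, `MAX_R`) are hypothetical and chosen only for illustration. The two outer loops stand in for a 2D grid of GPU threads. Each "thread" marches along Z and keeps the 2r+1 values of its own Z column in a small local window, so each of those values is loaded from the input array once instead of 2r+1 times.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical upper bound on the stencil radius; it fixes the window size.
constexpr int MAX_R = 4;

// Dense 3D grid stored with x fastest, then y, then z (illustrative type).
struct Grid {
    int nx, ny, nz;
    std::vector<float> v;
    Grid(int nx_, int ny_, int nz_)
        : nx(nx_), ny(ny_), nz(nz_),
          v(static_cast<std::size_t>(nx_) * ny_ * nz_, 0.0f) {}
    std::size_t idx(int x, int y, int z) const {
        return (static_cast<std::size_t>(z) * ny + y) * nx + x;
    }
    float& at(int x, int y, int z) { return v[idx(x, y, z)]; }
    float at(int x, int y, int z) const { return v[idx(x, y, z)]; }
};

// Reference version: every neighbour is read from the input array.
// `out` is assumed to have the same dimensions as `in`.
void stencil_naive(const Grid& in, Grid& out, int r) {
    for (int z = r; z < in.nz - r; ++z)
        for (int y = r; y < in.ny - r; ++y)
            for (int x = r; x < in.nx - r; ++x) {
                float s = in.at(x, y, z);
                for (int d = 1; d <= r; ++d)
                    s += in.at(x - d, y, z) + in.at(x + d, y, z) +
                         in.at(x, y - d, z) + in.at(x, y + d, z) +
                         in.at(x, y, z - d) + in.at(x, y, z + d);
                out.at(x, y, z) = s;
            }
}

// Z loop internalized: one (x, y) "thread" walks the whole Z column and
// keeps in(x, y, z-r .. z+r) in `win`, which plays the role of registers.
void stencil_zloop(const Grid& in, Grid& out, int r) {
    assert(r >= 1 && r <= MAX_R);
    if (in.nz < 2 * r + 1) return;  // no interior point along Z
    for (int y = r; y < in.ny - r; ++y)
        for (int x = r; x < in.nx - r; ++x) {
            float win[2 * MAX_R + 1];
            // Preload planes 0 .. 2r-1 one slot ahead of their final place.
            for (int i = 0; i < 2 * r; ++i) win[i + 1] = in.at(x, y, i);
            for (int z = r; z < in.nz - r; ++z) {
                // Slide the window by one plane and load the single new value.
                for (int i = 0; i < 2 * r; ++i) win[i] = win[i + 1];
                win[2 * r] = in.at(x, y, z + r);
                float s = win[r];
                for (int d = 1; d <= r; ++d)
                    // In-plane neighbours still come from the array; on a GPU
                    // they would be served by shared memory or a cache.
                    s += in.at(x - d, y, z) + in.at(x + d, y, z) +
                         in.at(x, y - d, z) + in.at(x, y + d, z) +
                         win[r - d] + win[r + d];
                out.at(x, y, z) = s;
            }
        }
}
```

Calling `stencil_zloop(in, out, r)` on a grid with at least 2r+1 planes fills the interior of `out` with the same values as `stencil_naive`. In a GPU version of this sketch, the in-plane reads would be the ones routed through shared memory or the read-only cache, and the window would live in per-thread registers.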
Published
5 October 2016
How to Cite
NASCIUTTI, Thiago; PANETTA, Jairo. Impacto da Arquitetura de Memória de GPGPUs na Velocidade da Computação de Estênceis. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 17., 2016, Aracajú. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2016. p. 97-108. DOI: https://doi.org/10.5753/wscad.2016.14251.