Strategies to Improve the Performance and Energy Efficiency of Stencil Computations for NVIDIA GPUs

Pablo José Pavan; Matheus da Silva Serpa; Víctor Martínez; Edson Luiz Padoin; Jairo Panetta; Philippe O. A. Navaux

doi:10.5753/wperformance.2018.3348

Pablo José Pavan UFRGS
Matheus da Silva Serpa UFRGS
Víctor Martínez UFRGS
Edson Luiz Padoin UFRGS / UNIJUI
Jairo Panetta ITA
Philippe O. A. Navaux UFRGS

Abstract

The energy consumption and performance of parallel systems are a growing concern for new large-scale systems, and research in response to this challenge aims at building more energy-efficient systems. In this context, we improve performance and energy efficiency by developing three strategies that exploit the GPU memory subsystem (global, shared, and read-only memory). We also develop two optimizations that exploit data locality and the registers of the GPU architecture. Applied to GPU algorithms for stencil applications, these optimizations achieve a performance improvement over the naive version of up to 201.5% on the K80 and 264.6% on the P100, using shared memory and the read-only cache, respectively. The computational results show that combining read-only memory, Z-axis internalization of the stencil application, and reuse of architecture-specific registers increases energy efficiency by up to 255.6% on the K80 and 314.8% on the P100.
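As a rough illustration of the last two ideas named in the abstract, the sketch below contrasts a naive 7-point stencil with a version that walks each (x, y) column along Z and keeps the Z neighbours in local variables. This is only one common reading of "Z-axis internalization" and register reuse, written as CPU-side Python rather than the authors' GPU code. The function names, the 7-point Laplacian, and the flat array layout are all assumptions made for the example.

```python
def idx(x, y, z, nx, ny):
    """Flat index of (x, y, z) in an array stored with x fastest, z slowest."""
    return (z * ny + y) * nx + x


def laplacian_naive(grid, nx, ny, nz):
    """7-point stencil: every neighbour is re-read from the array."""
    out = list(grid)  # boundary cells are copied through unchanged
    plane = nx * ny
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                c = idx(x, y, z, nx, ny)
                out[c] = (grid[c - 1] + grid[c + 1]
                          + grid[c - nx] + grid[c + nx]
                          + grid[c - plane] + grid[c + plane]
                          - 6 * grid[c])
    return out


def laplacian_z_internalized(grid, nx, ny, nz):
    """Same stencil, with the Z loop moved inside each (x, y) column.

    On a GPU, each (x, y) pair would be one thread (assumption); the
    locals `behind`, `current`, and `front` stand in for registers.
    """
    out = list(grid)
    if nz < 3:
        return out  # no interior points along Z
    plane = nx * ny
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            c = idx(x, y, 1, nx, ny)
            behind = grid[c - plane]   # value at z - 1
            current = grid[c]          # value at z
            for z in range(1, nz - 1):
                front = grid[c + plane]  # the only new Z read per step
                out[c] = (grid[c - 1] + grid[c + 1]
                          + grid[c - nx] + grid[c + nx]
                          + behind + front
                          - 6 * current)
                # Slide the window one step along Z.
                behind, current = current, front
                c += plane
    return out
```

In the second version each step along Z loads one new Z value instead of three, which is the kind of reuse the abstract attributes to registers. The X and Y neighbours are still read from the array here; on a GPU those reads are where shared memory or the read-only cache would come in.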
