Uso da Classificação Dwarf Mine para a Avaliação Comparativa entre a Arquitetura CUDA e Multicores
Abstract
The use of graphics processing units (GPUs) to accelerate general-purpose applications has been gaining popularity. However, despite the gains obtained with some applications ported to GPUs, there is still no clear definition of the behavior to be expected on these architectures. In this context, this paper presents a comparison between the CUDA GPU architecture and the Nehalem and Core 2 Duo multicores, considering three applications belonging to categories of the Dwarf Mine classification. When comparing CUDA to the system with two Nehalem processors, the results showed similar performance and a speedup of 4, which indicates a positive mapping of three important application classes to GPUs.
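The paper's benchmark kernels are not reproduced here. As a purely illustrative sketch of the kind of GPU-versus-multicore port being compared, the example below writes a SAXPY kernel (a representative of the dense linear algebra dwarf) once for CUDA and once as a plain CPU loop; all names, sizes, and launch parameters are assumptions for illustration and are not taken from the paper.

// Illustrative only: SAXPY (y = a*x + y) as a CUDA kernel and as a CPU loop.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// GPU version: each thread updates one element of y.
__global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// CPU version: the same computation as a sequential loop
// (a "#pragma omp parallel for" would give the multicore variant).
void saxpy_cpu(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 20;          // arbitrary problem size
    const float a = 2.0f;
    const size_t bytes = n * sizeof(float);

    // Host data.
    float *x = (float *)malloc(bytes);
    float *y = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Device data and host-to-device copies.
    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y, bytes, cudaMemcpyHostToDevice);

    // Launch with 256 threads per block, enough blocks to cover n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    saxpy_gpu<<<blocks, threads>>>(n, a, dx, dy);
    cudaMemcpy(y, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] after GPU saxpy: %f\n", y[0]);  // expected 4.0

    cudaFree(dx);
    cudaFree(dy);
    free(x);
    free(y);
    return 0;
}

Compiled with nvcc, this sketch only shows the structural difference between the two ports (explicit memory transfers and grid/block configuration on the GPU side versus a loop on the CPU side); the paper's actual comparison uses NAS-style applications drawn from Dwarf Mine categories.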
Published
20/07/2010
How to Cite
PILLA, Laércio Lima; NAVAUX, Philippe Olivier Alexandre. Uso da Classificação Dwarf Mine para a Avaliação Comparativa entre a Arquitetura CUDA e Multicores. In: WORKSHOP EM DESEMPENHO DE SISTEMAS COMPUTACIONAIS E DE COMUNICAÇÃO (WPERFORMANCE), 9., 2010, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2010. p. 1818-1830. ISSN 2595-6167.