Scalability and efficiency analysis of LU factorization on CPU vs. GPU

Estevan Braz Brandt Costa; Fabio Takeshi Matsunaga; Jacques Duílio Brancher

Estevan Braz Brandt Costa UEL
Fabio Takeshi Matsunaga UEL
Jacques Duílio Brancher UEL

Abstract

With the advent of the GPU (Graphics Processing Unit) and its use to assist mathematical computation through the emergence of GPGPU (general-purpose computing on graphics processing units), several platforms have emerged that let developers use this architecture to their advantage. Although developing algorithms for the GPU has become simpler and faster, there is still much to do and to consider when designing an algorithm that uses this technology. This work presents a study of the main characteristics that must be considered when developing an algorithm to run on the GPU, such as data transfer between CPU and GPU. A case study was carried out and analyzed through an implementation of the LU factorization algorithm, and the results showed an average performance gain of 93% with all the optimizations considered. The main factors that contributed to the performance improvement were memory management and the types of processes and data that are executed and transferred in the kernels.
