A Performance and Energy Study of GPU-Resident Preconditioners for Conjugate Gradient Solvers: In the Context of Existing and Novel Approaches
Abstract
Optimizing a particular subprogram out of the set of Basic (sparse) Linear Algebra Subprograms (BLAS) for a given architecture is a common topic of research. In applications, however, these BLAS functions rarely appear in isolation; usually, many of them are used together, in various combinations and with varying inputs. As the need to solve large, sparse linear systems is ubiquitous throughout HPC applications, linear solvers constitute a realistic, sufficiently complex, and well-defined representative use case for composite BLAS routines. To this end, based on a representative set of matrices drawn from a diverse set of fields, we present a framework to study, from the performance and energy perspectives, the efficacy of a GPU-resident parallel Conjugate Gradient (CG) linear solver with different preconditioner options, including Gauss-Seidel, Jacobi, and incomplete Cholesky. We also propose a novel GPU-based preconditioner in which the triangular solves are approximated by an iterative process. The development of this preconditioner was motivated by solving large graph Laplacian linear systems, for which the existing preconditioners either perform slowly on GPU-based platforms or are not applicable. We compare the performance of these preconditioners on different hardware accelerator architectures, i.e., AMD MI250X, MI100, Nvidia A100, V100, and Jetson. Our experiments reveal performance trade-offs and provide guidance on how to select the best strategy for a given linear system, dictated by its properties and the platform of interest. We demonstrate the application of our novel preconditioner for solving linear systems with CG, including graph Laplacian systems. Overall, the framework can be utilized as a benchmark to guide informed decisions in choosing a specific preconditioner, i.e., whether it is better to rely on the performance of a triangular solver or on the performance of the sparse matrix-vector product.
Finally, by measuring the power consumed while solving the linear systems, we report the energy footprint of the solvers.
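The abstract describes approximating the triangular solves inside a preconditioner with an iterative process. As a hedged illustration of the general idea only (the paper's actual GPU implementation is not shown here, and the function name and dense-matrix representation below are illustrative assumptions), a fixed number of Jacobi-style sweeps can replace an exact triangular solve:

```python
# Illustrative sketch only, NOT the paper's implementation: approximate the
# lower-triangular solve L x = b with Jacobi sweeps
#   x <- D^{-1} (b - (L - D) x),
# the kind of fixed-point iteration that trades an inherently sequential
# triangular solve for parallel, SpMV-like sweeps on a GPU.

def jacobi_triangular_solve(L, b, sweeps=5):
    """Approximate x in L x = b, with L lower triangular and a nonzero
    diagonal, using a fixed number of Jacobi sweeps."""
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        # Each row update is independent, so one sweep maps naturally
        # onto a parallel GPU kernel.
        x_new = [0.0] * n
        for i in range(n):
            s = sum(L[i][j] * x[j] for j in range(i))  # strictly lower part
            x_new[i] = (b[i] - s) / L[i][i]
        x = x_new
    return x

# Tiny dense example; a real implementation would use a sparse format (CSR).
# For a lower-triangular L the iteration matrix is nilpotent, so n sweeps
# reproduce the exact solve; fewer sweeps give the cheap approximation used
# as a preconditioner.
L = [[4.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [2.0, 1.0, 5.0]]
b = [4.0, 7.0, 13.0]
x = jacobi_triangular_solve(L, b, sweeps=3)  # exact here: [1.0, 2.0, 1.8]
```

In a preconditioned CG iteration, such sweeps replace the two exact triangular solves of, e.g., an incomplete Cholesky preconditioner, shifting the cost profile from triangular-solver performance toward sparse matrix-vector-product performance, which is precisely the trade-off the framework evaluates.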
Keywords:
Linear systems, Jacobian matrices, Performance evaluation, Laplace equations, Power demand, Graphics processing units, Sparse matrices, Iterative methods, Standards, Convergence, Preconditioner, Iterative solver, GPU
Published
November 13, 2024
How to Cite
ŚWIRYDOWICZ, Kasia; FIROZ, Jesun; MANZANO, Joseph; HALAPPANAVAR, Mahantesh; THOMAS, Stephen; BARKER, Kevin. A Performance and Energy Study of GPU-Resident Preconditioners for Conjugate Gradient Solvers: In the Context of Existing and Novel Approaches. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 36., 2024, Hilo, Hawaii. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 70-80.