A Distributed and Storage-Aware Approach to Large-Scale Cholesky Factorization

Carla Cusihuallpa, UNICAMP
Rodrigo Ceccato, UNICAMP
Sandro Rigo, UNICAMP
Guido Araujo, UNICAMP
Hervé Yviquel, UNICAMP

Abstract

Cholesky factorization is a core operation in scientific computing, yet its scalability is often constrained by memory capacity when processing extremely large dense matrices. This work introduces an out-of-core Cholesky factorization algorithm for symmetric positive-definite matrices that integrates GPU acceleration, block-wise lossless compression, and parallel I/O to overcome this constraint. The approach uses the OMPC runtime for asynchronous task scheduling and stores the matrix on disk in HDF5, taking advantage of the Lustre file system’s parallel I/O capabilities in distributed environments. Tiles are decompressed just-in-time on the GPU, which significantly reduces host memory usage, storage footprint, and end-to-end data movement overhead (from disk through the CPU to the GPU) without compromising numerical accuracy. Experimental results show that the proposed method scales across 8 GPU nodes and successfully factorizes matrices of up to 3M × 3M. In comparison, SLATE could handle matrices only up to 700K × 700K, and the proposed algorithm achieved up to 41% higher throughput. These results demonstrate the algorithm’s scalability and competitiveness beyond memory-constrained in-core solutions, offering a practical path toward extreme-scale scientific applications.

Keywords: Symmetric matrices, Runtime, Scientific computing, Scalability, Memory management, Graphics processing units, Throughput, Libraries, Matrix decomposition, Next generation networking
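
To make the idea in the abstract concrete, the following is a minimal sketch of a tiled Cholesky factorization in which every tile is kept losslessly compressed and is decompressed only when a kernel needs it. It is an illustration under stated assumptions, not the authors' implementation: it runs on the CPU with NumPy, an in-memory dictionary stands in for the HDF5 file on Lustre, zlib stands in for whichever compressor the paper uses, and the tile operations run sequentially instead of being scheduled as asynchronous tasks by OMPC. The names `CompressedTileStore`, `store_from_dense`, `tiled_cholesky`, and `assemble_lower` are hypothetical.

```python
import zlib

import numpy as np


class CompressedTileStore:
    """Hypothetical stand-in for the on-disk matrix: one zlib blob per tile."""

    def __init__(self, tile_size):
        self.tile_size = tile_size
        self._blobs = {}

    def put(self, i, j, tile):
        # Lossless compression of the raw float64 bytes of the tile.
        data = np.ascontiguousarray(tile, dtype=np.float64).tobytes()
        self._blobs[(i, j)] = zlib.compress(data)

    def get(self, i, j):
        # Decompress only when the tile is needed; copy to get a writable array.
        raw = zlib.decompress(self._blobs[(i, j)])
        shape = (self.tile_size, self.tile_size)
        return np.frombuffer(raw, dtype=np.float64).reshape(shape).copy()


def store_from_dense(a, tile_size):
    """Split the lower triangle of a dense SPD matrix into compressed tiles."""
    num_tiles = a.shape[0] // tile_size
    store = CompressedTileStore(tile_size)
    for i in range(num_tiles):
        for j in range(i + 1):
            rows = slice(i * tile_size, (i + 1) * tile_size)
            cols = slice(j * tile_size, (j + 1) * tile_size)
            store.put(i, j, a[rows, cols])
    return store, num_tiles


def tiled_cholesky(store, num_tiles):
    """Right-looking tiled Cholesky; lower-triangular tiles are overwritten by L."""
    for k in range(num_tiles):
        # Factorize the diagonal tile: A_kk = L_kk L_kk^T.
        l_kk = np.linalg.cholesky(store.get(k, k))
        store.put(k, k, l_kk)
        # Panel solve: L_ik = A_ik L_kk^{-T}. A general solver is used here
        # for brevity; a triangular solve would be the usual choice.
        for i in range(k + 1, num_tiles):
            l_ik = np.linalg.solve(l_kk, store.get(i, k).T).T
            store.put(i, k, l_ik)
        # Trailing update: A_ij -= L_ik L_jk^T for the remaining lower tiles.
        for i in range(k + 1, num_tiles):
            l_ik = store.get(i, k)
            for j in range(k + 1, i + 1):
                l_jk = store.get(j, k)
                store.put(i, j, store.get(i, j) - l_ik @ l_jk.T)


def assemble_lower(store, num_tiles):
    """Gather the factor tiles back into one dense lower-triangular matrix."""
    ts = store.tile_size
    l = np.zeros((num_tiles * ts, num_tiles * ts))
    for i in range(num_tiles):
        for j in range(i + 1):
            l[i * ts:(i + 1) * ts, j * ts:(j + 1) * ts] = store.get(i, j)
    return l
```

In this sketch at most three decompressed tiles are live at any moment, which mirrors, at toy scale, how an out-of-core scheme keeps the working set small regardless of the total matrix size. Only the lower triangle is stored, since the matrix is symmetric.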