Memory-Side Acceleration and Sparse Compression for Quantized Packed Convolutions

  • Alex Weaver, University of North Texas
  • Krishna Kavi, University of North Texas
  • Pranathi Vasireddy, University of North Texas
  • Gayatri Mehta, University of North Texas

Abstract

Neural network compression techniques, such as parameter quantization and weight pruning, have made deep neural network (DNN) inference more efficient on low-power devices such as MCUs and edge devices by reducing memory and computation overhead with minimal impact on model accuracy. To avoid storing and computing zeros, these techniques require sparse data representations, which introduce execution overhead to locate the values needed by a computation. Sparse matrix formats such as Compressed Sparse Row (CSR) and other, more recent designs are computationally inefficient when applied to the convolution algorithm and are also inefficient for storing quantized values. In this paper, we present an intuitive extension of CSR called Partitioned Sparse Representation (PSR), together with a convolution algorithm that hides the cost of indexing overhead on a simple memory-side RISC-like core. PSR divides the entire weight array of a convolution layer into partitions, allowing smaller (e.g., 8-bit) indexes that reduce storage overhead. We also rely on a memory-side accelerator called HHT, a programmable, near-memory RISC-like co-processor that enables efficient processing of sparse data (including PSR). We show that HHT together with PSR allows the CPU to take full advantage of RISC-V packed instructions on sparse quantized data. We observe as much as a 10x speedup for sparse CONV with HHT over a baseline in which the CPU performs all computations on dense data. HHT performs end-to-end image classification inference 2.7x faster than the baseline and achieves 70% energy savings over sparse CONV with the CPU performing all computations.
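The abstract's central storage idea, partitioning a layer's weight array so that per-partition column indexes fit in 8 bits rather than the 32 bits plain CSR would use, can be illustrated with a minimal sketch. The layout and function names below are illustrative assumptions, not the paper's actual PSR format; the paper's PSR partitions the full weight array of a CONV layer, while this sketch partitions a generic 2-D sparse array column-wise into 256-column blocks:

```python
import numpy as np

def to_psr(weights, part_cols=256):
    """Sketch of a partitioned CSR: split a sparse 2-D int8 weight array
    column-wise into blocks of at most `part_cols` columns, so each
    nonzero's within-partition column index fits in a uint8 instead of
    the int32 index plain CSR would need."""
    rows, cols = weights.shape
    partitions = []
    for start in range(0, cols, part_cols):
        block = weights[:, start:start + part_cols]
        values, col_idx, row_ptr = [], [], [0]
        for r in range(rows):
            nz = np.nonzero(block[r])[0]        # local column indices < 256
            values.extend(block[r, nz])
            col_idx.extend(nz)
            row_ptr.append(len(values))
        partitions.append({
            "col_start": start,                              # partition offset
            "values":  np.array(values, dtype=np.int8),      # quantized weights
            "col_idx": np.array(col_idx, dtype=np.uint8),    # 8-bit indexes
            "row_ptr": np.array(row_ptr, dtype=np.int32),
        })
    return partitions

def psr_to_dense(partitions, shape):
    """Reconstruct the dense array, to check the round trip."""
    dense = np.zeros(shape, dtype=np.int8)
    for p in partitions:
        for r in range(len(p["row_ptr"]) - 1):
            lo, hi = p["row_ptr"][r], p["row_ptr"][r + 1]
            cols = p["col_start"] + p["col_idx"][lo:hi].astype(np.int64)
            dense[r, cols] = p["values"][lo:hi]
    return dense
```

With 8-bit values and 8-bit indexes, each nonzero costs 2 bytes of payload instead of the 5 bytes (int8 value + int32 index) of a conventional CSR, at the price of per-partition row pointers; the actual trade-off in the paper depends on the layer's sparsity and partition size.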
Published
2022-11-02
How to Cite
WEAVER, Alex et al. Memory-Side Acceleration and Sparse Compression for Quantized Packed Convolutions. Proceedings of the International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), [S.l.], p. 81-90, Nov. 2022. ISSN 0000-0000. Available at: <https://sol.sbc.org.br/index.php/sbac-pad/article/view/28235>. Accessed: 17 May 2024.