TangramFP: Energy-Efficient, Bit-Parallel, Multiply-Accumulate for Deep Neural Networks

  • Yuan Yao, Uppsala University
  • Xiaoyue Chen, Uppsala University
  • Hannah Atmer, Uppsala University
  • Stefanos Kaxiras, Uppsala University

Abstract


As energy consumption becomes a primary concern in deep learning acceleration, optimizing not only data movement but also computation grows increasingly important. The basic element of computation, the Multiply-Accumulate (MAC) unit, performs the operation X · Y + Z; it forms the compute cores of systolic arrays such as Google's TPU and Nvidia's Tensor Cores and is found in practically every deep neural network (DNN) accelerator.

In this work, we aim to reduce the energy needs of bit-parallel MACs without perceptible impact on precision and without affecting the structure of the overall accelerator architecture; in other words, we aim for an energy-efficient drop-in MAC replacement.

Although there is a significant body of work on efficient approximate multipliers and MACs, we propose a novel approach: a tunable floating-point MAC design, TangramFP, that can deliver the full precision of a standard implementation yet dynamically adjusts to eliminate ineffectual computation. Unlike state-of-the-art approaches based on truncated multiplication, TangramFP introduces a new class of multipliers in which the input operands are split and partial products are selectively generated (by enabling or disabling different areas of the logical multiplier array) and added together. In a hardware implementation, this is achieved by decomposing a large multiplier into four smaller ones at the same overall hardware cost.

We demonstrate that TangramFP's precision can adhere to the same bounds (measured as Unit in the Last Place (ULP) error) as standard IEEE FP16 arithmetic and delivers better precision than a state-of-the-art approach based on bit-serial truncated multiplication aimed at eliminating ineffectual computation in DNNs. At the same time, TangramFP is a drop-in replacement for the standard MAC design: it has approximately the same mean ULP error, the same area, and the same latency, while achieving up to 36.57% dynamic power savings (27.44% with a mean error close to the standard design).
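The central mechanism described above, splitting each operand and selectively generating the four resulting partial products, can be illustrated with a short sketch. The Python code below is a minimal, purely illustrative model: the split width, the function names, and the enable-mask interface are assumptions chosen for exposition, not the paper's actual gating logic or hardware design.

# Illustrative model of the split-multiplier idea: a b-bit x b-bit
# multiplication is decomposed into four (b/2)-bit multiplications whose
# partial products can be selectively enabled. Names and the skip choice
# below are illustrative assumptions, not TangramFP's exact scheme.

def split(x, half_bits):
    """Split an unsigned significand into (high, low) halves."""
    lo = x & ((1 << half_bits) - 1)
    hi = x >> half_bits
    return hi, lo

def tunable_multiply(x, y, half_bits, enable=(True, True, True, True)):
    """
    Multiply two significands using four small multipliers.
    enable = (hh, hl, lh, ll) selects which partial products are generated;
    disabling low-significance ones trades a bounded error for energy.
    """
    xh, xl = split(x, half_bits)
    yh, yl = split(y, half_bits)
    p = 0
    if enable[0]:
        p += (xh * yh) << (2 * half_bits)   # high x high
    if enable[1]:
        p += (xh * yl) << half_bits         # high x low
    if enable[2]:
        p += (xl * yh) << half_bits         # low x high
    if enable[3]:
        p += xl * yl                        # low x low
    return p

# Example with two 11-bit significands (FP16 has an 11-bit significand
# including the hidden bit); dropping only the low x low product leaves
# the result close to exact.
x, y = 0b10110011101, 0b11001010110
exact = x * y
approx = tunable_multiply(x, y, half_bits=6, enable=(True, True, True, False))
print(exact, approx, exact - approx)

In this model, disabling the low x low product removes only the least-significant contribution to the product, which mirrors how skipping ineffectual partial products can save energy while keeping the error within a small ULP bound.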
Keywords: Energy consumption, Tensors, Measurement units, High performance computing, Artificial neural networks, Hardware, Energy efficiency, Systolic arrays, Standards, Logic arrays, Bit-Parallel, Energy-Efficient Multiply Accumulate, Deep Neural Networks
Published
13/11/2024
YAO, Yuan; CHEN, Xiaoyue; ATMER, Hannah; KAXIRAS, Stefanos. TangramFP: Energy-Efficient, Bit-Parallel, Multiply-Accumulate for Deep Neural Networks. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 36., 2024, Hilo/Hawaii. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 1-12.