On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators
Resumo
Reducing power consumption and increasing efficiency is a key concern for many applications. How to design highly efficient computing elements while maintaining enough flexibility within a domain of applications is a fundamental question. In this paper, we present how broadcast buses can eliminate the use of power hungry multi-ported register files in the context of data-parallel hardware accelerators for linear algebra operations. We demonstrate an algorithm/architecture co-design for the mapping of different collective communication operations, which are crucial for achieving performance and efficiency in most linear algebra routines, such as GEMM, SYRK and matrix transposition. We compare a broadcast bus based architecture with conventional SIMD, 2D-SIMD and flat register file for these operations in terms of area and energy efficiency. Results show that fast broadcast data movement abilities in a prototypical linear algebra core can achieve up to 75× better power and up to 10× better area efficiency compared to traditional SIMD architectures.
Palavras-chave:
Registers, Arrays, Symmetric matrices, Vectors, Hardware, Register-file, Broadcast bus, Matrix Multiply, Power efficiency, High performance computing
Publicado
24/10/2012
Como Citar
PEDRAM, Ardavan; GERSTLAUER, Andreas; GEIJN, Robert A. van de.
On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 24. , 2012, Nova Iorque/EUA.
Anais [...].
Porto Alegre: Sociedade Brasileira de Computação,
2012
.
p. 19-26.
