Convolution Operators for Deep Learning Inference on the Fujitsu A64FX Processor

  • Manuel F. Dolz Universitat Jaume I de Castellón
  • Héctor Martínez Universidad de Córdoba
  • Pedro Alonso Universitat Politècnica de València
  • Enrique S. Quintana-Ortí Universitat Politècnica de València

Abstract

The convolution operator is a crucial kernel for many computer vision and signal processing applications that rely on deep learning (DL) technologies. As such, the efficient implementation of this operator has received considerable attention in the past few years across a wide range of processor architectures. In this paper, we follow the technology trend toward integrating long SIMD (single instruction, multiple data) arithmetic units into high-performance multicore processors to analyse the benefits of this type of hardware acceleration for latency-constrained DL workloads. For this purpose, we implement and optimise, for the Fujitsu A64FX processor, three distinct methods for computing the convolution: the lowering approach, a blocked variant of the direct convolution algorithm, and the Winograd minimal filtering algorithm. Our experimental results include an extensive evaluation of the parallel scalability of these three methods and a comparison of their global performance using three popular DL models and a representative dataset.
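Of the three methods the abstract names, the lowering approach is the simplest to illustrate: the input is reorganised into a patch matrix (often called im2col) so that the convolution reduces to a single general matrix multiplication (GEMM), which is where long SIMD units pay off. The sketch below, a minimal single-channel illustration in NumPy (the function name and shapes are our own, not from the paper), shows the idea:

```python
import numpy as np

def conv2d_lowering(x, w):
    """Valid (no padding, stride 1) convolution of a 2-D input x with a
    2-D filter w, computed as cross-correlation (the DL convention),
    via the lowering (im2col + GEMM) approach."""
    H, W = x.shape
    kh, kw = w.shape
    oh, ow = H - kh + 1, W - kw + 1
    # im2col: copy each kh-by-kw input patch into one row of a matrix.
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    # The convolution is now one GEMM (a matrix-vector product for a
    # single filter; with multiple filters, w becomes a matrix).
    return (cols @ w.ravel()).reshape(oh, ow)
```

The price of this simplicity is the memory and bandwidth overhead of materialising the patch matrix, which replicates input entries; the direct and Winograd algorithms evaluated in the paper avoid that replication in different ways.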
Published
2022-11-02
How to Cite
DOLZ, Manuel F. et al. Convolution Operators for Deep Learning Inference on the Fujitsu A64FX Processor. Proceedings of the International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), [S.l.], p. 1-10, Nov. 2022. ISSN 0000-0000. Available at: <https://sol.sbc.org.br/index.php/sbac-pad/article/view/28243>. Accessed: 17 May 2024.