Exploiting Processor Heterogeneity to Improve Throughput and Reduce Latency for Deep Neural Network Inference
Abstract
The growing popularity of Deep Neural Networks (DNNs) in domains such as computer vision, natural language processing, and predictive analytics has led to increasing demand for computing resources. Graphics Processing Units (GPUs) are widely used for both training and inference of DNNs, but relying on GPUs exclusively can quickly saturate them while CPU resources remain underutilized. This paper presents a performance evaluation of a solution that exploits processor heterogeneity by combining the computational power of GPUs and CPUs. The solution partitions a DNN model across the different computational resources and distributes the load so as to optimize their utilization, transferring part of the load from the GPUs to the CPUs when necessary to reduce latency and increase throughput. The partitioning is performed with METIS, which balances the computational load across the available resources while minimizing the communication between them. Experimental results show improved latency and throughput for several DNN models. Potential applications include real-time processing systems such as autonomous vehicles, drones, and video surveillance, where minimizing latency and maximizing throughput are critical.
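The abstract itself contains no code; the following is a minimal, illustrative sketch of the general technique it describes: balanced partitioning of a DNN layer graph with METIS, here via the PyMetis bindings. The toy layer graph, the per-layer cost estimates, and the two-way CPU/GPU split are all hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch: partition a small DNN layer graph into two parts
# (e.g., one for the GPU, one for the CPU) with METIS via PyMetis,
# balancing total vertex weight while minimizing cut edges.
import pymetis

# Toy layer graph as an undirected adjacency list (each edge listed twice).
# Imagine a small chain 0-1-2-3 with a skip connection 1-3.
adjacency = [
    [1],         # layer 0 connects to layer 1
    [0, 2, 3],   # layer 1 connects to layers 2 and 3 (skip connection)
    [1, 3],
    [1, 2],
]

# Vertex weights: illustrative per-layer compute cost estimates (e.g., FLOPs).
vweights = [4, 8, 8, 2]

# Two-way partition; METIS balances vertex weight across parts while
# minimizing the number of cut edges, a proxy for inter-device transfers.
n_cuts, membership = pymetis.part_graph(2, adjacency=adjacency, vweights=vweights)

print("edge cut:", n_cuts)              # communication proxy
print("layer -> device:", membership)   # part id (0 or 1) per layer
```

Note that the underlying METIS API also accepts target partition weights (tpwgts), which in principle allows unequal part sizes reflecting the relative speeds of heterogeneous devices; how the paper handles this is not stated in the abstract.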
Keywords:
Training, Computational modeling, Graphics processing units, Artificial neural networks, Throughput, Video surveillance, Real-time systems, Vehicle dynamics, Predictive analytics, Load modeling, Dynamic scheduling, Graph partitioning, Heterogeneous computing, Latency optimization
Published
November 13, 2024
How to Cite
BEAUMONT, Olivier; DAVID, Jean-François; EYRAUD-DUBOIS, Lionel; THIBAULT, Samuel. Exploiting Processor Heterogeneity to Improve Throughput and Reduce Latency for Deep Neural Network Inference. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 36., 2024, Hilo/Hawaii. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 37-48.