Hardware-efficient convolution algorithms for CNN accelerators: A brief review

  • Ricardo Di Curzio Lera Universidade de São Paulo
  • Bruno de Carvalho Albertini Universidade de São Paulo


The Convolutional Neural Network (CNN) is a technology of vast importance in image processing and computer vision applications. The bottleneck of CNNs is the multidimensional convolution, which often demands accelerator hardware. The convolution algorithm an accelerator uses directly affects the ratio between speed increase and hardware resource consumption during scaling, a metric known as hardware efficiency. The lower this metric, the more power and area are spent on minor performance improvements. In this review, we analyze the potential for hardware efficiency of the proven algorithms currently used in convolutional layers: im2col convolution, which is used by most modern applications; Toom-Cook convolution; and FFT convolution. Our analysis reveals the inefficiency of im2col convolution with respect to hardware scaling and confirms the potential for hardware-efficient applications using Toom-Cook and FFT convolutions, each with its own caveats. Further, we identify possible hardware applications for these algorithms, which may be expanded upon in future work.
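As a rough illustration of the three algorithm families named above, the following minimal sketch computes the same 1-D, single-channel "valid" convolution (in the CNN sense, i.e. cross-correlation) in three ways. It is a simplification under stated assumptions, not an implementation from the paper: the function names are hypothetical, the input is one-dimensional, and the Toom-Cook variant uses the widely published F(2,3) transform matrices, which produce two outputs per four-sample tile with four multiplications instead of six.

```python
import numpy as np

def conv1d_im2col(x, w):
    """im2col: unfold every sliding window into a row, then do one matrix product."""
    k = len(w)
    n_out = len(x) - k + 1
    cols = np.stack([x[i:i + k] for i in range(n_out)])  # shape (n_out, k)
    return cols @ w

def conv1d_toom_cook_f23(x, w):
    """Toom-Cook (Winograd) F(2,3): 2 outputs per 4-sample tile, 4 multiplications."""
    # Assumption of this sketch: 3-tap kernel and an even-length input (>= 4).
    assert len(w) == 3 and len(x) >= 4 and len(x) % 2 == 0
    G = np.array([[1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0.0, 0.0, 1.0]])          # kernel transform
    Bt = np.array([[1.0, 0.0, -1.0, 0.0],
                   [0.0, 1.0, 1.0, 0.0],
                   [0.0, -1.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0, -1.0]])   # input transform
    At = np.array([[1.0, 1.0, 1.0, 0.0],
                   [0.0, 1.0, -1.0, -1.0]])  # output transform
    u = G @ w                                # transformed once, reused for all tiles
    tiles = []
    for i in range(0, len(x) - 2, 2):        # tiles overlap by 2 samples
        d = x[i:i + 4]
        tiles.append(At @ (u * (Bt @ d)))    # 4 element-wise multiplications
    return np.concatenate(tiles)

def conv1d_fft(x, w):
    """FFT: pointwise product in the frequency domain, then keep the valid part."""
    n, k = len(x), len(w)
    size = n + k - 1                         # length of the full linear convolution
    # Cross-correlation equals convolution with the reversed kernel.
    spectrum = np.fft.rfft(x, size) * np.fft.rfft(w[::-1], size)
    full = np.fft.irfft(spectrum, size)
    return full[k - 1:n]

x = np.arange(8.0)
w = np.array([1.0, 2.0, 3.0])
reference = np.correlate(x, w, mode="valid")
print(np.allclose(conv1d_im2col(x, w), reference))
print(np.allclose(conv1d_toom_cook_f23(x, w), reference))
print(np.allclose(conv1d_fft(x, w), reference))
```

The im2col variant spends one multiplication per kernel tap per output, while the other two trade multiplications for additions and transform overhead; how that trade-off translates into area and power is the hardware-efficiency question the review addresses.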

Keywords: convolutional neural network, CNN, hardware efficiency, TPU, im2col, Toom-Cook, Winograd, FFT, GEMM


LERA, Ricardo Di Curzio; ALBERTINI, Bruno de Carvalho. Hardware-efficient convolution algorithms for CNN accelerators: A brief review. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 20., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 86-96. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2023.233607.