Improving Direct Convolution through Tensor Slicing, Vectorized Packing and ISA Extensions

  • Victor Ferrari (UNICAMP)
  • Guido Araujo (UNICAMP)

Abstract


Convolution is one of the most computationally intensive operations in machine learning models, and it is usually computed with the traditional Im2Col + BLAS method. This work describes SConv: a novel direct-convolution algorithm that improves upon Im2Col + BLAS by introducing compile-time and execution-time components to tile, vectorize, and optimize the computation. For end-to-end machine-learning model inference, SConv's speed-up over an Im2Col + BLAS method based on current BLAS implementations ranges from 11% to 27% on Intel x86 and from 11% to 34% on IBM POWER10 architectures. The total convolution speed-up for model inference is 13% to 28% on Intel x86 and 23% to 39% on IBM POWER10. SConv also outperforms oneDNN in 6 out of 7 models.
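To make the baseline concrete, the sketch below shows the classic Im2Col + GEMM approach that SConv improves upon: input patches are packed into a matrix so the whole convolution reduces to a single matrix multiplication (the BLAS call). This is a minimal NumPy illustration of the general technique, not code from the paper; function names, shapes (single image, unit stride, no padding), and the use of `@` in place of a tuned BLAS GEMM are assumptions for clarity.

```python
import numpy as np

def im2col(x, kh, kw):
    """Pack all kh-by-kw patches of x (shape C, H, W) into a
    (C*kh*kw, out_h*out_w) matrix. This packing step is the main
    memory overhead that direct-convolution schemes like SConv avoid."""
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w))
    row = 0
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                # Each row holds one (channel, kernel-offset) slice,
                # flattened over all output positions.
                cols[row] = x[c, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols

def conv_im2col(x, w):
    """Convolve x (C, H, W) with filters w (F, C, kh, kw) via
    Im2Col + one GEMM; returns (F, out_h, out_w)."""
    F, C, kh, kw = w.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    cols = im2col(x, kh, kw)           # packing step
    out = w.reshape(F, -1) @ cols      # single GEMM over packed data
    return out.reshape(F, out_h, out_w)
```

The attraction of this formulation is that all the arithmetic lands in one highly tuned GEMM; its cost is the materialized `cols` matrix, which duplicates each input element up to kh*kw times.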

Published
2024-07-21
FERRARI, Victor; ARAUJO, Guido. Improving Direct Convolution through Tensor Slicing, Vectorized Packing and ISA Extensions. In: CONCURSO DE TESES E DISSERTAÇÕES (CTD), 37., 2024, Brasília/DF. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 148-157. ISSN 2763-8820. DOI: https://doi.org/10.5753/ctd.2024.2901.