Mixed precision applied on common mathematical procedures over GPU
Abstract
Approximate Computing is a paradigm that researchers have adopted as an alternative as hardware performance gains slow in the ongoing race for computational throughput in HPC. Precision reduction and mixed precision are the most studied of the existing techniques. In addition, some NVIDIA GPUs include Tensor Cores, which speed up certain classes of algorithms, such as matrix multiplication. This study applies Approximate Computing techniques, such as mixed precision, to matrix multiplication and stencil algorithms using OpenACC directives and the cuTENSOR library, and analyzes performance gains versus accuracy losses. The results show a speedup of 16.60× for the optimized matrix multiplication provided by the matmul intrinsic function using 16-bit floating-point data with Tensor Cores, compared to a naive version using 64-bit floating-point data. In this case, the accuracy loss ranged from approximately 10^-26 to 10^-1. For the stencil algorithm, reducing variable precision from 64-bit to 16-bit floating point alone yielded a speedup of 1.60×, with accuracy loss ranging from 0 to 10^-9 over 300 iterations.
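The accuracy side of this trade-off can be illustrated with a small CPU-only sketch. It is not the OpenACC, Fortran, or cuTENSOR code used in the study, and it measures no speedup. It only emulates 16-bit storage and arithmetic by rounding every intermediate value to binary16, then compares the result with a 64-bit reference. All names (`to_half`, `matmul`, `stencil`, `max_abs_error`) and the 1D three-point averaging stencil are assumptions chosen for illustration.

```python
import struct


def to_half(x):
    """Round a 64-bit Python float to the nearest 16-bit (binary16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def identity(x):
    """No rounding: keeps the computation in 64-bit precision."""
    return x


def matmul(a, b, rnd=identity):
    """Naive triple-loop product; rnd is applied to inputs and partial results."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                prod = rnd(rnd(a[i][p]) * rnd(b[p][j]))
                acc = rnd(acc + prod)
            c[i][j] = acc
    return c


def stencil(u, steps, rnd=identity):
    """Hypothetical 1D three-point averaging stencil with fixed end points."""
    u = [rnd(v) for v in u]
    for _ in range(steps):
        nxt = u[:]
        for i in range(1, len(u) - 1):
            s = rnd(rnd(u[i - 1] + u[i]) + u[i + 1])
            nxt[i] = rnd(s / 3.0)
        u = nxt
    return u


def max_abs_error(xs, ys):
    """Largest absolute difference between two flat sequences."""
    return max(abs(x - y) for x, y in zip(xs, ys))


def flat(matrix):
    """Flatten a list of rows into a single list."""
    return [v for row in matrix for v in row]


if __name__ == "__main__":
    a = [[0.1 * (i + j + 1) for j in range(4)] for i in range(4)]
    err_mm = max_abs_error(flat(matmul(a, a)), flat(matmul(a, a, to_half)))
    u0 = [0.0, 0.0, 1.0, 0.0, 0.0]
    err_st = max_abs_error(stencil(u0, 10), stencil(u0, 10, to_half))
    print("matmul max abs error:", err_mm)
    print("stencil max abs error:", err_st)
```

Passing the rounding function as a parameter keeps the 64-bit and 16-bit variants on the same code path, so any difference in the output comes from precision alone.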