Improving the Accuracy of High Performance BLAS Implementations Using Adaptive Blocked Algorithms

Matthew Badin; Paolo D'Alberto; Lubomir Bic; Michael Dillencourt; Alexandru Nicolau

Matthew Badin University of California Irvine
Paolo D'Alberto Yahoo, Inc.
Lubomir Bic University of California Irvine
Michael Dillencourt University of California Irvine
Alexandru Nicolau University of California Irvine

Abstract

Matrix multiplication is ubiquitous in scientific computing, and considerable effort has been spent on improving its performance. Once methods that make efficient use of the processor have been exhausted, methods that use fewer operations than the canonical matrix multiply must be explored. Combining the two yields a hybrid matrix multiply algorithm. Hybrid algorithms tend to be less accurate than the canonical matrix multiply implementation, leaving room for improvement. There are well-known techniques for improving accuracy, but they tend to be slow, and it is not immediately obvious how best to apply them to hybrid algorithms without lowering performance. Previous attempts have worked from the bottom of the hybrid algorithm, modifying the high-performance matrix multiply implementation. In contrast, the top-down approach presented here requires modifying neither the high-performance matrix multiply implementation at the bottom nor the asymptotically fast matrix multiply algorithm at the top. The three-level hybrid algorithm presented here not only achieves up to 10% better performance than the fastest high-performance matrix multiply, but is also more accurate.

Keywords: Kernel, Accuracy, Tiles, Computer architecture, Strips, USA Councils, Context, Recursive Matrix Multiply, Pairwise Summation, Hybrid Matrix Multiply
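
To make the idea of adding accuracy without touching the kernel concrete, the following is a minimal sketch and not the authors' algorithm. It assumes one plausible middle layer: the inner dimension is split into blocks, each block product is delegated to an unmodified kernel, and the partial products are combined by pairwise summation, which generally has a smaller worst-case rounding error bound than left-to-right accumulation. The names `naive_gemm`, `pairwise_sum`, `blocked_matmul`, and the `block` parameter are hypothetical; `naive_gemm` merely stands in for a high-performance BLAS GEMM.

```python
from typing import Callable, List

Matrix = List[List[float]]


def naive_gemm(a: Matrix, b: Matrix) -> Matrix:
    """Hypothetical stand-in for the high-performance kernel at the bottom."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def mat_add(x: Matrix, y: Matrix) -> Matrix:
    """Element-wise sum of two matrices of the same shape."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def pairwise_sum(terms: List[Matrix]) -> Matrix:
    """Sum matrices as a balanced binary tree instead of left to right."""
    if len(terms) == 1:
        return terms[0]
    mid = len(terms) // 2
    return mat_add(pairwise_sum(terms[:mid]), pairwise_sum(terms[mid:]))


def blocked_matmul(
    a: Matrix,
    b: Matrix,
    block: int = 64,
    kernel: Callable[[Matrix, Matrix], Matrix] = naive_gemm,
) -> Matrix:
    """Compute a @ b by splitting the inner dimension into blocks.

    The kernel is called unchanged on each panel pair; only the way the
    partial products are accumulated differs from a plain kernel call.
    """
    k = len(b)
    if block < 1:
        raise ValueError("block must be a positive integer")
    if k == 0 or any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b must match and be nonzero")
    partials = []
    for start in range(0, k, block):
        stop = min(start + block, k)
        a_panel = [row[start:stop] for row in a]  # columns start..stop of a
        b_panel = b[start:stop]                   # rows start..stop of b
        partials.append(kernel(a_panel, b_panel))
    return pairwise_sum(partials)
```

Because the kernel is passed in as a parameter, the same wrapper could sit on top of any GEMM routine, and it could itself be called from a recursive fast algorithm above it. The block size here is an arbitrary illustrative default, not a tuned value.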