Autotuning GPU Compiler Parameters Using OpenTuner

  • Pedro Bruel USP
  • Marcos Amarís USP
  • Alfredo Goldman USP

Resumo


In this paper we implement an autotuner for the compilation flags of GPU algorithms using the OpenTuner framework. An autotuner is a program that finds a combination of algorithms, or a configuration of an algorithm, that optimizes the solution of a given problem instance or set of instances. We analyse the performance gained after autotuning compilation flags for parallel algorithms in three GPU devices, and show that it is possible to improve upon the high-level optimizations of the CUDA compiler. One of the experimental settings achieved a 30% speedup.

Referências

P. H. Ha, P. Tsigas, and O. J. Anshus, "The Synchronization Power of Coalesced Memory Accesses." in DISC, ser. Lecture Notes in Computer Science, G. Taubenfeld, Ed., vol. 5218. Springer, 2008, pp. 320–334.

J. R. Rice, "The algorithm selection problem," 1976.

J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel, "Optimizing matrix multiply using phipac: A portable, high-performance, ansi c coding methodology," in ACM International Conference on Supercomputing 25th Anniversary Volume. New York, NY, USA: ACM, 2014, pp. 253–260.

R. C. Whaley and J. J. Dongarra, "Automatically tuned linear algebra software," in Proceedings of the 1998 ACM/IEEE Conference on Supercomputing, ser. SC '98. Washington, DC, USA: IEEE Computer Society, 1998, pp. 1–27.

R. Vuduc, J. W. Demmel, and K. A. Yelick, "Oski: A library of automatically tuned sparse matrix kernels," in Journal of Physics: Conference Series, vol. 16, no. 1. IOP Publishing, 2005, p. 521.

H. Jordan, P. Thoman, J. J. Durillo, S. Pellegrini, P. Gschwandtner, T. Fahringer, and H. Moritsch, "A multi-objective auto-tuning framework for parallel codes," in High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for. IEEE, 2012, pp. 1–12.

J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O'Reilly, and S. Amarasinghe, "Opentuner: An extensible framework for program autotuning," in Proceedings of the 23rd international conference on Parallel architectures and compilation. ACM, 2014, pp. 303–316.

F. Hutter, H. H. Hoos, K. Leyton-Brown, and T. Stützle, "Paramils: an automatic algorithm conguration framework," Journal of Articial Intelligence Research, vol. 36, no. 1, pp. 267–306, 2009.

M. Frigo and S. G. Johnson, "Fftw: An adaptive software architecture for the fft," in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, vol. 3. IEEE, 1998, pp. 1381–1384.

J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe, "Petabricks: A language and compiler for algorithmic choice," SIGPLAN Not., vol. 44, no. 6, pp. 38–49, Jun. 2009.

J. Bosboom, S. Rajadurai, W.-F. Wong, and S. Amarasinghe, "Streamjit: a commensal compiler for high-performance stream programming," in Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications. ACM, 2014, pp. 177–195.

D. Eliahu, O. Spillinger, A. Fox, and J. Demmel, "Frpa: A framework for recursive parallel algorithms," Master's thesis, EECS Department, University of California, Berkeley, May 2015.

Y. Zhang and J. Owens, "A quantitative performance analysis model for GPU architectures," in High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, Feb 2011, pp. 382–393.

S. Liu, C. Eisenbeis, and J.-L. Gaudiot, "Speculative execution on gpu: An exploratory study," in Parallel Processing (ICPP), 2010 39th International Conference on, Sept 2010, pp. 453–461.

J. Nickolls and W. J. Dally, "The gpu computing era," IEEE micro, vol. 30, no. 2, pp. 56–69, 2010.

D. Sampaio, R. M. d. Souza, S. Collange, and F. M. Q. a. Pereira, "Divergence analysis," ACM Transactions on Programming Languages and Systems, vol. 35, no. 4, pp. 13:1–13:36, Jan. 2014.

S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W.-m. W. Hwu, "An adaptive performance modeling tool for gpu architectures," SIGPLAN Not., vol. 45, no. 5, pp. 105–114, Jan. 2010.

J. S. Kirtzic, "A Parallel Algorithm Design Model for the GPU Architecture," Ph.D. dissertation, Richardson, TX, USA, 2012, aAI3547670.

T. Chen and Y.-K. Chen, "Challenges and opportunities of obtaining performance from multi-core cpus and many-core gpus," in Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, April 2009, pp. 613–616.

P. Guo and L. Wang, "Auto-tuning cuda parameters for sparse matrix-vector multiplication on gpus," in Computational and Information Sciences (ICCIS), 2010 International Conference on, Dec 2010, pp. 1154–1157.

Y. Li, J. Dongarra, and S. Tomov, "A note on auto-tuning gemm for gpus," in Computational Science–ICCS 2009. Springer, 2009, pp. 884–892.

S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, "Auto-tuning a high-level language targeted to gpu codes," in Innovative Parallel Computing (InPar), 2012, May 2012, pp. 1–10.

J. L. Bates and R. L. Constable, "Proofs As Programs," ACM Transactions on Programming Languages and Systems, vol. 7, no. 1, pp. 113–136, Jan. 1985.

C. E. R. Alves, E. Cáceres, and S. W. Song, "BSP/CGM Algorithms for Maximum Subsequence and Maximum Subarray." in PVM/MPI, ser. Lecture Notes in Computer Science, D. Kranzlmüller, P. Kacsuk, and J. Dongarra, Eds., vol. 3241. Springer, 2004, pp. 139–146.

C. Silva, S. Song, and R. Camargo, "A parallel maximum subarray algorithm on gpus," in 5th Workshop on Applications for Multi-Core Architectures (WAMCA 2014). IEEE Int. Symp. on Computer Architecture and High Performance Computing Workshops, Paris, 2014, pp. 12–17.

H. H. Hoos, "Programming by optimization," Communications of the ACM, vol. 55, no. 2, pp. 70–80, 2012.
Publicado
18/10/2015
BRUEL, Pedro; AMARÍS, Marcos; GOLDMAN, Alfredo. Autotuning GPU Compiler Parameters Using OpenTuner. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 16. , 2015, Florianópolis. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2015 . p. 13-23. DOI: https://doi.org/10.5753/wscad.2015.14268.