Performance Evaluation of Compiler Optimizations in FPGA Accelerators

  • Gustavo Leite Universidade Estadual Paulista
  • Alexandro Baldassin UNESP-IGCE
  • Guido Araujo State University of Campinas
  • José Nelson Amaral University of Alberta

Abstract


As the power wall increasingly constrains microprocessor design, engineers have shifted their attention to heterogeneous architectures, in which several classes of devices are used for computation. Among these devices are FPGAs, which offer performance comparable to CPUs while consuming only a fraction of the energy. Despite the growing interest in these devices, programmability and performance engineering on FPGAs remain hard. This work presents an evaluation of the most prominent code transformations targeting FPGAs. More specifically, it studies the performance effect of unrolling loops, replicating compute units, and transferring data using DMA for a matrix multiplication OpenCL kernel running on an Intel® FPGA. The results indicate that these optimizations can achieve speedups of up to 3.78× for the matrix multiplication application, and a 412.5× speedup in data transfer.
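To make the first of these transformations concrete, the following is a minimal sketch of loop unrolling applied to the inner loop of a matrix multiplication. It is written in plain C rather than OpenCL and is not the kernel evaluated in the paper. The matrix size `N`, the unroll factor `UNROLL`, and both function names are assumptions chosen for illustration. In the Intel FPGA SDK for OpenCL, the same effect is normally requested with `#pragma unroll` on the loop instead of rewriting it by hand.

```c
#include <stddef.h>

#define N 8        /* hypothetical matrix dimension; must be a multiple of UNROLL */
#define UNROLL 4   /* hypothetical unroll factor */

/* Baseline: C = A * B for row-major N x N matrices, rolled inner loop. */
void matmul_rolled(const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; k++)
                acc += A[i * N + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
    }
}

/* Same computation with the inner loop unrolled by a factor of 4 by hand.
 * Each iteration now performs four multiply-adds, which a hardware
 * compiler may map to replicated arithmetic units. */
void matmul_unrolled(const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; k += UNROLL) {
                acc += A[i * N + k]     * B[k * N + j];
                acc += A[i * N + k + 1] * B[(k + 1) * N + j];
                acc += A[i * N + k + 2] * B[(k + 2) * N + j];
                acc += A[i * N + k + 3] * B[(k + 3) * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```

Both functions compute the same product; the unrolled version only exposes more independent work per loop iteration. On an FPGA this typically trades additional logic and memory bandwidth for fewer loop iterations, so the best unroll factor depends on the device and the kernel.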

Published
08/11/2019
LEITE, Gustavo; BALDASSIN, Alexandro; ARAUJO, Guido; AMARAL, José Nelson. Performance Evaluation of Compiler Optimizations in FPGA Accelerators. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 20., 2019, Campo Grande. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 346-357. DOI: https://doi.org/10.5753/wscad.2019.8681.