Simulated Annealing em Hardware com Múltiplas Threads em Pipeline para Posicionamento em CGRAs

Jeronimo Costa Penha; Lucas Bragança da Silva; Michael Canesche; José Augusto M. Nacif; Ricardo Ferreira

doi:10.5753/wscad.2022.226381

Jeronimo Costa Penha UFV / CEFET-MG
Lucas Bragança da Silva UFV
Michael Canesche UFMG
José Augusto M. Nacif UFV
Ricardo Ferreira UFV

DOI: https://doi.org/10.5753/wscad.2022.226381

Resumo

O uso de aceleradores com paralelismo espacial, como os CGRAs, são soluções promissoras em desempenho e eficiência energética. O desempenho dos CGRAs dependem dos compiladores para explorar o paralelismo das aplicações, sendo o mapeamento da aplicação um dos grandes desafios. A primeira etapa deste processo é o posicionamento, cuja eficiência impacta diretamente nos passos seguintes que são o roteamento e o escalonamento. Este trabalho apresenta uma implementação em hardware, usando field-programmable gate arrays (FPGA), para o algoritmo Simulated Annealing (SA). Os resultados mostram uma aceleração de 7 a 30 vezes em relação ao estado da arte sem sacrificar a qualidade da solução, podendo ser de 70 a 300 vezes mais rápido com o uso de múltiplas unidades de posicionamento. O algoritmo foi implementado em pipeline com múltiplas threads para esconder a latência, onde uma iteração completa do SA executa em apenas dois ciclos de relógio do FPGA.

Referências

Canesche, M., Menezes, M., Carvalho, W., Torres, F. S., Jamieson, P., Nacif, J. A., and Ferreira, R. (2020). Traversal: A fast and adaptive graph-based placement and routing for cgras. IEEE Trans on Computer-Aided Design of Integrated Circuits and Systems.

Carvalho, W., Canesche, M., Reis, L., Torres, F., Silva, L., Jamieson, P., Nacif, J., and Ferreira, R. (2020). A design exploration of scalable mesh-based fully pipelined accelerators. In International Conference on Field-Programmable Technology (ICFPT).

Da Silva, M. V., Ferreira, R., Garcia, A., and Cardoso, J. M. P. (2006). Mesh mapping exploration for coarse-grained reconfigurable array architectures. In IEEE Int Conf on Reconfigurable Computing and FPGA's (ReConFig 2006).

Dhar, S. and Pan, D. Z. (2018). GDP: GPU accelerated detailed placement. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pages 1-7. IEEE.

Donovick, C., Mann, M., Barrett, C., and Hanrahan, P. (2019). Agile SMT-based mapping for CGRAs with restricted routing networks. In Int. Conf. on ReConFigurable Computing and FPGAs (ReConFig). IEEE.

Ferreira, R., Duarte, V., Pereira, M., Carro, L., and Wong, S. (2013). A just-in-time modulo scheduling for virtual coarse-grained reconfigurable architectures. In Int Conf on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS).

Fobel, C., Grewal, G., and Stacey, D. (2014). A scalable, serially-equivalent, high-quality parallel placement methodology suitable for modern multicore and gpu architectures. In International Conference on Field Programmable Logic and Applications (FPL).

Formigoni, R. E., Ferreira, R. S., and Nacif, J. A. M. (2019). Ropper: A placement and routing framework for field-coupled nanotechnologies. In Symposium on Integrated Circuits and Systems Design (SBCCI).

Kong, H., Feng, L., Deng, C., Yuan, B., and Hu, J. (2020). How much does regularity help fpga placement? In Int Conf on Field-Programmable Technology (ICFPT).

Li, Z., Wu, D., Wijerathne, D., and Mitra, T. (2022). Lisa: Graph neural network based portable mapping on spatial accelerators. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 444-459. IEEE.

Liu, D., Yin, S., Luo, G., Shang, J., Liu, L., Wei, S., Feng, Y., and Zhou, S. (2018). Dataflow graph mapping optimization for CGRA with deep reinforcement learning. IEEE Trans on Computer-Aided Design of Integrated Circuits and Systems, 38(12).

Mulpuri, C. and Hauck, S. (2001). Runtime and quality tradeoffs in FPGA placement and routing. In Proceedings of the 2001 ACM/SIGDA ninth international symposium on Field programmable gate arrays, pages 29-36.

Murray, K. E., Petelin, O., Zhong, S., Wang, J. M., Eldafrawy, M., Legault, J.-P., Sha, E., Graham, A. G., Wu, J., Walker, M. J., et al. (2020). VTR 8: High-performance CAD and Customizable FPGA architecture modelling. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 13(2):1-55.

Nowatzki, T., Ardalani, N., Sankaralingam, K., and Weng, J. (2018). Hybrid optimization/heuristic instruction scheduling for programmable accelerator codesign. In Int Conf on Parallel Architectures and Compilation Techniques, pages 1-15.

Takamaeda, S. (2015). A high-level hardware design environment in python. IEICE Technical Report; IEICE Tech. Rep., 115(228):21-26.

Vieira, M., Canesche, M., Bragança, L., Campos, J., Silva, M., Ferreira, R., and Nacif, J. A. (2021). Reshape: A run-time dataflow hardware-based mapping for cgra overlays. In IEEE Int Symp on Circuits and Systems (ISCAS).

Walker, M. J. and Anderson, J. H. (2019). Generic connectivity-based CGRA mapping via integer linear programming. In Field-Programmable Custom Computing Machines (FCCM).

Wang, C. C. and Lemieux, G. G. (2011). Scalable and deterministic timing-driven parallel placement for FPGAs. In Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays, pages 153-162.

Weng, J., Liu, S., Kupsh, D., and Nowatzki, T. (2022). Unifying spatial accelerator compilation with idiomatic and modular transformations. IEEE Micro, (01):1-12.

Wrighton, M. G. and DeHon, A. M. (2003). Hardware-assisted simulated annealing with application for fast FPGA placement. In Proceedings of the 2003 ACM/SIGDA eleventh international symposium on Field programmable gate arrays, pages 33-42.

Xilinx, D. (2020). Ultrascale architecture and product data sheet: Overview. Product Specification, Nov.