Design Exploration of Machine Learning Data-Flows onto Heterogeneous Reconfigurable Hardware

Westerley Oliveira; Michael Canesche; Lucas Reis; José Nacif; Ricardo Ferreira

doi:10.5753/wscad.2020.14063

Westerley Oliveira UFV
Michael Canesche UFV
Lucas Reis UFV
José Nacif UFV
Ricardo Ferreira UFV

Abstract

Machine and deep learning applications are currently the center of attention in both industry and academia, making the acceleration of these applications a highly relevant research topic. Acceleration comes in different flavors, including parallelizing routines on a GPU, FPGA, or CGRA. In this work, we explore the placement and routing (P&R) of the dataflow graphs of machine learning applications onto three heterogeneous CGRA architectures. We compare our results with the homogeneous case and with a state-of-the-art P&R tool. Our algorithm executed, on average, 52% faster than Versatile Place and Route (VPR) 8.1. Furthermore, a heterogeneous architecture reduces the cost without losing performance in 76% of the cases.

References

Browning, S. A. (1980). The tree machine: A highly concurrent computing environment.

Chen, T., Du, Z., Sun, N., Wang, J., Wu, C., Chen, Y., and Temam, O. (2014). Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Computer Architecture News, 42(1):269–284.

Chin, S. A., Niu, K. P., Walker, M., Yin, S., Mertens, A., Lee, J., and Anderson, J. H. (2018). Architecture exploration of standard-cell and fpga-overlay cgras using the open-source cgra-me framework. In International Symposium on Physical Design.

Fontes, G., Silva, P., Nacif, J., Vilela, O., and Ferreira, R. (2018). Placement and routing by overlapping and merging qca gates. In Int Symp on Circuits and Systems (ISCAS).

Jo, J., Cha, S., Rho, D., and Park, I.-C. (2017). Dsip: A scalable inference accelerator for convolutional neural networks. IEEE Journal of Solid-State Circuits, 53(2):605–618.

Jo, J., Kim, S., and Park, I.-C. (2018). Energy-efficient convolution architecture based on rescheduled dataflow. IEEE Transactions on Circuits and Systems I, 65(12).

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems.

Lee, S.-K. and Choi, H.-A. (1996). Embedding of complete binary trees into meshes with row-column routing. IEEE Trans on Parallel and Distributed Systems, 7(5).

Liu, D., Yin, S., Luo, G., Shang, J., Liu, L., Wei, S., Feng, Y., and Zhou, S. (2018). Dataflow graph mapping optimization for cgra with deep reinforcement learning. IEEE Trans on Computer-Aided Design of Integrated Circuits and Systems, 38(12).

Liu, L., Zhu, J., Li, Z., Lu, Y., Deng, Y., Han, J., Yin, S., and Wei, S. (2019). A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications. ACM Computing Surveys (CSUR), 52(6):1–39.

Liu, Z.-G., Whatmough, P. N., and Mattina, M. (2020). Systolic tensor array: An efficient structured-sparse gemm accelerator for mobile cnn inference. IEEE Computer Architecture Letters, 19(1):34–37.

Luo, Z. and Lee, R. B. (2000). Cost-effective multiplication with enhanced adders for multimedia applications. In Int Symp on Circuits and Systems (ISCAS). IEEE.

Mei, B., Vernalde, S., Verkest, D., De Man, H., and Lauwereins, R. (2003). Adres: An architecture with tightly coupled vliw processor and coarse-grained reconfigurable matrix. In International Conference on Field Programmable Logic and Applications.

Moreano, N., Borin, E., De Souza, C., and Araujo, G. (2005). Efficient datapath merging for partially reconfigurable architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(7):969–980.

Murray, K. E., Petelin, O., Zhong, S., Wang, J. M., ElDafrawy, M., Legault, J.-P., Sha, E., Graham, A. G., Wu, J., Walker, M. J. P., Zeng, H., Patros, P., Luu, J., Kent, K. B., and Betz, V. (2020). Vtr 8: High performance cad and customizable fpga architecture modelling. ACM Trans. Reconfigurable Technol. Syst.

Nowatzki, T., Ardalani, N., Sankaralingam, K., and Weng, J. (2018). Hybrid optimization/heuristic instruction scheduling for programmable accelerator codesign. In Int Conf on Parallel Architectures and Compilation Techniques (PACT).

Park, H., Fan, K., Mahlke, S. A., Oh, T., Kim, H., and Kim, H.-s. (2008). Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In Int Conf on Parallel Architectures and Compilation Techniques (PACT).

Qin, E., Samajdar, A., Kwon, H., Nadella, V., Srinivasan, S., Das, D., Kaul, B., and Krishna, T. (2020). Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training. In Int Symp on High Performance Computer Architecture (HPCA).

Silva, M., Ferreira, R., Garcia, A., and Cardoso, J. (2006). Mesh mapping exploration for coarse-grained reconfigurable array architectures. In Int Conf on Reconfigurable Computing and FPGA's (ReConFig).

Wei, X., Yu, C. H., Zhang, P., Chen, Y., Wang, Y., Hu, H., Liang, Y., and Cong, J. (2017). Automated systolic array architecture synthesis for high throughput cnn inference on fpgas. In Design Automation Conference (DAC).

Weng, J., Liu, S., Dadu, V., Wang, Z., Shah, P., and Nowatzki, T. (2020). Dsagen: Synthesizing programmable spatial accelerators. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 268–281. IEEE.