On the SPEC-CPU 2017 opportunities for dynamic vectorization possibilities on PIM architectures
Abstract
Processing-In-Memory (PIM) devices usually implement vector instructions to make efficient use of the large main memory bandwidth. One way to vectorize applications for such PIM systems is to convert CPU instructions into PIM vector instructions dynamically. In this work, we study the feasibility of this dynamic conversion for the Vector-In-Memory Architecture (VIMA). Our results show that 24% of the loops from some SPEC-CPU 2017 applications are suitable for this conversion. Furthermore, we conclude that dynamic conversion mechanisms must be able to deal efficiently with memory access conflicts, a problem present in 99% of all possible conversions to VIMA.