Runtime Vectorization of Conditional Code and Dynamic Range Loops to ARM NEON Engine
SIMD engines are widely present in commercial processors and aim to improve application performance by exploiting Data Level Parallelism (DLP). However, most SIMD engines rely on specific libraries and compilers to support DLP execution, which limits DLP gains because these tools can only analyze static code. The Dynamic SIMD Assembler (DSA) exploits DLP at runtime by identifying vectorizable loops and generating ARM NEON SIMD instructions for them. However, its DLP coverage is incomplete, since portions of code that depend on runtime information, such as dynamic range loops and conditional code loops, are not vectorized. In this work, we extend the DSA coverage by coupling conditional code and dynamic range loop vectorization. Results show that the proposed techniques improve the original DSA performance by 38% on benchmarks with opportunities to exploit conditional code and dynamic range loops. In addition, the Extended DSA preserves software productivity and binary compatibility while outperforming ARM compiler auto-vectorization by 12%.
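To make the two loop classes concrete, the following is a minimal source-level sketch in portable C, not the DSA mechanism itself (DSA operates on the running binary). The function names, the clamp operation, and the lane count of four 32-bit elements are assumptions chosen for illustration. The first function is a scalar loop whose body contains a data-dependent branch (conditional code) and whose trip count `n` is known only at runtime (dynamic range). The second shows the usual shape of a vectorized equivalent: the branch is replaced by a per-lane mask and select, in the spirit of a SIMD bitwise select, and the unknown trip count is handled by a main loop over full lane groups plus a scalar remainder loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4 /* assumption: four 32-bit elements per 128-bit vector */

/* Scalar reference: data-dependent branch, trip count known only at runtime. */
void clamp_scalar(int32_t *dst, const int32_t *src, size_t n, int32_t limit) {
    for (size_t i = 0; i < n; i++) {
        if (src[i] > limit)
            dst[i] = limit;
        else
            dst[i] = src[i];
    }
}

/* Vector-shaped equivalent, emulated lane by lane in plain C. */
void clamp_blocked(int32_t *dst, const int32_t *src, size_t n, int32_t limit) {
    assert(dst != NULL && src != NULL);
    size_t i = 0;

    /* Main loop: only full groups of LANES elements. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t l = 0; l < LANES; l++) {
            int32_t x = src[i + l];
            /* Per-lane mask: all ones where the condition holds, else zero. */
            uint32_t mask = (x > limit) ? UINT32_MAX : 0u;
            /* Select without branching: take limit where mask is set, x elsewhere.
               The conversion back to int32_t assumes two's complement. */
            dst[i + l] = (int32_t)(((uint32_t)limit & mask) |
                                   ((uint32_t)x & ~mask));
        }
    }

    /* Remainder loop: the n % LANES leftover elements run as scalar code. */
    for (; i < n; i++)
        dst[i] = (src[i] > limit) ? limit : src[i];
}
```

Both functions produce the same output for any `n`; the blocked version simply exposes the structure that a SIMD engine can execute one lane group at a time.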