Exploring Heterogeneous Task-Level Parallelism in a BMA Video Coding Application using System-Level Simulation
High-abstraction-level models can be used in system-level simulation to allow rapid evaluation of architectural aspects in early Design Space Exploration (DSE) and to guide development decisions. Further, early DSE is of paramount importance in the specification of future Embedded Systems (ES) and in their evaluation for applications with high computing demands and energy restrictions. This paper presents the exploration of Heterogeneous Task-Level Parallelism (HTLP) in a Block-Matching Algorithm (BMA) video coding application. HTLP means the creation and simultaneous execution of threads of kernels defined for different types of Processing Elements (PE) - e.g., CPU and GPU - but all serving the same purpose. We employ a BMA implementation as a case study and use its characteristics to explore HTLP - in particular, its kernels for data preparation, Sum of Absolute Differences (SAD) criterion calculation, and SAD value grouping. For the exploration, a system-level simulation framework (SAVE-htlp) is extended to support HTLP. In the experiments, SAVE-htlp simulates workload and architecture models and explores 22 settings that vary the PE type employed during task execution and the number of concurrent threads for each kernel. Execution time, performance, energy, and power results show HTLP settings outperforming both CPU-only and GPU-only settings.
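For readers unfamiliar with block matching, the SAD criterion named above can be illustrated with a minimal sequential sketch. It is not the implementation evaluated in this paper and involves no CPU/GPU threading; the function names (`sad`, `extract_block`, `best_match`), the full-search strategy, and the list-of-lists frame representation are assumptions made purely for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def extract_block(frame, top, left, size):
    """Copy a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_match(current, reference, top, left, size, search_range):
    """Full search: return (sad, dy, dx) of the lowest-SAD candidate block.

    Candidates that fall outside the reference frame are skipped.
    Hypothetical helper, not part of SAVE-htlp.
    """
    target = extract_block(current, top, left, size)
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(target, extract_block(reference, y, x, size))
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


# Toy frames: the 2x2 pattern at (0, 0) in `current` sits at (1, 1) in `reference`.
reference = [
    [0, 0, 0, 0],
    [0, 9, 8, 0],
    [0, 7, 6, 0],
    [0, 0, 0, 0],
]
current = [
    [9, 8, 0, 0],
    [7, 6, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
result = best_match(current, reference, top=0, left=0, size=2, search_range=1)
```

In this sketch the per-candidate SAD computations are independent of one another, which is the kind of property that makes such a kernel a candidate for concurrent execution on different PE types.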