On the Influence of Data Migration in Dynamic Thread Management of Parallel Applications
Many parallel applications do not scale as the number of threads increases, which means that using the maximum number of threads will not always deliver the best performance or the lowest energy consumption. Many works have therefore proposed tuning strategies, either online or offline, to optimize performance or energy. Online tuning approaches (i.e., those applied at execution time) are the most efficient, since they can capture intrinsic characteristics that can only be known at run-time (e.g., input set, current load balance, microarchitecture details). However, this dynamic nature requires fast design space exploration, since online tuning always adds overhead to the execution. In this context, this work investigates how parallel regions may influence each other during execution and shows that data migration events may represent a considerable overhead, affecting the progress of the tuning and impacting both execution time and energy consumption. Hence, we demonstrate why many approaches will very likely fail when applied to simulated environments, or will struggle to reach a near-optimal solution when executed on real hardware.