Automatic Tuning TLP and DVFS for EDP with a Non-Intrusive Genetic Algorithm Framework
Abstract
New applications have been pushing multithreaded processing to new levels of performance and energy requirements. However, several factors prevent linear improvements when exploiting thread-level parallelism (TLP), so using the maximum number of available cores at the highest possible operating frequency does not always deliver the best performance or energy consumption. These non-functional requirements can therefore be improved by tuning the number of threads and the Dynamic Voltage and Frequency Scaling (DVFS) settings of the processor. However, applications with distinct behaviors comprise many parallel regions, and they run on systems with different core counts and a wide range of operating frequencies. Because the number of possible configurations grows exponentially, the problem cannot be solved efficiently by exhaustive search. This work proposes using a Genetic Algorithm to statically find the best configuration for any OpenMP parallel application, aiming to optimize performance and energy. Our framework is entirely non-intrusive: the design space exploration requires no changes to source or binary code, so even already compiled code can be optimized. Across eight benchmarks, we improve the energy-delay product (EDP) by 20.4% on average.
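To make the search problem concrete, the following is a minimal sketch of a genetic algorithm that picks a (thread count, frequency) pair for each parallel region so as to minimize EDP (energy multiplied by execution time). It is not the framework described above. The thread counts, frequencies, number of regions, the `synthetic_edp` cost model, and the genetic operators are all assumptions chosen for illustration; a real setup would replace `synthetic_edp` with measured time and energy of the running application.

```python
import random

# Hypothetical search space (assumed values, not taken from the paper).
THREADS = [1, 2, 4, 8, 16]
FREQS_GHZ = [1.2, 1.6, 2.0, 2.4, 2.8]
NUM_REGIONS = 6  # assumed number of parallel regions


def design_space_size(num_regions, threads=THREADS, freqs=FREQS_GHZ):
    """Configurations available when every region is tuned independently."""
    return (len(threads) * len(freqs)) ** num_regions


def synthetic_edp(config):
    """Toy stand-in for a measurement: returns total energy * total time."""
    total_time = 0.0
    total_energy = 0.0
    for region, (threads, freq) in enumerate(config):
        parallel_fraction = 0.5 + 0.08 * region  # regions scale differently
        work = 10.0                              # arbitrary work units
        time = work * ((1 - parallel_fraction) + parallel_fraction / threads) / freq
        time += 0.05 * threads                   # synchronization overhead
        power = 5.0 + threads * 0.8 * freq ** 3  # static + dynamic power
        total_time += time
        total_energy += power * time
    return total_energy * total_time


def random_config(rng):
    return [(rng.choice(THREADS), rng.choice(FREQS_GHZ)) for _ in range(NUM_REGIONS)]


def crossover(a, b, rng):
    """Single-point crossover: regions before the cut from a, the rest from b."""
    cut = rng.randrange(1, NUM_REGIONS)
    return a[:cut] + b[cut:]


def mutate(config, rng, rate=0.2):
    """Re-draw each region's setting with probability `rate`."""
    return [
        (rng.choice(THREADS), rng.choice(FREQS_GHZ)) if rng.random() < rate else gene
        for gene in config
    ]


def tournament(population, rng, k=3):
    """Pick the lowest-EDP individual among k random candidates."""
    return min(rng.sample(population, k), key=synthetic_edp)


def genetic_search(generations=40, pop_size=30, seed=0):
    rng = random.Random(seed)
    population = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = [min(population, key=synthetic_edp)]  # elitism: keep the best
        while len(children) < pop_size:
            child = crossover(tournament(population, rng), tournament(population, rng), rng)
            children.append(mutate(child, rng))
        population = children
    return min(population, key=synthetic_edp)


if __name__ == "__main__":
    baseline = [(max(THREADS), max(FREQS_GHZ))] * NUM_REGIONS
    best = genetic_search()
    print("configurations:", design_space_size(NUM_REGIONS))
    print("baseline EDP (max threads, max frequency):", round(synthetic_edp(baseline), 1))
    print("GA best EDP:", round(synthetic_edp(best), 1))
    print("GA best config:", best)
```

With these assumed values there are already 25^6 configurations for six regions, which is why sampling the space with a population-based search is attractive compared with trying every combination. Elitism keeps the best individual found so far, so the reported EDP never gets worse from one generation to the next.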