PampaAffinity: Optimizing Parallel Applications via Dynamic and Transparent Tuning of the Degree of Parallelism and Thread Mapping

Valmir T. Junior; Thiarles S. Medeiros; Janaína Schwarzrock; Samuel Xavier-de-Souza; Fábio D. Rossi; Marcelo C. Luizelli; Antonio Carlos S. Beck; Arthur F. Lorenzon

doi:10.5753/wscad.2021.18525

Valmir T. Junior UNIPAMPA
Thiarles S. Medeiros UNIPAMPA
Janaína Schwarzrock UFRGS
Samuel Xavier-de-Souza UFRN
Fábio D. Rossi IFFar
Marcelo C. Luizelli UNIPAMPA
Antonio Carlos S. Beck UFRGS
Arthur F. Lorenzon UNIPAMPA

Abstract

Developing applications that use computational resources efficiently has become a challenge for users, because characteristics of both software and hardware limit the scalability of many parallel applications. Strategies that dynamically tune the number of threads and the mapping of threads to processing cores have therefore been employed to optimize the use of these resources. However, the exploration space grows significantly with the number of cores in the architecture, which makes finding an ideal configuration of degree of parallelism and thread mapping challenging. We therefore propose PampaAffinity, a dynamic, automatic, and user-transparent approach that tunes the number of threads and the thread mapping policy for each parallel region of OpenMP applications. By running thirteen applications on three multicore architectures, we show that PampaAffinity converges to an ideal solution with an average accuracy of 85% and improves the trade-off between performance and energy consumption by 96.1% compared with the standard way of executing parallel applications.
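To make the size of the exploration space concrete, the following is a minimal sketch of an exhaustive baseline search over (number of threads, mapping policy) pairs for a single parallel region. It is not PampaAffinity's algorithm, which the abstract does not describe. The policy names are borrowed from OpenMP's `proc_bind` clause as an assumed candidate set, the energy-delay product is an assumed stand-in for the performance/energy trade-off, and `measure` and `toy_measure` are hypothetical helpers backed by synthetic numbers rather than real measurements.

```python
from itertools import product

# Assumption: candidate mapping policies, named after OpenMP's proc_bind values.
POLICIES = ("master", "close", "spread")


def search_space(num_cores, policies=POLICIES):
    """Enumerate every (threads, policy) pair for one parallel region."""
    return list(product(range(1, num_cores + 1), policies))


def best_config(num_cores, measure, policies=POLICIES):
    """Exhaustively pick the pair with the lowest cost.

    `measure(threads, policy)` is a hypothetical callback that returns
    (time_seconds, energy_joules) for one execution of the region.
    """
    def cost(config):
        time_s, energy_j = measure(*config)
        # Assumption: energy-delay product as the trade-off metric.
        return time_s * energy_j

    return min(search_space(num_cores, policies), key=cost)


def toy_measure(threads, policy):
    """Synthetic cost model for illustration only, not measured data."""
    # Made-up fixed time penalty per policy.
    penalty = {"master": 0.5, "close": 0.1, "spread": 0.0}[policy]
    # Parallel speedup plus a synchronization overhead that grows with threads.
    time_s = 1.0 / threads + 0.02 * threads + penalty
    # Energy modeled as time multiplied by a thread-dependent power draw.
    energy_j = time_s * (10.0 + 2.0 * threads)
    return time_s, energy_j


if __name__ == "__main__":
    print(len(search_space(8)))
    print(best_config(8, toy_measure))
```

Even on an 8-core machine with three policies, this baseline has to evaluate 24 configurations for every parallel region, and the count grows linearly with the core count and with the number of policies considered, which is why an approach that converges without exhaustive enumeration is attractive.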
