Minicursos da XXI Escola Regional de Alto Desempenho da Região Sul

Authors

Andrea Charão (ed.)
UFSM
Matheus Serpa (ed.)
UFRGS

Synopsis

This book presents textual versions, in the form of book chapters, of the six short courses accepted and presented at the XXI Escola Regional de Alto Desempenho da Região Sul (ERAD/RS). The courses address technical aspects of parallel programming on different architectures and environments, as well as tools for optimizing applications and running reproducible experiments.

In the first chapter, "Desvendando o Uso de Contadores de Hardware para Otimizar Aplicações de Inteligência Artificial", the authors discuss using hardware counters on the Intel Xeon Cascade Lake and NEC SX-Aurora TSUBASA architectures to analyze the performance of AI applications, which are increasingly common today. In the second chapter, "Otimização de Programas Paralelos com uso do OpenACC", the authors present techniques for improving the performance of parallel programs built with OpenACC directives, a programming model applicable to many kinds of parallel architectures. In the third chapter, "Are you root? Experimentos Reprodutíveis em Espaço de Usuário", the authors cover techniques for building user-space environments aimed at reproducible experiments, using the Spack package manager and containers created with Docker and Singularity. In the fourth chapter, "Além de Simplesmente: #pragma omp parallel for", the authors explore more recent and less widely known OpenMP features, going beyond the loop parallelism usually taught in introductory courses. In the fifth chapter, "Ambiente de Nuvem Computacional Privada para Teste e Desenvolvimento de Programas Paralelos", the authors introduce the basics of deploying a private cloud and demonstrate its benefits for developing and testing parallel programs. In the sixth chapter, "Desenvolvimento de Aplicações Baseadas em Tarefas com OpenMP Tasks", the authors present the task-based parallel programming paradigm, with examples of building programs with OpenMP tasks.


Chapters:

1. Desvendando o Uso de Contadores de Hardware para Otimizar Aplicações de Inteligência Artificial
Valéria Girelli, Félix Michels, Francis Moreira, Philippe Navaux
2. Otimização de Programas Paralelos com uso do OpenACC
Evaldo Costa, Gabriel Silva
3. Are you root? Experimentos Reprodutíveis em Espaço de Usuário
Jessica Dagostini, Vinícius Pinto, Lucas Leandro Nesi, Lucas Schnorr
4. Além de Simplesmente: #pragma omp parallel for
João Vicente Lima, Claudio Schepke, Natiele Lucca
5. Ambiente de Nuvem Computacional Privada para Teste e Desenvolvimento de Programas Paralelos
Anderson Maliszewski, Adriano Vogel, Dalvan Griebler, Claudio Schepke, Philippe Navaux
6. Desenvolvimento de Aplicações Baseadas em Tarefas com OpenMP Tasks
Lucas Leandro Nesi, Marcelo Miletto, Vinícius Pinto, Lucas Schnorr



Publication date
14/04/2021

Available publication format: Full Volume

ISBN-13
978-65-87003-50-4