Detecting Performance Anomalies in Task-Based High-Performance Applications on Hybrid Clusters

  • Vinicius Garcia Pinto, UFRGS / Univ. Grenoble Alpes / Inria / CNRS / Grenoble INP / LIG
  • Lucas Mello Schnorr, UFRGS
  • Arnaud Legrand, Univ. Grenoble Alpes / Inria / CNRS / Grenoble INP / LIG
  • Samuel Thibault, Inria Bordeaux Sud-Ouest
  • Luka Stanisic, Max Planck Computing and Data Facility
  • Vincent Danjean, Univ. Grenoble Alpes / Inria / CNRS / Grenoble INP / LIG

Abstract

Programming paradigms in High-Performance Computing are shifting toward task-based models that can adapt to supercomputers with heterogeneous, scalable architectures. Detecting performance anomalies in this setting is particularly difficult because it must account for the heterogeneity of the architecture, performance variability, and the ability to obtain reliable measurements. This work presents a case study on detecting anomalies in the execution of the well-known tiled Cholesky factorization implemented with StarPU. The experiments were conducted on a variety of platforms with multiple hybrid nodes to demonstrate the ability to detect and highlight performance anomalies.

Published
22/07/2018
PINTO, Vinicius Garcia; SCHNORR, Lucas Mello; LEGRAND, Arnaud; THIBAULT, Samuel; STANISIC, Luka; DANJEAN, Vincent. Detecção de Anomalias de Desempenho em Aplicações de Alto Desempenho baseadas em Tarefas em Clusters Híbridos. In: WORKSHOP EM DESEMPENHO DE SISTEMAS COMPUTACIONAIS E DE COMUNICAÇÃO (WPERFORMANCE), 17., 2018, Natal. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 85-98. ISSN 2595-6167. DOI: https://doi.org/10.5753/wperformance.2018.3344.