Spark Scalability Analysis in a Scientific Workflow

  • Renan Souza, Universidade Federal do Rio de Janeiro / IBM Research
  • Vítor Silva, Universidade Federal do Rio de Janeiro
  • Pedro Miranda, Universidade Federal do Rio de Janeiro
  • Alexandre A. B. Lima, Universidade Federal do Rio de Janeiro
  • Patrick Valduriez, Inria / LIRMM
  • Marta Mattoso, Universidade Federal do Rio de Janeiro

Abstract


Spark is being successfully used for big data parallel processing in many business domains (social media, finance, retail). Spark's scalability, usability, and large user community have motivated developers from scientific domains (bioinformatics, oil and gas, astronomy) to try it. However, the profile of scientific applications, e.g., black-box programs and intense file writes, differs from that of traditional business workflows, which may affect Spark's scalability. We present a scalability analysis of Spark in a real case study in the oil and gas domain. We explore workloads on a 936-core HPC cluster processing 330 GB of scientific data. We show that Spark scales very well when running long-lasting scientific tasks, but its performance is lower for short-duration tasks.
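One way to see why task duration could matter for scalability is a toy cost model in which every task pays a fixed scheduling and launch overhead. The sketch below is only an illustration under that assumption: the function names, the per-task overhead, and the task durations are hypothetical and are not taken from the paper's experiments or measurements. Only the core count (936) comes from the text above.

```python
import math


def makespan(n_tasks: int, task_seconds: float, cores: int,
             overhead_seconds: float) -> float:
    """Idealized wall time: tasks run in waves of `cores` tasks, and every
    task pays a fixed (assumed) overhead on top of its useful work."""
    waves = math.ceil(n_tasks / cores)
    return waves * (task_seconds + overhead_seconds)


def efficiency(n_tasks: int, task_seconds: float, cores: int,
               overhead_seconds: float) -> float:
    """Parallel efficiency: overhead-free serial time divided by
    (cores * parallel wall time). 1.0 means perfect scaling."""
    serial = n_tasks * task_seconds
    parallel = makespan(n_tasks, task_seconds, cores, overhead_seconds)
    return serial / (cores * parallel)


if __name__ == "__main__":
    CORES = 936          # cluster size mentioned in the abstract
    OVERHEAD = 1.0       # hypothetical per-task overhead, in seconds
    for seconds in (1.0, 10.0, 99.0):  # hypothetical task durations
        e = efficiency(4 * CORES, seconds, CORES, OVERHEAD)
        print(f"task = {seconds:5.1f} s  ->  efficiency = {e:.2f}")
```

In this model the efficiency reduces to task time divided by task time plus overhead whenever the task count is a multiple of the core count, so longer tasks amortize the fixed cost better. Real Spark behavior also depends on factors the model ignores, such as file I/O and data shuffling.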

Keywords: Spark, Big Data, Parallel Processing, Real Case Study

Published
2 October 2017
SOUZA, Renan; SILVA, Vítor; MIRANDA, Pedro; LIMA, Alexandre A. B.; VALDURIEZ, Patrick; MATTOSO, Marta. Spark Scalability Analysis in a Scientific Workflow. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 32., 2017, Uberlândia/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2017. p. 288-293. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2017.174092.