Towards the Optimization of Operators over UDF in Spark

  • João Antonio Ferreira CEFET/RJ
  • Fábio Porto LNCC
  • Rafaelli Coutinho CEFET/RJ
  • Eduardo Ogasawara CEFET/RJ

Abstract


Large-scale data analysis has gained importance in the scientific community with the rise of Big Data. In this context, user-defined functions (UDFs) are commonly implemented in frameworks such as Apache Spark to enable large-scale data analysis. However, UDFs are opaque to the execution engine, which makes their execution hard to optimize. This work proposes a method for optimizing UDF-based data analysis workflows on Apache Spark. The method builds on Spark SQL's Catalyst API and Scala macros.
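To illustrate what opacity means for an optimizer, the following is a minimal toy sketch in plain Python. It is not Catalyst's API and not the method proposed in this work; the names `GreaterThan`, `OpaqueUdf`, and `simplify_conjunction` are hypothetical and exist only for this example. It assumes a rewrite rule that merges redundant range predicates: the rule can act on a declarative predicate whose structure it can inspect, but must leave a wrapped user function untouched.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass(frozen=True)
class GreaterThan:
    """Declarative predicate `column > value`; its structure is visible."""
    column: str
    value: float


@dataclass(frozen=True)
class OpaqueUdf:
    """Hypothetical wrapper for user code; only a callable is visible."""
    fn: Callable[[dict], bool]


Predicate = Union[GreaterThan, OpaqueUdf]


def simplify_conjunction(preds: List[Predicate]) -> List[Predicate]:
    """Simplify a conjunction (AND) of predicates.

    For each column, keep only the tightest GreaterThan bound, since
    `x > 5 AND x > 10` is equivalent to `x > 10`. Opaque UDFs cannot be
    inspected, so they are passed through unchanged.
    """
    tightest: Dict[str, GreaterThan] = {}
    opaque: List[OpaqueUdf] = []
    for pred in preds:
        if isinstance(pred, GreaterThan):
            current = tightest.get(pred.column)
            if current is None or pred.value > current.value:
                tightest[pred.column] = pred
        else:
            opaque.append(pred)
    return list(tightest.values()) + opaque


# The same two filters, written declaratively and as opaque functions.
declarative = [GreaterThan("temp", 5.0), GreaterThan("temp", 10.0)]
opaque = [
    OpaqueUdf(lambda row: row["temp"] > 5.0),
    OpaqueUdf(lambda row: row["temp"] > 10.0),
]

print(len(simplify_conjunction(declarative)))  # 1
print(len(simplify_conjunction(opaque)))       # 2
```

Both lists express the same logical condition, yet only the declarative form is reduced to a single predicate. An optimizer that could look inside the function bodies could apply the same rewrite to the second list.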

Published
2018-07-26
FERREIRA, João Antonio; PORTO, Fábio; COUTINHO, Rafaelli; OGASAWARA, Eduardo. Towards the Optimization of Operators over UDF in Spark. In: BRAZILIAN E-SCIENCE WORKSHOP (BRESCI), 12., 2018, Natal. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 89-92. ISSN 2763-8774. DOI: https://doi.org/10.5753/bresci.2018.3280.