Machine Learning for Spark Application Runtime Prediction

  • Alexandre Maros, Universidade Federal de Minas Gerais
  • Jussara M. Almeida, Universidade Federal de Minas Gerais
  • Fabricio Murai, Universidade Federal de Minas Gerais
  • Ana Paula Couto da Silva, Universidade Federal de Minas Gerais
  • Danilo Ardagna, Politecnico di Milano
  • Marco Lattuada, Politecnico di Milano

Abstract


The rise of big data applications has brought a series of difficult challenges regarding the allocation of hardware and software resources. These applications are typically computationally expensive and highly heterogeneous in how they operate, which makes estimating an application's execution time very challenging. It may still be possible, however, to correlate features extracted from the cloud environment and from the input dataset with the execution time, and such a relationship may then be used to predict execution times. Based on this assumption, this work explores machine learning (ML) models for the task of predicting the execution time of Spark applications. It investigates four ML models as well as different features, and compares their results against the current state of the art. All models are evaluated in several scenarios and configurations, producing results that are significantly superior to the state of the art in various cases.

Keywords: Big Data, Spark, Machine Learning

Published
2019-05-06
MAROS, Alexandre; ALMEIDA, Jussara M.; MURAI, Fabricio; DA SILVA, Ana Paula Couto; ARDAGNA, Danilo; LATTUADA, Marco. Machine Learning for Spark Application Runtime Prediction. In: BRAZILIAN SYMPOSIUM ON COMPUTER NETWORKS AND DISTRIBUTED SYSTEMS (SBRC), 37., 2019, Gramado. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 197-210. ISSN 2177-9384. DOI: https://doi.org/10.5753/sbrc.2019.7360.