Configuration Parameters Relevant to Task Execution Time in Apache Spark

  • Maria Carolina Lins Nunes (UNIVASF)
  • Jairson Barbosa Rodrigues (UNIVASF)

Abstract


Traditional centralized systems cannot cope with big data workloads. Distributed computing platforms such as Apache Spark have been widely adopted, but configuring their parameters is challenging given the number of factors and their interactions. This work applies Design of Experiments (DoE) techniques to screen the software factors most relevant to the execution time of a distributed Naïve Bayes machine learning task run on a 14.88 GB subset of the Corpus PT7 WEB. Using a fractional factorial design with 192 experimental units and linear regression with backward elimination, a model was obtained that identifies the factors most relevant to task execution time in the analyzed context.
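
As a rough illustration of the screening approach described in the abstract, the sketch below builds a small two-level fractional factorial design and applies backward elimination to a linear model of the response. Everything in it is an assumption made for illustration: the factor names A to D stand in for Spark configuration parameters, the 2^(4-1) design with generator D = ABC and three replicates is not the paper's 192-unit design, the response is synthetic, and the |t| < 2 stopping rule is a simplification of the usual p-value criterion.

```python
import itertools
import math
import random


def fractional_factorial(base, generators):
    """Two-level design in coded units (-1/+1).

    base: names of the factors that receive a full factorial.
    generators: maps each extra factor to the base factors whose
    product defines its column, e.g. {"D": ("A", "B", "C")}.
    """
    runs = []
    for levels in itertools.product((-1, 1), repeat=len(base)):
        run = dict(zip(base, levels))
        for name, parents in generators.items():
            run[name] = math.prod(run[p] for p in parents)
        runs.append(run)
    return runs


def fit_orthogonal(runs, y, terms):
    """Least squares for mutually orthogonal +/-1 columns.

    With such columns X'X = n*I, so each coefficient is sum(x*y)/n
    and all coefficients share the same standard error.
    """
    n = len(runs)
    intercept = sum(y) / n
    coef = {t: sum(r[t] * yi for r, yi in zip(runs, y)) / n for t in terms}
    resid = [yi - intercept - sum(coef[t] * r[t] for t in terms)
             for r, yi in zip(runs, y)]
    dof = n - len(terms) - 1
    mse = sum(e * e for e in resid) / dof
    se = math.sqrt(mse / n)
    return coef, {t: abs(coef[t]) / se for t in terms}


def backward_elimination(runs, y, terms, t_cut=2.0):
    """Repeatedly drop the weakest term until every |t| >= t_cut."""
    terms = list(terms)
    while terms:
        _, t_abs = fit_orthogonal(runs, y, terms)
        weakest = min(terms, key=t_abs.get)
        if t_abs[weakest] >= t_cut:
            break
        terms.remove(weakest)
    return terms


# Hypothetical experiment: 8 distinct runs, 3 replicates each (24 units).
design = fractional_factorial(("A", "B", "C"), {"D": ("A", "B", "C")}) * 3

# Synthetic execution times: only A and C truly affect the response.
rng = random.Random(0)
y = [100 + 20 * r["A"] - 8 * r["C"] + rng.gauss(0, 1) for r in design]

kept = backward_elimination(design, y, ["A", "B", "C", "D"])
print("factors kept:", kept)
```

Because the coded columns are mutually orthogonal, dropping a term leaves the remaining coefficients unchanged and only the residual variance is re-estimated. With unbalanced or non-orthogonal data, a general least-squares solver would be needed instead.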

Published
21/07/2024
NUNES, Maria Carolina Lins; RODRIGUES, Jairson Barbosa. Parâmetros de Configuração Relevantes para o Tempo de Execução de Tarefas no Apache Spark. In: WORKSHOP EM DESEMPENHO DE SISTEMAS COMPUTACIONAIS E DE COMUNICAÇÃO (WPERFORMANCE), 23., 2024, Brasília/DF. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 73-84. ISSN 2595-6167. DOI: https://doi.org/10.5753/wperformance.2024.2821.