Parameter Optimization in Big Data Applications Based on Multiple Frameworks

Bruna de Mello Almeida; Yuri Frota; Daniel de Oliveira

doi:10.5753/sbbd.2024.240405

Bruna de Mello Almeida, Universidade Federal Fluminense
Yuri Frota, Universidade Federal Fluminense
Daniel de Oliveira, Universidade Federal Fluminense

Abstract

Database management systems and distributed computing frameworks are crucial for applications that process large volumes of data. Configuring them manually is complex because of the number of parameters and their interdependence, both within a single framework and across frameworks. Current automatic solutions require many examples and do not optimize the integration between systems. This paper evaluates a model-independent approach for jointly optimizing the parameters of Apache Spark and Cassandra. The results show improvements of up to 69.99% when the parameters are optimized jointly, compared with the default parameter values.

Keywords: Parameter optimization, Spark, Cassandra, irace
