Automatic Parameter Optimization in Big Data Applications Based on Multiple Frameworks

  • Bruna de Mello Almeida, Fluminense Federal University
  • Yuri Frota, Fluminense Federal University
  • Daniel de Oliveira, Fluminense Federal University

Abstract


Database management systems and distributed computing frameworks are crucial for applications that process large volumes of data. Configuring them manually is complex because their parameters are numerous and interdependent, both within a single framework and across frameworks. Current automatic solutions require many examples and do not optimize the integration between systems. This paper evaluates a model-independent approach that optimizes Apache Spark and Cassandra parameters in an integrated way. The results show performance improvements of up to 69.99% with integrated parameter optimization compared to the default parameter values.
Keywords: Parameter optimization, Spark, Cassandra, irace
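
To make the idea of integrated tuning concrete, the following is a minimal Python sketch of a racing-style search over a joint Spark and Cassandra configuration space. It is an illustration under stated assumptions, not the procedure used in the paper: the parameter names are real Spark and Cassandra settings, but the value ranges, the synthetic cost function `synthetic_runtime`, and the fixed-margin elimination rule are all hypothetical. In particular, irace discards candidates with statistical tests, whereas this sketch uses a plain threshold on mean cost.

```python
import random
from statistics import mean

# Hypothetical joint search space. The parameter names exist in Spark and
# Cassandra, but the candidate values are illustrative assumptions only.
SPACE = {
    "spark.sql.shuffle.partitions": [50, 100, 200, 400],
    "spark.executor.memory": ["2g", "4g", "8g"],
    "concurrent_reads": [16, 32, 64],
    "concurrent_writes": [16, 32, 64],
}


def sample_config(rng):
    """Draw one configuration covering both frameworks at once."""
    return {name: rng.choice(values) for name, values in SPACE.items()}


def synthetic_runtime(cfg, instance, rng):
    """Toy stand-in for running a workload; NOT a real performance model.

    The penalty couples a Spark parameter with a Cassandra parameter to
    mimic an inter-framework dependency. `instance` is unused here; a real
    setup would select a workload or dataset with it.
    """
    ratio = cfg["spark.sql.shuffle.partitions"] / cfg["concurrent_reads"]
    interaction_penalty = abs(ratio - 6.25) * 4.0
    memory_penalty = {"2g": 10.0, "4g": 0.0, "8g": -5.0}[cfg["spark.executor.memory"]]
    noise = rng.uniform(-2.0, 2.0)
    return 100.0 + interaction_penalty + memory_penalty + noise


def race(candidates, instances, rng, first_test=3, margin=1.10):
    """Evaluate candidates instance by instance and drop clearly worse ones.

    Simplified rule: after `first_test` instances, a candidate is discarded
    when its mean cost exceeds the best mean cost by more than `margin`.
    """
    costs = {i: [] for i in range(len(candidates))}
    alive = set(costs)
    for step, instance in enumerate(instances, start=1):
        for i in sorted(alive):
            costs[i].append(synthetic_runtime(candidates[i], instance, rng))
        if step >= first_test and len(alive) > 1:
            best = min(mean(costs[i]) for i in alive)
            alive = {i for i in alive if mean(costs[i]) <= best * margin}
    winner = min(alive, key=lambda i: mean(costs[i]))
    return candidates[winner], mean(costs[winner])


def improvement_pct(default_time, tuned_time):
    """Relative improvement of a tuned run over the default configuration."""
    return (default_time - tuned_time) / default_time * 100.0


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = [sample_config(rng) for _ in range(20)]
    best_cfg, best_cost = race(candidates, range(10), rng)
    print(best_cfg, round(best_cost, 2))
```

Sampling both frameworks in a single configuration is what lets the search account for cross-framework interactions; tuning each framework separately would never evaluate such combinations together. With `improvement_pct`, a hypothetical default runtime of 100.0 s and a tuned runtime of 30.01 s would correspond to an improvement of about 69.99%.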

Published
2024-10-14
ALMEIDA, Bruna de Mello; FROTA, Yuri; DE OLIVEIRA, Daniel. Automatic Parameter Optimization in Big Data Applications Based on Multiple Frameworks. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 418-430. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.240405.