Dynamic Memory Management in Applications with RDD Reuse in Spark

Maurício Matter Donato (UFSM); Rhauani Weber Aita Fazul (UFSM); Patrícia Pitthan Barcelos (UFSM)

DOI: https://doi.org/10.5753/sbesc_estendido.2021.18491

Abstract

The Apache Spark framework uses the LRU (Least Recently Used) algorithm to evict partitions of RDDs (Resilient Distributed Datasets) when memory is overloaded. Although it assumes that recently used partitions will be accessed again in the near future, LRU can degrade the performance of applications with cyclic memory access patterns in which the volume of data handled exceeds the available space. This work presents DMM (Dynamic Memory Management), a dynamic memory management model that instruments application execution to check whether partitions need to be removed and identifies the block to evict based on RDD reuse. The experiments conducted show that DMM can significantly reduce the average application execution time compared with the LRU algorithm natively implemented by Spark, thereby providing better memory utilization and enabling greater stability in the execution of applications on the cluster.
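The cyclic-access problem mentioned above can be illustrated with a small simulation. This is only a toy sketch and not the DMM algorithm, whose details the abstract does not give. It assumes a hypothetical workload of four partitions scanned in a loop with room for three in memory, and it uses a farthest-next-use oracle (a Belady-style policy) as a stand-in for "knowing how blocks will be reused". The names `lru_hits`, `reuse_aware_hits`, and the `p0`..`p3` labels are made up for the example.

```python
from collections import OrderedDict


def lru_hits(accesses, capacity):
    """Count cache hits when the least recently used block is evicted."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return hits


def reuse_aware_hits(accesses, capacity):
    """Count hits when the block reused farthest in the future is evicted.

    Assumption: the whole access sequence is known in advance (an oracle),
    which stands in for reuse information gathered some other way.
    """
    cache = set()
    hits = 0
    for i, block in enumerate(accesses):
        if block in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            def next_use(candidate):
                # Position of the next access to `candidate`, or infinity.
                for j in range(i + 1, len(accesses)):
                    if accesses[j] == candidate:
                        return j
                return float("inf")

            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return hits


# Hypothetical workload: four partitions scanned cyclically, memory holds three.
accesses = ["p0", "p1", "p2", "p3"] * 5
lru = lru_hits(accesses, capacity=3)
aware = reuse_aware_hits(accesses, capacity=3)
print("LRU hits:", lru, "| reuse-aware hits:", aware)
```

In this workload LRU always evicts exactly the partition that is needed next, so every access misses, while a policy that accounts for reuse keeps part of the working set resident.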

Keywords: memory management, data reuse, Apache Spark, LRU, RDD
