Big Data Clusters with Apache Hadoop: A Systematic Literature Mapping

  • João Victor Tabosa de Souza IFS
  • Paulo do Amaral Costa IFS

Abstract


The steadily growing demand for large-scale data processing, driven by the exponential increase in digital data generated by ever more ubiquitous computing, calls for efficient and well-established technologies for handling large data volumes, such as the Apache Hadoop framework. This paper presents a Systematic Literature Mapping (SLM) of works published in the last 10 years on Big Data clusters that used Hadoop with Hard Disk Drives (HDDs) or Solid State Drives (SSDs), in physical or virtualized environments, with the aim of answering five research questions. The search of the scientific literature found few documents, and in particular few that focused on these environments and drew some comparison of performance between them. The experiments were evenly split between virtualized and physical environments. The application most often used in the performance tests was TeraSort, with nine mentions, followed by WordCount with only four.

Published
November 5, 2024
SOUZA, João Victor Tabosa de; COSTA, Paulo do Amaral. Cluster de Big Data com Apache Hadoop: Um Mapeamento Sistemático da Literatura. In: ESCOLA REGIONAL DE COMPUTAÇÃO BAHIA, ALAGOAS E SERGIPE (ERBASE), 24., 2024, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 29-38. DOI: https://doi.org/10.5753/erbase.2024.4447.