Big Data Cluster with Apache Hadoop: A Systematic Literature Mapping

  • João Victor Tabosa de Souza IFS
  • Paulo do Amaral Costa IFS

Abstract


The continuous and growing demand for large-scale data processing, driven by the exponential increase in digital data from increasingly ubiquitous computing, calls for efficient, well-established technologies capable of handling large data volumes, such as the Apache Hadoop framework. This article presents a Systematic Literature Mapping (SLM) of works published in the last 10 years on Big Data clusters that used Hadoop with Hard Disk Drives (HDDs) or Solid State Drives (SSDs), in physical or virtualized environments, with the aim of answering five research questions. The search of the scientific literature found few works, especially ones focused on these environment scenarios, that established any comparative performance relationship. Experiments were balanced between virtualized and physical environments, and the application most often used in performance tests was TeraSort, with nine mentions, followed by WordCount with only four.

Published
2024-11-05
SOUZA, João Victor Tabosa de; COSTA, Paulo do Amaral. Big Data Cluster with Apache Hadoop: A Systematic Literature Mapping. In: REGIONAL SCHOOL ON COMPUTING OF BAHIA, ALAGOAS, AND SERGIPE (ERBASE), 24., 2024, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 29-38. DOI: https://doi.org/10.5753/erbase.2024.4447.