The Node Status as a Prioritization Strategy for Replica Balancing in a HDFS Cluster

Rhauani Fazul; Patrícia Barcelos

doi:10.5753/sbesc_estendido.2020.13097

Rhauani Fazul UFSM
Patrícia Barcelos UFSM

Abstract

Data replication is the main fault tolerance mechanism of HDFS, the Hadoop Distributed File System. Although replication is essential to ensure high availability and reliability, replicas are not always placed evenly among the nodes. The HDFS Balancer is an integrated solution of Apache Hadoop that performs replica balancing by rearranging the data blocks stored in the file system. The Balancer, however, demands high computational effort from the nodes during its operation. This work presents a customization of the HDFS Balancer that considers the status of the nodes as a strategy to minimize the overhead that the balancing operation causes in the cluster. To this end, metrics obtained at runtime are used to prioritize the nodes during data redistribution, so that it occurs primarily between nodes with low communication traffic. In addition, the Balancer operates toward a minimum balance level, which reduces the number of data transfers required to even out the data stored in the cluster. The evaluation results show that the proposed customization reduces the time and bandwidth needed to bring the system into balance.
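
The prioritization idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` fields, the `traffic_mbps` metric, and the `plan_pairs` helper are hypothetical names chosen for illustration. The sketch assumes that a node counts as over-utilized or under-utilized when its utilization differs from the cluster average by more than a threshold (in percentage points), and that candidate nodes with the lowest observed traffic are paired first.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical view of a DataNode; all fields are illustrative."""
    name: str
    used_gb: float
    capacity_gb: float
    traffic_mbps: float  # assumed runtime metric for communication traffic

    @property
    def utilization(self) -> float:
        # Percentage of this node's capacity currently in use.
        return 100.0 * self.used_gb / self.capacity_gb


def plan_pairs(nodes, threshold):
    """Pair over-utilized sources with under-utilized targets,
    taking nodes with the lowest traffic first."""
    total_used = sum(n.used_gb for n in nodes)
    total_capacity = sum(n.capacity_gb for n in nodes)
    average = 100.0 * total_used / total_capacity

    # Nodes outside the band [average - threshold, average + threshold].
    over = [n for n in nodes if n.utilization > average + threshold]
    under = [n for n in nodes if n.utilization < average - threshold]

    # Prioritize by status: least busy nodes are considered first.
    over.sort(key=lambda n: n.traffic_mbps)
    under.sort(key=lambda n: n.traffic_mbps)
    return [(src.name, dst.name) for src, dst in zip(over, under)]


nodes = [
    Node("A", 90, 100, 50.0),
    Node("B", 85, 100, 5.0),
    Node("C", 20, 100, 30.0),
    Node("D", 25, 100, 2.0),
    Node("E", 55, 100, 10.0),
]
print(plan_pairs(nodes, threshold=10.0))
```

In this sketch, a wider threshold leaves more nodes inside the accepted band, so fewer transfers are planned. That is one way to read the "minimum balance level" mentioned in the abstract, under the assumptions stated above.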

Keywords: data replication, replica balancing, balancing overhead, data locality
