A practical analysis of balancing policies for rearranging data replicas in HDFS clusters

Rhauani Weber Aita Fazul; Patrícia Pitthan Barcelos

doi:10.5753/wscad.2022.225856

Rhauani Weber Aita Fazul UFSM
Patrícia Pitthan Barcelos UFSM


Abstract

Data replication is the main fault tolerance mechanism implemented by HDFS. The placement of replicated data across the nodes directly influences replica balancing and data locality, both of which are essential to ensure high reliability and data availability. The HDFS Balancer is the official solution for replica balancing through data redistribution. In this work, we conducted a practical experiment to evaluate three policies for replica rearrangement: datanode, blockpool, and custom. The evaluation results highlight the behavior and effectiveness of each policy. In addition, we investigated the cost of the HDFS Balancer operation and the performance and availability improvements provided by a balanced replica distribution.
