Customized Replica Balancing Policy for the Apache Hadoop HDFS Balancer
Abstract
Data replication is a fundamental mechanism of the Hadoop Distributed File System (HDFS). However, the way data is distributed across the cluster directly affects how well replicas are balanced. The HDFS Balancer is a tool integrated into Hadoop that balances the storage load of each machine by moving data between nodes, but it does not take the specific needs of applications into account when rearranging blocks. This paper proposes a customized balancing policy for the HDFS Balancer based on a system of priorities, which can be adapted and configured according to usage demands. The priorities determine whether HDFS parameters or the cluster topology are considered during the operation, making the balancing more flexible.