Analysis of the Influence of Modeling, Data Format and Processing Tool on the Performance of Hadoop-Hive Based Data Warehouse

Authors

  • Beatriz F. P. de Oliveira, Universidade de Brasília
  • Aline S. O. Valente, Universidade de Brasília
  • Marcio Victorino, Universidade de Brasília
  • Edward Ribeiro, Universidade de Brasília
  • Maristela Holanda, Universidade de Brasília

DOI:

https://doi.org/10.5753/jidm.2022.2516

Keywords:

Data Format, Data Warehouse, Hadoop, Hive, Modeling, Spark

Abstract

With the emergence of Big Data and the continuous growth of data produced by web applications, smartphones, social networks, and other sources, organizations began to invest in alternative solutions that could derive value from this volume of data. In this context, this article evaluates three factors that can significantly influence the performance of Hive queries over Big Data: data modeling, data format, and processing tool. The objective is to present a comparative analysis of Hive performance under the snowflake model and a fully denormalized model. The influence of two table storage formats (CSV and Parquet) and two data processing tools (Hadoop and Spark) was also analyzed comparatively. The analysis uses open data from the Brazilian Army in the Google Cloud environment. It was performed for different data volumes in Hive and for different cluster configurations. The results showed that the Parquet storage format always performed better than CSV, regardless of the model and processing tool selected for the test scenario.

Published

2022-09-21

How to Cite

de Oliveira, B. F. P., Valente, A. S. O., Victorino, M., Ribeiro, E., & Holanda, M. (2022). Analysis of the Influence of Modeling, Data Format and Processing Tool on the Performance of Hadoop-Hive Based Data Warehouse. Journal of Information and Data Management, 13(3). https://doi.org/10.5753/jidm.2022.2516

Section

SBBD 2021 Short papers - Extended papers