Analysis of the Influence of Modeling and Data Format on Data Warehouse Performance Based on Hadoop-Hive

  • Beatriz Fragnan P. de Oliveira, University of Brasilia (UnB)
  • Aline S. Oliveira Valente, University of Brasilia (UnB)
  • Marcio Victorino, University of Brasilia (UnB)
  • Edward Ribeiro, University of Brasilia (UnB)
  • Maristela Holanda, University of Brasilia (UnB)

Abstract


The use of data warehousing in cloud environments has grown. In this context, there is no established model or pattern for how to organize the data. This work therefore presents a comparative performance analysis of the Hive platform under two data models: the snowflake model and a fully denormalized model. The analysis uses Brazilian Army Open Data in the Google Cloud environment. It covers different numbers of rows in Hive, one cluster configuration scenario, and two table storage formats. Tables stored in the Parquet format achieved performance four times better than tables stored in the CSV format.
Keywords: Data warehouse, Hive, NoSQL, Big data, CSV, Parquet, data modeling, data format
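
To make the two data models compared in the abstract concrete, the following is a minimal sketch of the same aggregate query written against a snowflake layout and against a fully denormalized layout. It is an illustration only: it uses Python's built-in sqlite3 module rather than Hive, and every table name, column name, and value is hypothetical and not taken from the paper. It shows the structural difference between the models (joins versus one wide table), not their performance.

```python
import sqlite3

# Hypothetical schema and data: none of these names or values come from the paper.
# SQLite stands in for Hive purely to show the shape of the two models.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Snowflake layout: the fact table points to a dimension,
-- which in turn points to a further normalized sub-dimension.
CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE unit (
    unit_id INTEGER PRIMARY KEY,
    unit_name TEXT,
    region_id INTEGER REFERENCES region(region_id)
);
CREATE TABLE expense_fact (
    unit_id INTEGER REFERENCES unit(unit_id),
    amount REAL
);

-- Fully denormalized layout: one wide table with the
-- dimension attributes repeated on every row.
CREATE TABLE expense_flat (unit_name TEXT, region_name TEXT, amount REAL);
""")

cur.executemany("INSERT INTO region VALUES (?, ?)",
                [(1, "North"), (2, "South")])
cur.executemany("INSERT INTO unit VALUES (?, ?, ?)",
                [(10, "A", 1), (11, "B", 1), (12, "C", 2)])
cur.executemany("INSERT INTO expense_fact VALUES (?, ?)",
                [(10, 100.0), (11, 50.0), (12, 70.0), (10, 30.0)])

# Build the denormalized table by flattening the snowflake joins once.
cur.execute("""
INSERT INTO expense_flat
SELECT u.unit_name, r.region_name, f.amount
FROM expense_fact f
JOIN unit u ON f.unit_id = u.unit_id
JOIN region r ON u.region_id = r.region_id
""")

# Snowflake query: two joins are needed to reach the region name.
snowflake_totals = cur.execute("""
SELECT r.region_name, SUM(f.amount)
FROM expense_fact f
JOIN unit u ON f.unit_id = u.unit_id
JOIN region r ON u.region_id = r.region_id
GROUP BY r.region_name
ORDER BY r.region_name
""").fetchall()

# Denormalized query: same answer, no joins.
flat_totals = cur.execute("""
SELECT region_name, SUM(amount)
FROM expense_flat
GROUP BY region_name
ORDER BY region_name
""").fetchall()

print(snowflake_totals)
print(flat_totals)
```

Both queries return the same totals per region. The trade-off in this sketch is that the denormalized table avoids joins at query time at the cost of repeating dimension attributes on every row.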

Published
2021-10-04
OLIVEIRA, Beatriz Fragnan P. de; VALENTE, Aline S. Oliveira; VICTORINO, Marcio; RIBEIRO, Edward; HOLANDA, Maristela. Analysis of the Influence of Modeling and Data Format on Data Warehouse Performance Based on Hadoop-Hive. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 36., 2021, Rio de Janeiro. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 271-276. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2021.17884.