Exploring multi-core architectures for processing query efficiency in Big Data management systems
Abstract
Big Data Management Systems usually manage each machine as one node in parallel query processing pipeline. In multi-core architectures, they leave several processor cores aside that could contribute to speed-up query processing. In this context, this paper explores the use of all available processor cores, assessing the query processing performance in several scenarios. In particular, we use the concept of worker nodes (which are allocated in cores without disk access) and data nodes (which are allocated in cores with disk access) in the same machine using the MyriaX engine as a base platform that supports this concept. We evaluate several cluster configurations varying the amount of data and worker nodes to process two types of queries (self-join and triangle) in a Twitter dataset. The results show that increasing the I/O parallelism in terms of data nodes is not always the most effective strategy. This reinforces the idea of using worker nodes in the query processing pipeline. In the best scenario, we achieved a speed-up of 2.92 by simply adding worker nodes in the available processing cores.
Keywords:
Big Data, parallel query processing, multi-core, worker nodes
References
Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silberschatz, A. and Rasin, A. (2009). HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Proceedings of the VLDB Endowment (PVLDB), v. 2, n. 1, p. 922–933.
Alsubaiee, S., Behm, A., Grover, R., et al. (2012). ASTERIX: Scalable Warehouse-Style Web Data Integration. International Workshop on Information Integration on the Web, p. 1–4.
Bittorf, M., Bobrovytsky, T., Erickson, C. C. A. C. J., et al. (2015). Impala: A Modern, Open-Source SQL Engine for Hadoop. In Conference on Innovative Data Systems Research (CIDR).
Dageville, B., Cruanes, T., Zukowski, M., et al. (2016). The Snowflake Elastic Data Warehouse. International Conference on Management of Data (SIGMOD), p. 215–226.
Das, S., Agrawal, D. and El Abbadi, A. (2013). ElasTraS: An Elastic, Scalable, and Self-Managing Transactional Database for the Cloud. Transactions on Database Systems (TODS), v. 38, n. 1, p. 5.
DeWitt, D. and Gray, J. (1992). Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM, v. 35, n. 6, p. 85–98.
Gupta, A., Agarwal, D., Tan, D., et al. (2015). Amazon Redshift and the Case for Simpler Data Warehouses. In International Conference on Management of Data (SIGMOD).
Halperin, D., Teixeira de Almeida, V., Choo, L. L., et al. (2014). Demonstration of the Myria Big Data Management Service. International Conference on Management of Data (SIGMOD), p. 881–884.
Hu, X., Tao, Y. and Chung, C.-W. (2013). Massive Graph Triangulation. In SIGMOD.
Isard, M., Budiu, M., Yu, Y., Birrell, A. and Fetterly, D. (2007). Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In European Conference on Computer Systems (EuroSys).
Kim, C., Kaldewey, T., Lee, V. W., et al. (2009). Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-core CPUs. Proceedings of the VLDB Endowment (PVLDB), v. 2, n. 2, p. 1378–1389.
Malewicz, G., Austern, M. H., Bik, A. J., et al. (2010). Pregel: A System for Large-scale Graph Processing. In International Conference on Management of Data (SIGMOD).
Mehta, M. and DeWitt, D. J. (1997). Data Placement in Shared-nothing Parallel Database Systems. The VLDB Journal, v. 6, n. 1, p. 53–72.
Mishra, P. and Eich, M. H. (1992). Join Processing in Relational Databases. ACM Computing Surveys (CSUR), v. 24, n. 1, p. 63–113.
Schneider, D. A. and DeWitt, D. J. (1989). A performance Evaluation of Four Parallel Join Algorithms in a Shared-nothing Multiprocessor Environment. International Conference on Management of Data (SIGMOD), v. 18, p. 110–121.
Stonebraker, M. (1986). The Case for Shared Nothing. IEEE Database Engineering, v. 9, n. 1, p. 4–9.
Wang, J., Baker, T., Balazinska, M., et al. (2017). The Myria Big Data Management and Analytics System and Cloud Service. In Conference on Innovative Data Systems Research (CIDR).
Warneke, D. and Kao, O. (2009). Nephele: Efficient Parallel Data Processing in the Cloud. In Many-Task Computing on Grids and Supercomputers (MTAGS).
Alsubaiee, S., Behm, A., Grover, R., et al. (2012). ASTERIX: Scalable Warehouse-Style Web Data Integration. International Workshop on Information Integration on the Web, p. 1–4.
Bittorf, M., Bobrovytsky, T., Erickson, C. C. A. C. J., et al. (2015). Impala: A Modern, Open-Source SQL Engine for Hadoop. In Conference on Innovative Data Systems Research (CIDR).
Dageville, B., Cruanes, T., Zukowski, M., et al. (2016). The Snowflake Elastic Data Warehouse. International Conference on Management of Data (SIGMOD), p. 215–226.
Das, S., Agrawal, D. and El Abbadi, A. (2013). ElasTraS: An Elastic, Scalable, and Self-Managing Transactional Database for the Cloud. Transactions on Database Systems (TODS), v. 38, n. 1, p. 5.
DeWitt, D. and Gray, J. (1992). Parallel Database Systems: The Future of High Performance Database Systems. Communications of the ACM, v. 35, n. 6, p. 85–98.
Gupta, A., Agarwal, D., Tan, D., et al. (2015). Amazon Redshift and the Case for Simpler Data Warehouses. In International Conference on Management of Data (SIGMOD).
Halperin, D., Teixeira de Almeida, V., Choo, L. L., et al. (2014). Demonstration of the Myria Big Data Management Service. International Conference on Management of Data (SIGMOD), p. 881–884.
Hu, X., Tao, Y. and Chung, C.-W. (2013). Massive Graph Triangulation. In SIGMOD.
Isard, M., Budiu, M., Yu, Y., Birrell, A. and Fetterly, D. (2007). Dryad: Distributed Data-parallel Programs from Sequential Building Blocks. In European Conference on Computer Systems (EuroSys).
Kim, C., Kaldewey, T., Lee, V. W., et al. (2009). Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-core CPUs. Proceedings of the VLDB Endowment (PVLDB), v. 2, n. 2, p. 1378–1389.
Malewicz, G., Austern, M. H., Bik, A. J., et al. (2010). Pregel: A System for Large-scale Graph Processing. In International Conference on Management of Data (SIGMOD).
Mehta, M. and DeWitt, D. J. (1997). Data Placement in Shared-nothing Parallel Database Systems. The VLDB Journal, v. 6, n. 1, p. 53–72.
Mishra, P. and Eich, M. H. (1992). Join Processing in Relational Databases. ACM Computing Surveys (CSUR), v. 24, n. 1, p. 63–113.
Schneider, D. A. and DeWitt, D. J. (1989). A performance Evaluation of Four Parallel Join Algorithms in a Shared-nothing Multiprocessor Environment. International Conference on Management of Data (SIGMOD), v. 18, p. 110–121.
Stonebraker, M. (1986). The Case for Shared Nothing. IEEE Database Engineering, v. 9, n. 1, p. 4–9.
Wang, J., Baker, T., Balazinska, M., et al. (2017). The Myria Big Data Management and Analytics System and Cloud Service. In Conference on Innovative Data Systems Research (CIDR).
Warneke, D. and Kao, O. (2009). Nephele: Efficient Parallel Data Processing in the Cloud. In Many-Task Computing on Grids and Supercomputers (MTAGS).
Published
2017-10-02
How to Cite
DA SILVA, Frank W. R.; DE ALMEIDA, Victor T.; BRAGANHOLO, Vanessa.
Exploring multi-core architectures for processing query efficiency in Big Data management systems. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 32. , 2017, Uberlândia/MG.
Anais [...].
Porto Alegre: Sociedade Brasileira de Computação,
2017
.
p. 52-63.
ISSN 2763-8979.
DOI: https://doi.org/10.5753/sbbd.2017.171397.
