Planning Your SQL-on-Hadoop Deployment Using a Low-Cost Simulation-Based Approach

  • Jun Liu, Intel Corporation
  • Bianny Bian, Intel Corporation
  • Samantika Subramaniam Sury, Intel Corporation

Abstract


The term "SQL-on-Hadoop" has recently gained significant traction [19]. Impala represents an emerging class of SQL-on-Hadoop systems that exploit a shared-nothing parallel database architecture over Hadoop. Impala was designed to close the gap in near-real-time data analytics on the Hadoop stack, and it has been shown to be significantly more efficient than other SQL-on-Hadoop solutions [13]. However, leveraging Impala to handle queries with different business demands is not a trivial task [12], and an improperly deployed Impala cluster may not deliver the expected performance. In this paper, we propose a novel Impala simulation framework that helps IT professionals understand Impala's performance behavior, which simplifies the deployment planning required to enable big data analytics on SQL-on-Hadoop systems. The Impala simulator models the behavior of the complete software stack and simulates the activities of cluster components such as storage, network, processors, and memory. Moreover, the accuracy of the simulation remains high under both software configuration and hardware changes, and the simulator reflects the expected scaling trend with low cost overhead and fast simulation speed. The Impala simulator has been validated against various software and hardware configurations using the well-known TPC-DS benchmark [15], and the simulation results are consistent with the expected behavior. A use case shows how the simulator can be used to solve performance and deployment issues.
Keywords: Software, Hardware, Computer architecture, Metadata, Planning, Servers, Performance, Simulation, Impala, SQL-on-Hadoop, Modeling, SystemC, Simulator, Big data, Optimization, Deployment planning
Published
2016-10-26
LIU, Jun; BIAN, Bianny; SURY, Samantika Subramaniam. Planning Your SQL-on-Hadoop Deployment Using a Low-Cost Simulation-Based Approach. In: INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND HIGH PERFORMANCE COMPUTING (SBAC-PAD), 28., 2016, Los Angeles/USA. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2016. p. 182-189.