Distributed Execution Plans for Multiway Spatial Join Queries using Multidimensional Histograms
Keywords:Distributed Processing, GIS, Multiway Spatial Join, Spatial Data
Multiway spatial join is a common and heavyweight type of query for spatial data processing on relational database management systems. This article presents a complete solution to process this type of query in distributed systems. We proposed a cost-based optimizer for multiway spatial join queries, based on a novel use of multidimensional histograms, which are used to represent two metrics that describe a dataset: cardinality and size of the spatial objects, together with one feature that represents the location of the dataset partitions in the cluster. A new method for histogram construction is proposed, together with a formula to estimate the cost of multiway spatial join queries and to select good executions plans. A greedy algorithm is proposed to schedule query execution while minimizing both the makespan and the communication cost. We adapted the Clone Join algorithm to improve its execution time and to process multiway spatial join queries. The evaluation demonstrated that the new method for histogram construction performs better than a well-know method, and provides a better estimate for the cost of execution plans in 75% of the queries evaluated. The cost estimated by the proposed method is shown to be sufficiently close to real execution times, and our complete methodology was able to use the estimated costs to select the best execution plan for all queries used in the experiments. It also provided a nearly exact ordering of the query execution plans with respect to execution time.