Efficient Processing of Analytical Queries Extended with Similarity Search Predicates over Images in Spark
An image data warehouse extends a conventional data warehouse to also manage images, which are represented by feature vectors and by attributes used for similarity search. A resulting challenge is the efficient processing of analytical queries extended with a similarity search predicate. These queries are computationally expensive because they combine costly star join operations with distance calculations in the same setting. We consider applications that manage huge volumes of data and therefore require parallel and distributed data processing frameworks. In this article, we introduce two methods that address this challenge in Spark. BrOmnImg integrates the broadcast join technique, for processing the star join operation, with the Omni technique, for processing the distance calculations. BrOmnImgCF extends BrOmnImg by using the conventional predicate to further reduce the number of distance calculations. Compared with the closest method available in the literature, BrOmnImg reduced query processing time by up to about 65%. Compared with BrOmnImg, BrOmnImgCF improved performance by up to about 54%.
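To make the two ideas concrete, the following is a minimal single-machine sketch in plain Python, not Spark and not the implementation evaluated in this article. It assumes a range query under Euclidean distance, uses a dictionary lookup as a stand-in for the broadcast dimension table, and applies the standard Omni-style pruning rule based on the triangle inequality: an image can be discarded without a distance calculation when |d(q, f) - d(o, f)| > r for some focus f. The tables, foci, function names, and the `filter_first` flag are all hypothetical, and the point at which the conventional predicate is applied is our assumption about the general idea behind BrOmnImgCF.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy data. The small dimension table plays the role of the
# broadcast side of the star join; the fact table holds the images.
dim_city = {1: "Paris", 2: "Lyon", 3: "Nice"}
fact = [  # (image_id, city_key, feature_vector)
    (10, 1, (0.0, 0.0)),
    (11, 1, (5.0, 5.0)),
    (12, 2, (0.5, 0.5)),
    (13, 3, (0.2, 0.1)),
    (14, 1, (0.3, 0.4)),
]

# Hypothetical foci (pivots). Distances from every image to every focus
# are precomputed once, ahead of query time.
foci = [(0.0, 0.0), (10.0, 0.0)]
omni = {img_id: [euclidean(vec, f) for f in foci] for img_id, _, vec in fact}

def range_query(query_vec, radius, wanted_cities, filter_first):
    """Return (matching image ids, number of full distance calculations).

    filter_first=False: prune with the foci, compute distances, then join.
    filter_first=True:  apply the conventional predicate before pruning
                        (assumed here to mirror the idea of BrOmnImgCF).
    """
    q_coords = [euclidean(query_vec, f) for f in foci]
    # Dictionary lookup standing in for a broadcast hash join.
    allowed_keys = {k for k, name in dim_city.items() if name in wanted_cities}
    distance_calcs = 0
    result = []
    for img_id, city_key, vec in fact:
        if filter_first and city_key not in allowed_keys:
            continue
        # Triangle-inequality pruning: no distance to the image is computed.
        if any(abs(qc - oc) > radius for qc, oc in zip(q_coords, omni[img_id])):
            continue
        distance_calcs += 1
        if euclidean(query_vec, vec) <= radius and city_key in allowed_keys:
            result.append(img_id)
    return sorted(result), distance_calcs

without_cf = range_query((0.0, 0.0), 0.6, {"Paris"}, filter_first=False)
with_cf = range_query((0.0, 0.0), 0.6, {"Paris"}, filter_first=True)
```

On this toy data both variants return the same images, but applying the conventional predicate first leaves fewer candidates for the full distance calculation. The sketch only illustrates the direction of the effect and says nothing about the magnitude of the gains reported above.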