Efficient processing of analytical queries extended with similarity search predicates in Spark
Abstract
Image data warehousing extends conventional data warehousing to also manipulate images, which are represented by feature vectors and attributes for similarity search. A challenge that arises is the efficient processing of analytical queries extended with a similarity search predicate, since these queries have a high computational cost. In this article, we propose the BrOmnImg method, which addresses this challenge efficiently in Spark. Compared to the closest related method, BrOmnImg improved query processing by up to 65.49%.
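To make the query class concrete, the following is a minimal sketch of an analytical query extended with a similarity search predicate. It is not the BrOmnImg method and does not use Spark: it is a plain sequential scan over a toy, in-memory fact table. The table layout, the names `FACTS` and `similar_count_by_year`, the Euclidean distance, and the range-query form of the predicate are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy fact rows: (image_id, year, exam_type, feature_vector).
# A real image data warehouse would hold many more rows and dimensions.
FACTS = [
    (1, 2017, "mammography", (0.10, 0.20)),
    (2, 2017, "mammography", (0.90, 0.80)),
    (3, 2018, "mammography", (0.12, 0.19)),
    (4, 2018, "radiography", (0.11, 0.21)),
]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similar_count_by_year(facts, exam_type, query_vector, radius):
    """Count, per year, the images of one exam type whose feature vector
    lies within `radius` of `query_vector` (a similarity range predicate)."""
    counts = defaultdict(int)
    for _image_id, year, kind, features in facts:
        # Conventional predicate on a descriptive attribute.
        if kind != exam_type:
            continue
        # Similarity search predicate on the feature vector.
        if euclidean(features, query_vector) <= radius:
            counts[year] += 1
    return dict(counts)


result = similar_count_by_year(FACTS, "mammography", (0.10, 0.20), 0.05)
print(result)
```

Every candidate row needs a distance computation in this naive form, which illustrates why such queries are computationally expensive at scale.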
