Efficient processing analytical queries extended with similarity search predicates in Spark

  • Guilherme Muzzi da Rocha USP
  • Cristina Dutra de Aguiar Ciferri USP

Abstract


An image data warehouse extends a conventional data warehouse to also manipulate images, which are represented by feature vectors and attributes for similarity search. A challenge that arises is the efficient processing of analytical queries extended with a similarity search predicate, since these queries have a high computational cost. In this article, we propose the BrOmnImg method, which addresses this challenge in Spark. Compared to the closest related method, BrOmnImg improved query processing by up to 65.49%.

Keywords: Image data warehouse, OLAP queries extended with similarity search predicates, parallel and distributed processing, Spark.

Published
2019-10-07
DA ROCHA, Guilherme Muzzi; CIFERRI, Cristina Dutra de Aguiar. Efficient processing analytical queries extended with similarity search predicates in Spark. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 34., 2019, Fortaleza. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 229-234. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2019.8828.