Efficient Processing of Analytical Queries Extended with Similarity Search Predicates over Images in Spark

Guilherme Muzzi da Rocha; Cristina Dutra de Aguiar Ciferri

doi:10.5753/jidm.2020.2019

Authors

Guilherme Muzzi da Rocha, University of São Paulo
Cristina Dutra de Aguiar Ciferri, University of São Paulo

DOI:

https://doi.org/10.5753/jidm.2020.2019

Keywords:

Image data warehouse, analytical queries extended with a similarity search predicate, parallel and distributed processing, medical images, star join, distance calculations

Abstract

An image data warehouse extends a conventional data warehouse to also handle images, which are represented by feature vectors and by attributes for similarity search. The challenge that arises is the efficient processing of analytical queries extended with a similarity search predicate. These queries are computationally expensive because they combine costly star join operations with distance calculations in the same setting. We consider applications that manage huge volumes of data, which require parallel and distributed data processing frameworks. In this article, we introduce two methods that address this challenge in Spark. BrOmnImg integrates the broadcast join technique, used for the star join operation, with the Omni technique, used for the distance calculations. BrOmnImgCF extends BrOmnImg by using the conventional predicate to further reduce the number of distance calculations. Compared with the closest method available in the literature, BrOmnImg reduced query processing time by up to about 65%. Compared with BrOmnImg, BrOmnImgCF improved performance by up to about 54%.
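
To make the idea of reducing distance calculations concrete, the following is a minimal single-machine sketch in plain Python, not the authors' Spark implementation. It assumes the Omni technique is used in its general pivot-based form: each image stores its distances to a few reference points (foci), and the triangle inequality discards candidates before the full distance is computed. The function names (`omni_coordinates`, `range_query`), the `gender` attribute, the foci, and the sample data are all hypothetical, and the star join is flattened into a single list of records for brevity.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def omni_coordinates(vector, foci):
    """Distances from a vector to each focus (precomputed once per image)."""
    return [euclidean(vector, f) for f in foci]

def range_query(images, foci, query, radius, conventional_filter=None):
    """Return (ids within radius of query, number of full distance calculations).

    conventional_filter is an optional predicate on the non-image attributes.
    Applying it first (an assumption about how a conventional predicate could
    help) means filtered-out images never reach the distance step.
    """
    q_coords = omni_coordinates(query, foci)
    distance_calls = 0
    result = []
    for img in images:
        if conventional_filter is not None and not conventional_filter(img):
            continue
        # Triangle inequality: |d(q, f) - d(s, f)| <= d(q, s) for every focus f.
        # If the left side already exceeds the radius, s cannot qualify.
        if any(abs(qc - sc) > radius for qc, sc in zip(q_coords, img["omni"])):
            continue
        distance_calls += 1
        if euclidean(query, img["features"]) <= radius:
            result.append(img["id"])
    return result, distance_calls

# Hypothetical data: (id, feature vector, conventional attribute).
foci = [(0.0, 0.0), (10.0, 0.0)]
raw = [
    (1, (1.0, 1.0), "F"),
    (2, (1.5, 1.0), "M"),
    (3, (9.0, 9.0), "F"),
    (4, (5.0, 5.0), "F"),
    (5, (1.0, 2.0), "M"),
]
images = [
    {"id": i, "features": v, "gender": g, "omni": omni_coordinates(v, foci)}
    for i, v, g in raw
]

query, radius = (1.0, 1.0), 1.0

# Pruning by foci only.
ids_all, calls_all = range_query(images, foci, query, radius)

# Conventional predicate applied before the similarity predicate.
ids_f, calls_f = range_query(
    images, foci, query, radius,
    conventional_filter=lambda img: img["gender"] == "F",
)

# Brute-force answer, used only to check that pruning loses no results.
brute = [img["id"] for img in images
         if euclidean(query, img["features"]) <= radius]
```

In this toy data set a brute-force scan needs five distance calculations, the foci-based pruning needs three, and adding the conventional predicate leaves one. The sketch illustrates only the ordering of the filters; it says nothing about the distributed execution or the broadcast join used for the star join.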
