Similarity Joins using Distributed Processing and Massive Parallelism

  • Larissa Ramos Marques Silva, Federal University of Goiás
  • Leonardo Andrade Ribeiro, Federal University of Goiás

Abstract


A similarity join returns all pairs of similar objects in a dataset. Because this operation is computationally expensive, its runtime can become excessive on large volumes of data. This paper presents an efficient and scalable similarity join algorithm that exploits the massive parallelism of GPUs in a heterogeneous distributed environment. To this end, a coprocessing model is proposed that distributes the workload between CPU and GPU. Experimental results show that the proposed approach is effective and outperforms previous work.
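
To make the operation concrete, the following minimal sketch computes a similarity join by brute force. It is not the algorithm proposed in the paper: the choice of Jaccard similarity over token sets, the threshold value, the sample records, and the function names `jaccard` and `naive_similarity_join` are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as 0.0 here."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def naive_similarity_join(records, threshold):
    """Return every pair (i, j, sim) with i < j and Jaccard similarity >= threshold.

    Illustrative baseline only (assumed similarity function and threshold);
    it compares all pairs and applies no filtering or parallelism.
    """
    sets = [frozenset(r) for r in records]
    result = []
    for i, j in combinations(range(len(sets)), 2):
        sim = jaccard(sets[i], sets[j])
        if sim >= threshold:
            result.append((i, j, sim))
    return result


# Hypothetical sample data: each record is a list of tokens.
records = [
    ["data", "cleaning", "similarity", "join"],
    ["similarity", "join", "data", "integration"],
    ["gpu", "parallel", "processing"],
]
pairs = naive_similarity_join(records, threshold=0.5)
print(pairs)
```

The baseline compares every pair of records, so the number of comparisons grows quadratically with the dataset size. That growth is the cost that makes the operation expensive on large volumes of data.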

Keywords: similarity join, data integration, data cleaning, advanced query processing, parallel and distributed computing

Published
2022-09-19
SILVA, Larissa Ramos Marques; RIBEIRO, Leonardo Andrade. Similarity Joins using Distributed Processing and Massive Parallelism. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 37., 2022, Búzios. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 421-426. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2022.226212.