GPU Partitioning for the Exact Similarity Join Problem

  • Gabriel Viana Dantas UFG
  • Wellington S. Martins UFG
  • Leonardo A. Ribeiro UFG

Abstract


The exact set similarity join finds all similar pairs between set collections, enabling a variety of applications. Its quadratic complexity demands efficient solutions capable of making use of the GPU’s computing power. The positional filter, used in existing algorithms, requires a large data structure for parallel execution. This work explores a new partitioning scheme for the exact and symmetric self-join, incorporating the positional filter and allowing for massive data parallelism without limiting the size of the index structures used.

References

Augsten, N. and Böhlen, M. H. (2013). Similarity joins in relational database systems. Synthesis Lectures on Data Management, 5(5):1–124.

Chakrabarti, K., Chaudhuri, S., Ganti, V., and Xin, D. (2008). An efficient filter for approximate membership checking. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 805–818.

Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A primitive operator for similarity joins in data cleaning. In 22nd International Conference on Data Engineering (ICDE’06), pages 5–5. IEEE.

Kirk, D. B. and Wen-Mei, W. H. (2016). Programming massively parallel processors: a hands-on approach. Morgan kaufmann.

Mann, W., Augsten, N., and Bouros, P. (2016). An empirical evaluation of set similarity join techniques. Proceedings of the VLDB Endowment, 9(9):636–647.

Quirino, R. D., Ribeiro-Júnior, S., Ribeiro, L. A., and Martins, W. S. (2017). fgssjoin: A gpu-based algorithm for set similarity joins. In ICEIS (1), pages 152–161.

Rajaraman, A. and Ullman, J. D. (2011). Mining of massive datasets. Cambridge University Press.

Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G. (2011). Efficient similarity joins for near-duplicate detection. ACM Transactions on Database Systems (TODS), 36(3):1–41.
Published
2022-10-25
DANTAS, Gabriel Viana; MARTINS, Wellington S.; RIBEIRO, Leonardo A.. GPU Partitioning for the Exact Similarity Join Problem. In: REGIONAL SCHOOL ON INFORMATICS OF GOIÁS (ERI-GO), 10. , 2022, Goiás. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022 . p. 198-201. DOI: https://doi.org/10.5753/erigo.2022.227689.