Bwjoin: A Blockwise GPU-based Algorithm for Set Similarity Joins

  • Rafael D. Quirino UFG
  • André M. Quirino Visiona Tecnologia Espacial
  • Leonardo A. Ribeiro UFG
  • Wellington S. Martins UFG

Resumo


Set similarity joins play a pivotal role in diverse fields, ranging from modern database management systems to near-duplicate detection and even galaxy cluster analysis in cosmology. However, due to their quadratic nature, these operations have been associated with substantial computational costs. To tackle this challenge, parallel solutions have been developed in recent years, spanning algorithms for distributed and shared memory architectures, as well as massively parallel systems like GPU accelerators. In this paper, we propose a new GPU-based algorithm, using the prefix-filtering technique, that harnesses the power of blockwise parallelism, achieving better performance than its competitors, especially for high threshold similarity joins in big datasets.

Referências

Bayardo, R. J., Ma, Y., and Srikant, R. (2007). Scaling up All Pairs Similarity Search. In Proc. of the 16th Intl. Conf. on World Wide Web, pages 131–140.

Bellas, C. and Gounaris, A. (2019). Exact Set Similarity Joins for Large Datasets in the GPGPU Paradigm. In Proceedings of the International Workshop on Data Management on New Hardware, pages 5:1–5:10.

Bellas, C. and Gounaris, A. (2020). An empirical evaluation of exact set similarity join techniques using gpus. Information Systems, 89:101485.

Bellas, C. and Gounaris, A. (2021). HySet: A Hybrid Framework for Exact Set Similarity Join using a GPU. Parallel Computing, 104-105:102790.

Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A Primitive Operator for Similarity Joins in Data Cleaning. In Proc. of the 22nd IEEE Intl. Conf. on Data Engineering, page 5.

Cruz, M. S. H., Kozawa, Y., Amagasa, T., and Kitagawa, H. (2015). GPU Acceleration of Set Similarity Joins. In DEXA, pages 384–398.

Cruz, M. S. H., Kozawa, Y., Amagasa, T., and Kitagawa, H. (2016). Accelerating Set Similarity Joins Using GPUs. T. Large-Scale Dataand Knowledge-Centered Systems, 28:1–22.

Fier, F. and Freytag, J. (2022). Parallelizing Filter-and-verification based Exact Set Similarity Joins on Multicores. Information Systems, 108:101912.

Li, Y., Yu, X., and Koudas, N. (2021). LES3: Learning-based Exact Set Similarity Search. Proceedings of the VLDB Endowment, 14(11):2073–2086.

Mann, W., Augsten, N., and Bouros, P. (2016). An Empirical Evaluation of Set Similarity Join Techniques. PVLDB, 9(9):636–647.

Quirino, R. D., Ribeiro-Júnior, S., Ribeiro, L. A., and Martins, W. S. (2017). fgssjoin: A GPU-based Algorithm for Set Similarity Joins. In Proc. of the 19th Intl. Conf. on Enterprise Information System, pages 152–161.

Ribeiro, L. A. and Härder, T. (2011). Generalizing Prefix Filtering to Improve Set Similarity Joins. Information Systems, 36(1):62–78.

Ribeiro-Júnior, S., Quirino, R. D., Ribeiro, L. A., and Martins, W. S. (2016). gSSJoin: a GPU-based Set Similarity Join Algorithm. In Proc. of the 31st Brazilian Symposium on Databases, pages 64–75.

Ribeiro-Júnior, S., Quirino, R. D., Ribeiro, L. A., and Martins, W. S. (2017). Fast parallel set similarity joins on many-core architectures. J. Inf. Data Manag., 8(3):255–270.

Sarawagi, S. and Kirpal, A. (2004). Efficient Set Joins on Similarity Predicates. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pages 743–754.

Vernica, R., Carey, M. J., and Li, C. (2010). Efficient Parallel Set-similarity Joins using MapReduce. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pages 495–506.

Wang, X., Qin, L., Lin, X., Zhang, Y., and Chang, L. (2017). Leveraging Set Relations in Exact Set Similarity Join. Proceedings of the VLDB Endowment, 10(9):925–936.

Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G. (2011). Efficient Similarity Joins for Near-duplicate Detection. ACM Trans. Database Syst., 36(3):15.
Publicado
17/10/2023
QUIRINO, Rafael D.; QUIRINO, André M.; RIBEIRO, Leonardo A.; MARTINS, Wellington S.. Bwjoin: A Blockwise GPU-based Algorithm for Set Similarity Joins. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 24. , 2023, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023 . p. 121-132. DOI: https://doi.org/10.5753/wscad.2023.235894.