Beyond fgssjoin: Parallel Algorithms for Joins by Similarity of Sets

  • Rafael David Quirino UFG
  • Wellington Santos Martins UFG

Abstract


Set similarity joins are very important operations on modern database systems, specially for the so called data warehouses, where several daily operations like integration, cleaning and data mining use it frequently. Exact algorithms, which yields all possible similar pairs regarding some similarity threshold are computationally expensive, which imposes slowness for analytical workloads and highlights the need for parallel solutions. Recent work shows parallel algorithms designed for devices with many-core architectures like GPUs. In this paper we present a new algorithm for the filtering phase of the fgssjoin, a well known parallel gpu-based algorithm for exact set similarity joins.

References

Bayardo, R. J., Ma, Y., and Srikant, R. (2007). Scaling up All Pairs Similarity Search. In WWW, pages 131–140.

Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A primitive operator for similarity joins in data cleaning. In ICDE, page 5.

Cruz, M. S. H., Kozawa, Y., Amagasa, T., and Kitagawa, H. (2015). GPU Acceleration of Set Similarity Joins. In DEXA, pages 384–398.

Doan, A., Halevy, A. Y., and Ives, Z. G. (2012). Principles of Data Integration. Morgan Kaufmann.

Leskovec, J., Rajaraman, A., and Ullman, J. D. (2014). Mining of Massive Datasets, 2nd Ed. Cambridge University Press.

Quirino, R. D., Ribeiro-Júnior, S., Ribeiro, L. A., and Martins, W. S. (2017). fgssjoin: A gpu-based algorithm for set similarity joins. In Hammoudi, S., Smialek, M., Camp, O., and Filipe, J., editors, ICEIS 2017 - Proceedings of the 19th International Con- ference on Enterprise Information Systems, Volume 1, Porto, Portugal, April 26-29, 2017, pages 152–161. SciTePress.

Ribeiro, L. A. and Harder, T. (2011). Generalizing Prefix Filtering to Improve Set Similarity Joins. Information Systems, 36(1):62–78.

Sarawagi, S. and Kirpal, A. Efficient Set Joins on Similarity Predicates. In SIGMOD.

Vernica, R., Carey, M. J., and Li, C. (2010). Efficient Parallel Set-similarity Joins using MapReduce. In SIGMOD, pages 495–506.

Wang, J., Li, G., and Feng, J. (2012). Can We Beat the Prefix Filtering?: An Adaptive Framework for Similarity Join and Search. In SIGMOD, pages 85–96.

Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G. (2011). Efficient Similarity Joins for Near-duplicate Detection. TODS, 36(3):15.
Published
2022-10-25
QUIRINO, Rafael David; MARTINS, Wellington Santos. Beyond fgssjoin: Parallel Algorithms for Joins by Similarity of Sets. In: REGIONAL SCHOOL ON INFORMATICS OF GOIÁS (ERI-GO), 10. , 2022, Goiás. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022 . p. 185-188. DOI: https://doi.org/10.5753/erigo.2022.227677.