Beyond fgssjoin: Parallel Algorithms for Joins by Similarity of Sets
Abstract
Set similarity joins are very important operations on modern database systems, specially for the so called data warehouses, where several daily operations like integration, cleaning and data mining use it frequently. Exact algorithms, which yields all possible similar pairs regarding some similarity threshold are computationally expensive, which imposes slowness for analytical workloads and highlights the need for parallel solutions. Recent work shows parallel algorithms designed for devices with many-core architectures like GPUs. In this paper we present a new algorithm for the filtering phase of the fgssjoin, a well known parallel gpu-based algorithm for exact set similarity joins.
References
Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A primitive operator for similarity joins in data cleaning. In ICDE, page 5.
Cruz, M. S. H., Kozawa, Y., Amagasa, T., and Kitagawa, H. (2015). GPU Acceleration of Set Similarity Joins. In DEXA, pages 384–398.
Doan, A., Halevy, A. Y., and Ives, Z. G. (2012). Principles of Data Integration. Morgan Kaufmann.
Leskovec, J., Rajaraman, A., and Ullman, J. D. (2014). Mining of Massive Datasets, 2nd Ed. Cambridge University Press.
Quirino, R. D., Ribeiro-Júnior, S., Ribeiro, L. A., and Martins, W. S. (2017). fgssjoin: A gpu-based algorithm for set similarity joins. In Hammoudi, S., Smialek, M., Camp, O., and Filipe, J., editors, ICEIS 2017 - Proceedings of the 19th International Con- ference on Enterprise Information Systems, Volume 1, Porto, Portugal, April 26-29, 2017, pages 152–161. SciTePress.
Ribeiro, L. A. and Harder, T. (2011). Generalizing Prefix Filtering to Improve Set Similarity Joins. Information Systems, 36(1):62–78.
Sarawagi, S. and Kirpal, A. Efficient Set Joins on Similarity Predicates. In SIGMOD.
Vernica, R., Carey, M. J., and Li, C. (2010). Efficient Parallel Set-similarity Joins using MapReduce. In SIGMOD, pages 495–506.
Wang, J., Li, G., and Feng, J. (2012). Can We Beat the Prefix Filtering?: An Adaptive Framework for Similarity Join and Search. In SIGMOD, pages 85–96.
Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G. (2011). Efficient Similarity Joins for Near-duplicate Detection. TODS, 36(3):15.
