gSSJoin: a GPU-based Set Similarity Join Algorithm

Sidney R. Junior; Rafael D. Quirino; Leonardo Andrade Ribeiro; Wellington S. Martins

doi:10.5753/sbbd.2016.24309

Sidney R. Junior Universidade Federal de Goiás
Rafael D. Quirino Universidade Federal de Goiás
Leonardo Andrade Ribeiro Universidade Federal de Goiás
Wellington S. Martins Universidade Federal de Goiás

DOI: https://doi.org/10.5753/sbbd.2016.24309

Resumo

Set similarity join is a core operation for text data integration, cleaning, and mining. Previous research work on improving the performance of set similarity joins mostly focused on sequential, CPU-based algorithms. Main optimizations of such algorithms exploit high threshold values and the underlying data characteristics to derive efficient filters. In this paper, we investigate strategies to accelerate set similarity join using Graphic Processing Units (GPUs). Our approach exploits massive parallelism instead of filtering and, as a result, exhibits much better robustness to variations of threshold values and data distributions. Experimental evaluation shows that we are able to obtain up to 57x speedups over highly optimized CPU-based algorithms.

Palavras-chave: Similarity join, Acceleration using GPUs

Referências

Bayardo, R. J., Ma, Y., and Srikant, R. (2007). Scaling up All Pairs Similarity Search. In WWW, pages 131–140.

Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzenmacher, M. (1998). Min-Wise Independent Permutations (Extended Abstract). In STOC, pages 327–336.

Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A primitive operator for similarity joins in data cleaning. In ICDE, page 5

Cruz, M. S. H., Kozawa, Y., Amagasa, T., and Kitagawa, H. (2015). GPU Acceleration of Set Similarity Joins. In DEXA, pages 384–398.

Hassanzadeh, O., Chiang, F., Miller, R. J., and Lee, H. C. (2009). Framework for Evaluating Clustering Algorithms in Duplicate Detection. PVLDB, 2(1):1282–1293.

Indyk, P. and Motwani, R. (1998). Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In STOC, pages 604–613.

Kirk, D. B. and Wen-mei, W. H. (2012). Programming massively parallel processors: a hands-on approach. Newnes.

Leskovec, J., Rajaraman, A., and Ullman, J. D. (2014). Mining of Massive Datasets, 2nd Ed. Cambridge University Press.

Mann, W., Augsten, N., and Bouros, P. (2016). An Empirical Evaluation of Set Similarity Join Techniques. PVLDB, 9(9):636–647.

Ribeiro, L. A. and Härder, T. (2011). Generalizing Prefix Filtering to Improve Set Similarity Joins. Information Systems, 36(1):62–78.

Sarawagi, S. and Kirpal, A. Efficient Set Joins on Similarity Predicates. In SIGMOD.

Sidney, C. F., Mendes, D. S., Ribeiro, L. A., and Härder, T. (2015). Performance Prediction for Set Similarity Joins. In SAC, pages 967–972.

Vernica, R., Carey, M. J., and Li, C. (2010). Efficient Parallel Set-similarity Joins using MapReduce. In SIGMOD, pages 495–506.

Wang, J., Li, G., and Feng, J. (2012). Can We Beat the Prefix Filtering?: An Adaptive Framework for Similarity Join and Search. In SIGMOD, pages 85–96.

Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G. (2011). Efficient Similarity Joins for Near-duplicate Detection. TODS, 36(3):15.