Efficient Set Similarity Join on Multi-Attribute Data Using Lightweight Filters
Keywords:Advanced Query Processing, Data Cleaning, Data Integration, Multi-Attribute Data, Similarity Join
We consider the problem of efficiently answering set similarity joins on multi-attribute data. Traditional set similarity join algorithms assume string data represented by a single set and, thus, miss the opportunity to exploit predicates over multiple attributes to reduce the number of similarity computations. In this article, we present a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then instantiate this framework with a lightweight filtering technique based on a simple, yet effective data structure, for which exact and probabilistic implementations are evaluated. In this context, we devise a cost model to identify the best attribute ordering to reduce processing time. Moreover, alternative approaches are also investigated and a new algorithm combining key ideas from previous work is introduced. Finally, we present a thorough experimental evaluation, which demonstrates that our main proposal is efficient and significantly outperforms competing algorithms.
Bayardo, R. J., Ma, Y., and Srikant, R. Scaling up All Pairs Similarity Search. In Proceedings of the WWW Conference. Banff, Canada, pp. 131–140, 2007.
Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13 (7): 422–426, 1970.
Broder, A. Z., Charikar, M., Frieze, A. M., and Mitzenmacher, M. Min-Wise Independent Permutations (Extended Abstract). In Proceedings of the STOC Symposium. Dallas, USA, pp. 327–336, 1998.
Chaudhuri, S., Ganti, V., and Kaushik, R. A Primitive Operator for Similarity Joins in Data Cleaning. In Proceedings of the ICDE Conference. Atlanta, USA, pp. 5, 2006.
Chu, X., Ilyas, I. F., Krishnan, S., and Wang, J. Data Cleaning: Overview and Emerging Challenges. In Proceedings of the SIGMOD Conference. San Francisco, USA, pp. 2201–2206, 2016.
CrowdFlower. 2016 Data Science Report. https://visit.figure-eight.com/data-science-report.html, 2016.
Deng, D., Tao, Y., and Li, G. Overlap Set Similarity Joins with Theoretical Guarantees. In Proceedings of the SIGMOD Conference. Houston, USA, pp. 905–920, 2018.
Fier, F., Augsten, N., Bouros, P., Leser, U., and Freytag, J. Set Similarity Joins on MapReduce: An Experimental Survey. Proceedings of the VLDB Endowment 11 (10): 1110–1122, 2018.
Indyk, P. and Motwani, R. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the STOC Symposium. Dallas, USA, pp. 604–613, 1998.
Kaggle. The State of Data Science & Machine Learning. https://www.kaggle.com/kaggle/kaggle-survey-2017, 2017.
Li, G., He, J., Deng, D., and Li, J. Efficient Similarity Join and Search on Multi-Attribute Data. In Proceedings of the SIGMOD Conference. Melbourne, Victoria, Australia, pp. 1137–1151, 2015.
Mann, W., Augsten, N., and Bouros, P. An Empirical Evaluation of Set Similarity Join Techniques. PVLDB 9 (9): 636–647, 2016.
Navarro, G. A Guided Tour to Approximate String Matching. Communications of the ACM 33 (1): 31–88, 2001.
Oliveira, D., Borges, F. F., and Ribeiro, L. A. Uma abordagem para processamento distribuído de junção por similaridade sobre múltiplos atributos. In Proceedings of the Brazilian Symposium on Databases. Uberlãndia, Minas Gerais, Brazil, pp. 300–305, 2017.
Oliveira, D., Borges, F. F., Ribeiro, L. A., and Cuzzocrea, A. Set Similarity Joins with Complex Expressions on Distributed Platforms. In Proceedings of the Symposium on Advances in Databases and Information Systems. Budapest, Hungary, pp. 216–230, 2018.
Ribeiro, L. A., Borges, F. F., and Oliveira, D. A Framework for Set Similarity Join on Multi-Attribute Data. In Proceedings of the Brazilian Symposium on Databases. Porto Alegre, Brazil, pp. 61–72, 2020.
Ribeiro, L. A. and Härder, T. Generalizing Prefix Filtering to Improve Set Similarity Joins. Information Systems 36 (1): 62–78, 2011.
Ribeiro, L. A., Schneider, N. C., de Souza Inácio, A., Wagner, H. M., and von Wangenheim, A. Bridging Database Applications and Declarative Similarity Matching. Journal of Information and Data Management 7 (3): 217–232, 2016.
Ribeiro-Júnior, S., Quirino, R. D., Ribeiro, L. A., and Martins, W. S. Fast Parallel Set Similarity Joins on Many-core Architectures. Journal of Information and Data Management 8 (3): 255–270, 2017.
Sarawagi, S. and Kirpal, A. Efficient Set Joins on Similarity Predicates. In Proceedings of the SIGMOD Conference. Paris, France, pp. 743–754, 2004.
Wang, X., Qin, L., Lin, X., Zhang, Y., and Chang, L. Leveraging Set Relations in Exact Set Similarity Join. Proceedings of the VLDB Endowment 10 (9): 925–936, 2017.
Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G. Efficient Similarity Joins for Near-Duplicate Detection. ACM Transactions on Database Systems 36 (3): 15:1–15:41, 2011.