A Framework for Set Similarity Join on Multi-Attribute Data
Set similarity join, which finds all pairs of similar sets in a collection, plays an important role in data cleaning and integration. Many algorithms have been proposed to efficiently answer set similarity join on single-attribute data. However, real-world data often contain multiple attributes. In this paper, we propose a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then present a simple, yet effective filter based on lightweight indexes, for which exact and probabilistic implementation alternatives are evaluated. Finally, we devise a cost model to identify the best attribute ordering to reduce processing time. Our experimental results show that our approach is effective and significantly outperforms previous work.
Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426.
Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A Primitive Operator for SimilarityJoins in Data Cleaning. In Proceedings of the ICDE Conference, page 5.
Chu, X., Ilyas, I. F., Krishnan, S., and Wang, J. (2016). Data Cleaning: Overview andEmerging Challenges. In Proceedings of the SIGMOD Conference, pages 2201–2206.
CrowdFlower (2016). 2016 Data Science Report. https://visit.figure-eight.com/data-science-report.html.
do Carmo Oliveira, D. J., Borges, F. F., and Ribeiro, L. A. (2017). Uma abordagem para processamento distribuído de junção por similaridade sobre múltiplos atributos. In Proceedings of the Brazilian Symposium on Databases, pages 300–305.
Kaggle(2017).The State of Data Science & Machine Learning. https://www.kaggle.com/kaggle/kaggle-survey-2017.
Li, G., He, J., Deng, D., and Li, J. (2015). Efficient Similarity Join and Search on Multi-Attribute Data. In Proceedings of the SIGMOD Conference, pages 1137–1151.
Mann, W., Augsten, N., and Bouros, P. (2016). An Empirical Evaluation of Set SimilarityJoin Techniques. PVLDB, 9(9):636–647.
Oliveira, D. J. C., Borges, F. F., Ribeiro, L. A., and Cuzzocrea, A. (2018). Set Similar-ity Joins with Complex Expressions on Distributed Platforms. In Proceedings of theSymposium on Advances in Databases and Information Systems, pages 216–230.
Ribeiro, L. A. and Härder, T. (2011). Generalizing Prefix Filtering to Improve Set Similarity Joins. Information Systems, 36(1):62–78.
Ribeiro, L. A., Schneider, N. C., de Souza Inácio, A., Wagner, H. M., and von Wangenheim, A. (2016). Bridging Database Applications and Declarative Similarity Matching. Journal of Information and Data Management, 7(3):217–232.
Ribeiro-Júnior, S., Quirino, R. D., Ribeiro, L. A., and Martins, W. S. (2017). Fast Parallel Set Similarity Joins on Many-core Architectures. Journal of Information and Data Management, 8(3):255–270.
Wang, X., Qin, L., Lin, X., Zhang, Y., and Chang, L. (2017). Leveraging Set Relations in Exact Set Similarity Join. Proceedings of the VLDB Endowment, 10(9):925–936.
Xiao, C., Wang, W., Lin, X., Yu, J. X., and Wang, G. (2011). Efficient Similarity Joins for Near-Duplicate Detection. ACM Transactions on Database Systems, 36(3):15:1–15:41.