An Approach to Distributed Similarity Join Processing on Multiple Attributes
Abstract
Similarity join is a fundamental operation in data integration. Most existing algorithms consider single-attribute data. However, real data is typically multi-attribute. Besides requiring more complex similarity expressions, this type of data is larger and, therefore, processing cost on a single machine can be prohibitively expensive. This paper presents a distributed similarity join algorithm on multi-attribute data using Spark. Initial experimental results show that the proposed approach is efficient and scalable.
Keywords:
Similarity Join, Spark
References
Chaudhuri, S., Ganti, V., and Kaushik, R. (2006). A Primitive Operator for Similarity Joins in Data Cleaning. In ICDE, page 5.
Deng, D., Li, G., Hao, S., Wang, J., and Feng, J. (2014). MassJoin: A Mapreduce-based Method for Scalable String Similarity Joins. In ICDE, pages 340–351.
Li, G., He, J., Deng, D., and Li, J. (2015). Efficient Similarity Join and Search on Multi-Attribute Data. In SIGMOD, pages 1137–1151.
Ribeiro, L. A. and Härder, T. (2011). Generalizing Prefix Filtering to Improve Set Similarity Joins. Information Systems, 36(1):62–78.
Sidney, C. F., Mendes, D. S., Ribeiro, L. A., and Härder, T. (2015). Performance Prediction for Set Similarity Joins. In SAC, pages 967–972.
Vernica, R., Carey, M. J., and Li, C. (2010). Efficient Parallel Set-similarity Joins using MapReduce. In SIGMOD, pages 495–506.
Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., Meng, X., Rosen, J., Venkataraman, S., Franklin, M. J., Ghodsi, A., Gonzalez, J., Shenker, S., and Stoica, I. (2016). Apache Spark: a Unified Engine for Big Data Processing. Communications of the ACM, 59(11):56–65.
Published
2017-10-02
How to Cite
OLIVEIRA, Diego Junior do Carmo; BORGES, Felipe Ferreira; RIBEIRO, Leonardo Andrade. An Approach to Distributed Similarity Join Processing on Multiple Attributes. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 32., 2017, Uberlândia/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2017. p. 300-305. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2017.174658.
