Approximate Similarity Joins over Dense Vector Embeddings
Resumo
We consider the problem of efficiently answering similarity join queries over vector data generated by machine learning models. Owing to the high dimensionality and density of such vectors, approximate solutions are prevalent for dealing with large datasets. In this context, we investigate how to evaluate similarity joins using the Hierarchical Navigable Small World (HNSW), a state-of-the-art, graph-based index designed for approximate knearest neighbor (kNN) queries. We explore the design space of possible solutions, ranging from alternatives on top of HNSW to deeper integration of similarity join processing into this structure. Experimental results show that our proposal achieves substantial speedups with negligible accuracy loss.
Referências
Aumüller, M., Bernhardsson, E., and Faithfull, A. J. (2020). ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. Information Systems, page 101374.
Bayardo, R. J., Ma, Y., and Srikant, R. (2007). Scaling up All Pairs Similarity Search. In Proceedings of the WWW Conference, page 131–140.
Bernhardsson, E. (2015). Spotify/Annoy: Approximate Nearest Neighbors in c++/python Optimized for Memory usage and Loading/saving to Disk. https://github.com/spotify/annoy.
Christiani, T., Pagh, R., and Sivertsen, J. (2018). Scalable and Robust Set Similarity Join. In Proceedings of the ICDE Conference, pages 1240–1243.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.
Echihabi, K., Zoumpatianos, K., Palpanas, T., and Benbrahim, H. (2019). Return of the Lernaean Hydra: Experimental Evaluation of Data Series Approximate Similarity Search. Proceedings of the VLDB Endowment, page 403–420.
Fan, G., Wang, J., Li, Y., Zhang, D., and Miller, R. J. (2023). Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning. Proceedings of the VLDB Endowment, page 1726–1739.
Guo, R., Sun, P., Lindgren, E., Geng, Q., Simcha, D., Chern, F., and Kumar, S. (2020). Accelerating Large-Scale Inference with Anisotropic Vector Quantization. In Proceedings of the International Conference on Machine Learning, pages 3887–3896.
Johnson, J., Douze, M., and Jégou, H. (2019). Billion-scale Similarity Search with GPUs. IEEE Transactions on Big Data, pages 535–547.
Malkov, Y. A. and Yashunin, D. A. (2020). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 824–836.
Paparrizos, J., Edian, I., Liu, C., Elmore, A. J., and Franklin, M. J. (2022). Fast Adaptive Similarity Search through Variance-Aware Quantization. In Proceedings of the ICDE Conference, pages 2969–2983.
Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 3982–3992.
Ribeiro, L. A. and Härder, T. (2011). Generalizing Prefix Filtering to Improve Set Similarity Joins. Information Systems, pages 62–78.
Santana, D. and Ribeiro, L. (2022). Junções por Similaridade em Espaços Vetoriais Semânticos. In Anais Estendidos do XXXVII Simpósio Brasileiro de Bancos de Dados, pages 147–153.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is All you Need. In Annual Conference on Neural Information Processing Systems, pages 5998–6008.
Wang, J., Yi, X., Guo, R., Jin, H., Xu, P., Li, S., Wang, X., Guo, X., Li, C., Xu, X., Yu, K., Yuan, Y., Zou, Y., Long, J., Cai, Y., Li, Z., Zhang, Z., Mo, Y., Gu, J., Jiang, R., Wei, Y., and Xie, C. (2021). Milvus: A Purpose-Built Vector Data Management System. In Proceedings of the SIGMOD Conference, pages 2614–2627.