Adding support for result diversification in HNSW indexes considering low and high dimensional spaces
Abstract
Hierarchical Navigable Small World (HNSW) indexes provide state-of-the-art performances for approximate k-nearest neighbor (kNN) queries. However, binding the approximate search quality with the characterization of the HNSW construction heuristic remains an open issue. This article investigates if result diversification can bridge this gap by discussing a new HNSW construction strategy based on a local, diversification-driven partitioning principle. Accordingly, we extend the HNSW kNN search algorithm to support result diversification. Experimental evaluations on ANN-Benchmark showed while diversification-based partitioning improves the search Recall, standard construction still yields higher throughput. We examined this trade-off through the lens of Local Intrinsic Dimensionality (LID), stratifying the datasets into quartiles. This evaluation indicated the throughput difference shrinks with the LID, with the diversification-based construction yielding a higher Recall in every LID-based manifold. Such outcomes suggest that HNSW long edges’ tuning may depend on the LID manifold.
Keywords:
HNSW, kNN, Approximate kNN Queries, High Dimensionality, Local Intrinsic Dimensionality
References
Amsaleg, L., Chelly, O., Furon, T., Girard, S., Houle, M., Kawarabayashi, K.-i., and Nett, M. (2018). Extreme-value-theoretic estimation of local intrinsic dimensionality. DMKD, 32(6):1768–1805.
Amsaleg, L., Chelly, O., Houle, M., Kawarabayashi, K., Radovanović, M., and Treratanajaru, W. (2019). Intrinsic dimensionality estimation within tight localities. In ICDM.
Aumüller, M., Bernhardsson, E., and Faithfull, A. (2020). Ann-benchmarks: A bench-marking tool for approximate nearest neighbor algorithms. Info. Sys., 87:101374.
Aumüller, M. and Ceccarello, M. (2021). The role of local dimensionality measures in benchmarking nearest neighbor search. Info. Sys., 101:101807.
Drosou, M., Jagadish, H., Pitoura, E., and Stoyanovich, J. (2017). Diversity in big data: A review. Big Data, 5:73–84.
He, J., Kumar, S., and Chang, S.-F. (2012). On the difficulty of nearest neighbor search. In ICML, pages 41–48.
Houle, M. (2013). Dimensionality, discriminability, density and distance distributions. In ICDM, pages 468–473. IEEE.
Jasbick, D., Dutra Santos, L., de Oliveira, D., and Bedo, M. (2020). Some branches may bear rotten fruits: Diversity browsing vp-trees. In SISAP, pages 140–154. Springer.
Jasbick, D., Santos, L., Azevedo-Marques, P., Traina, A., de Oliveira, D., and Bedo, M. (2023). Pushing diversity into higher dimensions: The LID effect on diversified similarity searching. Info. Sys., 114:102166.
Kucuktunc, O. and Ferhatosmanoglu, H. (2013). λ-diverse nearest neighbors browsing for multidimensional data. TKDE, 25(3):481–493.
Li, L., Xu, J., Li, Y., and Cai, J. (2021). Hctree+: A workload-guided index for approximate knn search. Info. Sc., 581:876–890.
Malkov, Y. and Yashunin, D. (2016). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. TPAMI, PP.
Peng, Z., Zhang, M., Li, K., Jin, R., and Ren, B. (2022). Speed-ann: Low-latency and high-accuracy nearest neighbor search via intra-query parallelism.
Santana, D. and Ribeiro, L. (2023). Approximate similarity joins over dense vector embeddings. In SBBD, pages 51–62. SBC.
Santos, L., Oliveira, W., Ferreira, M., Traina, A., and Traina Jr, C. (2013). Parameter-free and domain-independent similarity search with diversity. In SSDBM, pages 1–12.
Shimomura, L. C., Oyamada, R. S., Vieira, M. R., and Kaster, D. S. (2021). A survey on graph-based methods for similarity searches in metric spaces. Info. Sys., 95:101507.
Volnyansky, I. and Pestov, V. (2009). Curse of dimensionality in pivot based indexes. In SISAP, pages 39–46. IEEE.
Wang, M., Xu, X., Yue, Q., and Wang, Y. (2021). A comprehensive survey and experimental comparison of graph-based approximate nn search. PVLDB, 14(11):1964–1978.
Xian, J., Teofili, T., Pradeep, R., and Lin, J. (2024). Vector search with OpenAI embeddings: Lucene is all you need. In ICWSDM, pages 1090–1093.
Amsaleg, L., Chelly, O., Houle, M., Kawarabayashi, K., Radovanović, M., and Treratanajaru, W. (2019). Intrinsic dimensionality estimation within tight localities. In ICDM.
Aumüller, M., Bernhardsson, E., and Faithfull, A. (2020). Ann-benchmarks: A bench-marking tool for approximate nearest neighbor algorithms. Info. Sys., 87:101374.
Aumüller, M. and Ceccarello, M. (2021). The role of local dimensionality measures in benchmarking nearest neighbor search. Info. Sys., 101:101807.
Drosou, M., Jagadish, H., Pitoura, E., and Stoyanovich, J. (2017). Diversity in big data: A review. Big Data, 5:73–84.
He, J., Kumar, S., and Chang, S.-F. (2012). On the difficulty of nearest neighbor search. In ICML, pages 41–48.
Houle, M. (2013). Dimensionality, discriminability, density and distance distributions. In ICDM, pages 468–473. IEEE.
Jasbick, D., Dutra Santos, L., de Oliveira, D., and Bedo, M. (2020). Some branches may bear rotten fruits: Diversity browsing vp-trees. In SISAP, pages 140–154. Springer.
Jasbick, D., Santos, L., Azevedo-Marques, P., Traina, A., de Oliveira, D., and Bedo, M. (2023). Pushing diversity into higher dimensions: The LID effect on diversified similarity searching. Info. Sys., 114:102166.
Kucuktunc, O. and Ferhatosmanoglu, H. (2013). λ-diverse nearest neighbors browsing for multidimensional data. TKDE, 25(3):481–493.
Li, L., Xu, J., Li, Y., and Cai, J. (2021). Hctree+: A workload-guided index for approximate knn search. Info. Sc., 581:876–890.
Malkov, Y. and Yashunin, D. (2016). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. TPAMI, PP.
Peng, Z., Zhang, M., Li, K., Jin, R., and Ren, B. (2022). Speed-ann: Low-latency and high-accuracy nearest neighbor search via intra-query parallelism.
Santana, D. and Ribeiro, L. (2023). Approximate similarity joins over dense vector embeddings. In SBBD, pages 51–62. SBC.
Santos, L., Oliveira, W., Ferreira, M., Traina, A., and Traina Jr, C. (2013). Parameter-free and domain-independent similarity search with diversity. In SSDBM, pages 1–12.
Shimomura, L. C., Oyamada, R. S., Vieira, M. R., and Kaster, D. S. (2021). A survey on graph-based methods for similarity searches in metric spaces. Info. Sys., 95:101507.
Volnyansky, I. and Pestov, V. (2009). Curse of dimensionality in pivot based indexes. In SISAP, pages 39–46. IEEE.
Wang, M., Xu, X., Yue, Q., and Wang, Y. (2021). A comprehensive survey and experimental comparison of graph-based approximate nn search. PVLDB, 14(11):1964–1978.
Xian, J., Teofili, T., Pradeep, R., and Lin, J. (2024). Vector search with OpenAI embeddings: Lucene is all you need. In ICWSDM, pages 1090–1093.
Published
2024-10-14
How to Cite
WEBER, Mauro; SILVA-LEITE, João; SANTOS, Lúcio F. D.; DE OLIVEIRA, Daniel; BEDO, Marcos.
Adding support for result diversification in HNSW indexes considering low and high dimensional spaces. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 39. , 2024, Florianópolis/SC.
Anais [...].
Porto Alegre: Sociedade Brasileira de Computação,
2024
.
p. 14-26.
ISSN 2763-8979.
DOI: https://doi.org/10.5753/sbbd.2024.240618.
