Canopy-Guided Construction of ANN Search Graphs under Cosine Similarity

  • Rafael F. Pinheiro USP
  • Karla Roberta P. S. Lima USP

Resumo


Research Context: Nearest Neighbor Search (NNS) is a fundamental problem in computing, with applications in recommendation systems, semantic search, information retrieval, and Retrieval-Augmented Generation (RAG) in Large Language Models (LLMs). High-dimensional datasets with millions of vectors make exact search computationally intractable, demanding efficient approximations. Practical Problem: Approximate Nearest Neighbor Search (ANNS) has become the standard approach, with graph-based methods standing out. However, structures such as Hierarchical Navigable Small World (HNSW) graphs impose high construction and update costs, which limits their scalability in real-world scenarios. Proposed Solution: This paper proposes the integration of Canopy Clustering as a divide-and-conquer heuristic to mitigate the computational burden of HNSW graph construction under cosine similarity, enabling more cost-effective ANNS pipelines in information retrieval systems. Related IS Theory: The research aligns with Information Systems theories on efficiency of knowledge retrieval, and the optimization of data-intensive processes in decision-support environments. It builds upon principles of data organization and access structures in IS, particularly the trade-offs between accuracy and computational feasibility in large-scale information retrieval. Research Method: Controlled experiments were conducted to evaluate the applicability of divide-and-conquer and Canopy Clustering prior to HNSW construction. Comparative analyses measured build time, query latency, and recall performance. Summary of Results: Preliminary findings suggest that Canopy Clustering can reduce construction costs while sustaining recall levels adequate for IS applications. Contributions and Impact to IS area: The work introduces a novel heuristic combination to enhance the scalability of graph-based similarity search indexing, providing methodological advances that may extend to other high-dimensional data processing tasks. It offers actionable insights for designing more efficient and viable intelligent information systems.

Referências

Aguerrebere, C., Bhati, I. S., Hildebrand, M., Tepper, M., and Willke, T. (2023). Similarity search in the blink of an eye with compressed indices. Proc. VLDB Endow., 16(11):3433–3446.

Aumüller, M., Bernhardsson, E., and Faithfull, A. (2020). Ann-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Information Systems, 87:101374.

Boscarioli, C., de Araujo, R. M., and Maciel, R. S. P., editors (2017). I GranDSI-BR: Grand Research Challenges in Information Systems in Brazil 2016–2026. Special Committee on Information Systems (CE-SI), Brazilian Computer Society (SBC), Porto Alegre, Brazil.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Fortune, S. (1995). Voronoi diagrams and delaunay triangulations. In Du, D.-Z. and Hwang, F., editors, Computing in Euclidean Geometry, pages 225–265. World Scientific.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA. Curran Associates Inc.

Malkov, Y., Ponomarenko, A., Logvinov, A., and Krylov, V. (2014). Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45:61–68.

Malkov, Y. A. and Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell., 42(4):824–836.

McCallum, A., Nigam, K., and Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’00, page 169–178, New York, NY, USA. Association for Computing Machinery.

Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Inui, K., Jiang, J., Ng, V., and Wan, X., editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Toussaint, G. T. (1980). The relative neighbourhood graph of a finite planar set. Pattern Recognition, 12(4):261–268.

Wang, M., Xu, X., Yue, Q., and Wang, Y. (2021). A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. Proc. VLDB Endow., 14(11):1964–1978.
Publicado
25/05/2026
PINHEIRO, Rafael F.; LIMA, Karla Roberta P. S.. Canopy-Guided Construction of ANN Search Graphs under Cosine Similarity. In: SIMPÓSIO BRASILEIRO DE SISTEMAS DE INFORMAÇÃO (SBSI), 22. , 2026, Vitória/ES. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026 . p. 271-290. DOI: https://doi.org/10.5753/sbsi.2026.248343.