Similarity Searches with Result Diversification Supported by Metric Indexes

  • Daniel L. Jasbick Fluminense Federal University
  • Daniel de Oliveira Fluminense Federal University
  • Marcos V. N. Bedo Fluminense Federal University

Abstract


Result diversification enriches k-Nearest Neighbor (k-NN) queries by retrieving objects dissimilar among themselves. Our first contribution, the diversity browsing routine, extends the incremental k-NN distance browsing routing, ensuring result diversification through Influence criteria. Our second contribution leverages diversity browsing to address the search of high-dimensional datasets from the perspective of Local Intrinsic Dimensionality (LID). Our experimental investigation revealed (i) result diversification is constrained by both LID and Influence-based partition principle, (ii) diversity browsing produces query-based manifolds with lower distance concentration regarding the original dataset, (iii) tuned indexes speed up diversity browsing in low/medium LID folds, and (iv) the performance gap between k-NN with and without diversification decreases with LID, with the latter offering richer results.

Keywords: Similarity search, Result diversification, Metric spaces, Local intrinsic dimensionality, Distance concentration

References

Amsaleg, L., Chelly, O., Furon, T., Girard, S., Houle, M., Kawarabayashi, K., and Nett, M. (2015). Estimating local intrinsic dimensionality. Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD), pages 29–38. ACM.

Amsaleg, L., Chelly, O., Houle, M., Kawarabayashi, K.-I., Radovanovic, M., and Treeratanajaru, W. (2019). Intrinsic dimensionality estimation within tight localities. International Conference on Data Mining (SDM), pages 181–189. SIAM.

Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. (1999). When is “nearest neighbor” meaningful? International Conference on Database Theory (ICDT), pages 217– 235. Springer.

Chávez, E., Navarro, G., Baeza-Yates, R., and Marroquín, J. (2001). Searching in metric spaces. Computing Surveys, volume 33, pages 273–321. ACM.

Drosou, M., Jagadish, H., Pitoura, E., and Stoyanovich, J. (2017). Diversity in big data: A review. Big data, 5(2):73–84.

Francois, D., Wertz, V., and Verleysen, M. (2007). The concentration of fractional distances. Transactions on Knowledge and Data Engineering (TKDE), volume 19, pages 873–886. IEEE.

Hetland, M. L. (2009). The basic principles of metric indexing. Swarm intelligence for multi-objective problems in data mining, pages 199–232. Springer.

Hjaltason, G. and Samet, H. (2003). Index-driven similarity search in metric spaces. Transactions on Database Systems (TODS), 28(4):517–580.

Jasbick, D., Santos, L., Azevedo-Marques, P. M., Traina, A. J., de Oliveira, D., and Bedo, M. (2023). Pushing diversity into higher dimensions: The lid effect on diversified similarity searching. Information Systems, 114:102166.

Jasbick, D., Santos, L., de Oliveira, D., and Bedo, M. (2020). Some branches may bear rotten fruits: Diversity browsing vp-trees. Similarity Search and Applications (SISAP), pages 140–154. Springer.

Kucuktunc, O. and Ferhatosmanoglu, H. (2011). λ-diverse nearest neighbors browsing for multidimensional data. Transactions on Knowledge and Data Engineering (TKDE), volume 25, pages 481–493. IEEE.

Lopes, C., Santos, L., Jasbick, D., de Oliveira, D., and Bedo, M. (2021). An empirical assessment of quality metrics for diversified similarity searching. Journal of Information and Data Management (JIDM), 12(3).

Pestov, V. (2013). Lower bounds on performance of metric tree indexing schemes for exact similarity search in high dimensions. Algorithmica, 66(2):310–328.

Santos, L., Blanco, G., Oliveira, D., Traina, A., Traina Jr, C., and Bedo, M. (2018). Exploring Diversified Similarity with Kundaha. Conference on Information and Knowledge Management (CIKM), pages 1903–1906. ACM.

Santos, L., Oliveira, W., Ferreira, M., Traina, A., and Traina Jr, C. (2013). Parameter-free and domain-independent similarity search with diversity. International Conference on Scientific and Statistical Database Management (SSDBM), pages 1–12.

Vieira, M., Razente, H., Barioni, M., Hadjieleftheriou, M., Srivastava, D., Traina Jr., C., and Tsotras, V. (2011). On query result diversification. International Conference on Data Engineering (ICDE), pages 1163–1174. IEEE.
Published
2023-09-25
JASBICK, Daniel L.; DE OLIVEIRA, Daniel; BEDO, Marcos V. N.. Similarity Searches with Result Diversification Supported by Metric Indexes. In: THESIS AND DISSERTATION CONTEST (CTDBD) - BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 38. , 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023 . p. 198-212. DOI: https://doi.org/10.5753/sbbd_estendido.2023.231573.