Similarity Searches with Result Diversification Supported by Metric Indexes
Abstract
Result diversification enriches k-Nearest Neighbor (k-NN) queries by retrieving objects dissimilar among themselves. Our first contribution, the diversity browsing routine, extends the incremental k-NN distance browsing routing, ensuring result diversification through Influence criteria. Our second contribution leverages diversity browsing to address the search of high-dimensional datasets from the perspective of Local Intrinsic Dimensionality (LID). Our experimental investigation revealed (i) result diversification is constrained by both LID and Influence-based partition principle, (ii) diversity browsing produces query-based manifolds with lower distance concentration regarding the original dataset, (iii) tuned indexes speed up diversity browsing in low/medium LID folds, and (iv) the performance gap between k-NN with and without diversification decreases with LID, with the latter offering richer results.
References
Amsaleg, L., Chelly, O., Houle, M., Kawarabayashi, K.-I., Radovanovic, M., and Treeratanajaru, W. (2019). Intrinsic dimensionality estimation within tight localities. International Conference on Data Mining (SDM), pages 181–189. SIAM.
Beyer, K., Goldstein, J., Ramakrishnan, R., and Shaft, U. (1999). When is “nearest neighbor” meaningful? International Conference on Database Theory (ICDT), pages 217– 235. Springer.
Chávez, E., Navarro, G., Baeza-Yates, R., and Marroquín, J. (2001). Searching in metric spaces. Computing Surveys, volume 33, pages 273–321. ACM.
Drosou, M., Jagadish, H., Pitoura, E., and Stoyanovich, J. (2017). Diversity in big data: A review. Big data, 5(2):73–84.
Francois, D., Wertz, V., and Verleysen, M. (2007). The concentration of fractional distances. Transactions on Knowledge and Data Engineering (TKDE), volume 19, pages 873–886. IEEE.
Hetland, M. L. (2009). The basic principles of metric indexing. Swarm intelligence for multi-objective problems in data mining, pages 199–232. Springer.
Hjaltason, G. and Samet, H. (2003). Index-driven similarity search in metric spaces. Transactions on Database Systems (TODS), 28(4):517–580.
Jasbick, D., Santos, L., Azevedo-Marques, P. M., Traina, A. J., de Oliveira, D., and Bedo, M. (2023). Pushing diversity into higher dimensions: The lid effect on diversified similarity searching. Information Systems, 114:102166.
Jasbick, D., Santos, L., de Oliveira, D., and Bedo, M. (2020). Some branches may bear rotten fruits: Diversity browsing vp-trees. Similarity Search and Applications (SISAP), pages 140–154. Springer.
Kucuktunc, O. and Ferhatosmanoglu, H. (2011). λ-diverse nearest neighbors browsing for multidimensional data. Transactions on Knowledge and Data Engineering (TKDE), volume 25, pages 481–493. IEEE.
Lopes, C., Santos, L., Jasbick, D., de Oliveira, D., and Bedo, M. (2021). An empirical assessment of quality metrics for diversified similarity searching. Journal of Information and Data Management (JIDM), 12(3).
Pestov, V. (2013). Lower bounds on performance of metric tree indexing schemes for exact similarity search in high dimensions. Algorithmica, 66(2):310–328.
Santos, L., Blanco, G., Oliveira, D., Traina, A., Traina Jr, C., and Bedo, M. (2018). Exploring Diversified Similarity with Kundaha. Conference on Information and Knowledge Management (CIKM), pages 1903–1906. ACM.
Santos, L., Oliveira, W., Ferreira, M., Traina, A., and Traina Jr, C. (2013). Parameter-free and domain-independent similarity search with diversity. International Conference on Scientific and Statistical Database Management (SSDBM), pages 1–12.
Vieira, M., Razente, H., Barioni, M., Hadjieleftheriou, M., Srivastava, D., Traina Jr., C., and Tsotras, V. (2011). On query result diversification. International Conference on Data Engineering (ICDE), pages 1163–1174. IEEE.
