Quality metrics for diversified similarity searching: What they stand for?

Camila L. Lopes; Daniel L. Jasbick; Marcos Bedo; Lúcio F.D. Santos

doi:10.5753/sbbd.2020.13620

Camila L. Lopes, Instituto Federal do Norte de Minas Gerais
Daniel L. Jasbick, Universidade Federal Fluminense
Marcos Bedo, Universidade Federal Fluminense
Lúcio F.D. Santos, Instituto Federal do Norte de Minas Gerais

Abstract

Diversity-oriented searches retrieve objects that are not only similar to a reference element but also related to the different types of collections within the queried dataset. While this characterization is flexible enough to bring methods from information retrieval, data clustering, and similarity searching under the same umbrella, diversity metrics are expected to be much less paradigm-biased so that they can discriminate which approaches are more suitable and when they should be applied. Accordingly, we extend and implement a broad set of quality metrics from those distinct realms and experimentally discuss their trends and limitations. In particular, we evaluate how well data clustering indexes and similarity-driven measures adhere to diversified similarity searching. Experiments on real-world datasets indicate that such measures can distinguish diversity methods from different paradigms, but they heavily favor approaches from their own group, especially cluster indexes. As an alternative, we argue that diversity is better addressed by a set of measures than by a single quality value. Therefore, we propose the Diversity Features Model (DFM), which combines the perspectives of the competing approaches into a multidimensional point whose features are calculated from the distance distributions within both the retrieved and the queried datasets. Empirical evaluations show that DFM compares different diversity searching approaches under multiple criteria, and that overall winners can be found by ranking aggregation or visualized through parallel coordinates maps.
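To make the idea of a multidimensional quality point concrete, the following is a minimal sketch only. The abstract does not list the actual DFM features, so the three features below (`closeness`, `min_separation`, `mean_separation`) are hypothetical stand-ins computed from Euclidean distance distributions, and the Borda count is one simple ranking-aggregation scheme chosen for illustration, not necessarily the one used in the paper.

```python
from itertools import combinations
from math import dist
from statistics import mean


def pairwise_distances(points):
    """Euclidean distances between every unordered pair of points."""
    return [dist(a, b) for a, b in combinations(points, 2)]


def feature_vector(query, result, dataset):
    """Describe a result set by several distance-based features.

    The features are illustrative assumptions, NOT the DFM definition.
    Each one is oriented so that a higher value is better.
    """
    result_pairs = pairwise_distances(result)
    dataset_pairs = pairwise_distances(dataset)
    scale = max(dataset_pairs)  # largest distance in the queried dataset
    return {
        # How close the retrieved objects are to the query (similarity).
        "closeness": 1 - mean(dist(query, r) for r in result) / scale,
        # Smallest gap between two retrieved objects (diversity).
        "min_separation": min(result_pairs) / scale,
        # Average gap in the result relative to the dataset average.
        "mean_separation": mean(result_pairs) / mean(dataset_pairs),
    }


def borda_ranking(scores):
    """Aggregate per-feature rankings with a Borda count.

    scores maps a method name to its feature dictionary. A method earns
    one point per competitor it beats on each feature; the best total wins.
    """
    methods = list(scores)
    features = next(iter(scores.values())).keys()
    points = {m: 0 for m in methods}
    for feature in features:
        ordered = sorted(methods, key=lambda m: scores[m][feature])
        for rank, method in enumerate(ordered):
            points[method] += rank
    return sorted(methods, key=lambda m: points[m], reverse=True)
```

Calling `feature_vector` once per search method and passing the resulting dictionaries to `borda_ranking` yields a single ordering, while the individual feature values remain available for a parallel coordinates plot.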

Keywords: quality, metrics, similarity search, algorithms
