Similarity Grouping by Influence: Exploring Result Diversification in Similarity Group-by Operators


The group-by operator groups tuples that share the same values in specified attributes and then extracts summaries from each group. However, much of the data stored by modern applications is best queried not by equality but by similarity, which raises questions such as "How can we obtain groups such that each one contains the k most similar tuples?" or "How can we include diversity in the results?". In this paper, we present a binary grouping operator focused on diversified similarity comparisons that is able to answer such questions. We define the operator algebraically and show that it enables grouping operations over complex attributes, such as multidimensional data. We provide an algorithm, called Similarity Grouping by Influence (SGIa), to implement the binary operator. An experimental evaluation performed on real data shows that SGIa meets real application needs in a timely manner with significant results.
Keywords: Group-by operator, Similarity Search, Result Diversification


OLIVEIRA, Willian D.; LAUTON, Anna J. C.; TRAINA JR., Caetano; SANTOS, Lucio F. D. Similarity Grouping by Influence: Exploring Result Diversification in Similarity Group-by Operators. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 38., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 402-407. ISSN 2763-8979. DOI: