Beyond hit-or-miss: a comparative study of synopses for similarity searching
Keywords:Similarity searching, cost model, synopsis, distance distribution
A DBMS optimizer module takes its decisions by modeling the query costs upon the distribution of the data space. Cost modeling of similarity queries, however, requires the representation of distances’ rather than data distributions. Therefore, the finding of a suitable representation (or synopsis) for the distance distribution has a major impact in the optimization of similarity searches. In this study, we evaluate the quality of estimates drawn from five synopses of distinct paradigms regarding two common query criteria. Moreover, we embed the synopses into a new parametric cost model, called Stockpile, for the cost estimation of similarity queries on metric trees. The model uses the synopses estimation for calculating the probability of traversing a metric tree node, which defines the expected number of both disk accesses (I/O costs) and distance calculations (CPU costs). We performed an extensive set of experiments on real-world data sources regarding the estimates of each synopsis (and its parametric variations) by using paired ranking tests. In global terms, three synopses have outperformed their competitors regarding selectivity estimation, whereas two of them have also surpassed the others in the prediction of both I/O and CPU costs with respect to Stockpile model predictions. Additionally, results also indicate the choice of the most suitable synopsis may depend on characteristics of the distance distribution.
Aly, A. M., Aref, W. G., and Ouzzani, M. Cost estimation of spatial k-nearest-neighbor operators. In Int. Conference on Extending Database Technology. Springer, Berlin, pp. 457–468, 2015.
Bedo, M. V. N., Kaster, D. S., Traina, A. J., and Traina Jr., C. The Merkurion approach for similarity searching optimization in Database Management Systems. Data & Knowledge Engineering vol. 113, pp. 18 – 42, 2018.
Bedo, M. V. N., Traina, A. J. M., and Traina Jr., C. A spline-based cost model for metric trees. In Brazilian Symposium on Databases. SBC, SBC, Porto Alegre, pp. 28–39, 2017.
Cha, S.-H. Comprehensive survey on distance/similarity measures between probability density functions. Int. Journal of Mathematical Models and methods in Applied Sciences 1 (2): 8, 2007.
Chen, L., Gao, Y., Li, X., Jensen, C., and Chen, G. Efficient metric indexing for similarity search. In International Conference on Data Engineering. IEEE, IEEE, New York, pp. 591–602, 2015.
Chen, L., Gao, Y., Zheng, B., Jensen, C. S., Yang, H., and Yang, K. Pivot-based metric indexing. Proceedings of the VLDB Endowment 10 (10): 1058–1069, 2017.
Ciaccia, P., Nanni, A., and Patella, M. A query-sensitive cost model for similarity queries with m-tree. In Australasian Database Conference. Springer, Berlin, pp. 65–76, 1999.
Ciaccia, P., Patella, M., and Zezula, P. A cost model for similarity queries in metric spaces. In Symposium on Principles of Database Systems. ACM, New York, pp. 59–68, 1998.
Clauset, A., Shalizi, C. R., and Newman, M. E. Power-law distributions in empirical data. Society for Industrial and Applied Mathematics Review 51 (4): 661–703, 2009.
Cormode, G., Garofalakis, M., Haas, P. J., and Jermaine, C. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases 4 (1–3): 1–294, 2012.
Demsar, J. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research vol. 7, pp. 1–30, 2006.
Garcia, S. and Herrera, F. An extension on “Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research vol. 9, pp. 2677–2694, 2008.
Hetland, M. L. The basic principles of metric indexing. In Swarm Intelligence for Multi-objective Problems in Data Mining. Springer, Berlin, pp. 199–232, 2009.
Ioannidis, Y. The history of histograms (abridged). In Int. Conference on Very large Data Bases. VLDB Endowment, New York, pp. 19–30, 2003.
Lu, Y., Lu, J., Cong, G., Wu, W., and Shahabi, C. Efficient algorithms and cost models for reverse spatial-keyword kNN search. ACM Transactions on Database Systems 39 (2): 13:1–13:46, 2014.
Pestov, V. Indexability, concentration, and VC theory. Journal of Discrete Algorithms vol. 13, pp. 2–18, 2012.
Shekelyan, M., Dignös, A., and Gamper, J. DigitHist: A Histogram-based Data Summary with Tight Error Bounds. Proceedings of VLDB Endowment 10 (11): 1514–1525, 2017.
Tao, Y., Zhang, J., P., D., and Mamoulis, N. An efficient model for optimization of kNN in low and medium dimensional spaces. Transactions on Knowledge and Data Engineering 16 (10): 1169–1184, 2004.
Tasan, M. and Ozsoyoglu, Z. M. Improvements in distance-based indexing. In Scientific and Statistical Database Management. IEEE, New York, pp. 161–170, 2004.
Traina, A. J., Traina Jr., C., Faloutsos, C., and Seeger, B. Fast indexing and visualization of metric data sets using Slim-Trees. Transactions on Knowledge and Data Engineering 14 (2): 244–260, 2002.
Vieira, M. R., Traina Jr., C., Traina, A. J., Arantes, A., and Faloutsos, C. Boosting k-nearest neighbor queries estimating suitable query radii. In Int. Conference on Scientific and Statistical Database Management. IEEE, New York, pp. 10–10, 2007.
Zezula, P., Amato, G., Dohnal, V., and Batko, M. Similarity search: the metric space approach. Vol. 1. Springer, Berlin, 2010.