A Spline-based Cost Model for Metric Trees
Resumo
Whenever two (or more) access methods are alternatives for the execution of a query, how to choose which one is the best for the task? Such a decision is made by the DBMS optimizer module, which models the query costs according to the distribution of the data space. Cost modeling of similarity searches, however, requires the representation of distances’ rather than data distribution. In this paper, we propose the Stockpile model for cost estimation of similarity queries on metric trees by using pivot-based distance histograms that represent the local densities around the query elements. By combining the local densities to the probability of traversing the tree nodes, Stockpile provides a fair estimation of both disk accesses (I/O costs) and distance calculations (CPU costs). We compared Stockpile and two literature models regarding similarity queries in real-world data sources and our model was up to 85% more precise than the competitors.
Palavras-chave:
Similarity Search, Cost Estimation, Metric Trees
Referências
Aly, A. M., Aref, W. G., and Ouzzani, M. (2015). Cost estimation of spatial k-nearest-neighbor operators. In EDBT, pages 457–468.
Baioco, G. B., Traina, A. J. M., and Traina Jr., C. (2007). MAMCost: Global and Local Estimates leading to Robust Cost Estimation. In SSDBM, pages 1:1–1:6.
Bedo, M. V. N., Kaster, D. S., Traina, A. J. M., and Traina Jr., C. (2015). CDH: a novel structure to boost k-nearest neighbor queries. In SSDBM, pages 36:1–36:6.
Bustos, B., Navarro, G., and Chávez, E. (2003). Pivot selection techniques for proximity searching in metric spaces. PRL, 24(14):2357–2366.
Ciaccia, P., Patella, M., and Zezula, P. (1997). M-tree: An efficient access method for similarity search in metric spaces. In VLDB, pages 426–435.
Ciaccia, P., Patella, M., and Zezula, P. (1998). A Cost Model for Similarity Queries in Metric Spaces. In PODS, pages 59–68.
Hjaltason, G. R. and Samet, H. (2003). Index-driven similarity search in metric spaces. TDS, 28(1):517–580.
Ioannidis, Y. (2003). The history of histograms (abridged). In VLDB, pages 19–30.
Kaufman, L. and Rousseeuw, P. (1987). Clustering by means of medoids. North-Holland.
König, A. C. and Weikum, G. (2002). A framework for the physical design problem for data synopses. In EDBT, pages 627–645.
Korn, F., Pagel, B., and Faloutsos, C. (2001). On the ‘Dimensionality Curse’ and the ‘Self-Similarity Blessing’. TKDE, 13(1):96–111.
Lokoč, J. (2010). Tree-based indexing methods for similarity search in metric and non-metric spaces. PhD thesis, Charles University, Ovocný trh 3-5, 116 36 Praha.
Lu, Y., Lu, J., Cong, G., Wu, W., and Shahabi, C. (2014). Efficient algorithms and cost models for reverse spatial-keyword kNN search. TDS, 39(2):13:1–13:46.
Navarro, G., Paredes, R., Reyes, N., and Bustos, C. (2017). An empirical evaluation of intrinsic dimension estimators. IS, 64:206–218.
Skopal, T., Pokorný, J., and Snásel, V. (2004). PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases. In ADBIS, pages 1 – 16.
Tao, Y., Zhang, J., P., D., and Mamoulis, N. (2004). An efficient model for optimization of kNN in low and medium dimensional spaces. TKDE, 16(10):1169–1184.
Tasan, M. and Özsoyoglu, Z. M. (2004). Improvements in distance-based indexing. In SSDBM, pages 161–170.
Traina Jr., C., Traina, A. J. M., Faloutsos, C., and Seeger, B. (2002). Fast indexing and visualization of metric data sets using slim-trees. TKDE, 14(2):244–260.
Vieira, M. R., Traina Jr., C., Traina, A. J. M., Arantes, A. S., and Faloutsos, C. (2007). Boosting kNN queries estimating suitable query radii. In SSDBM, pages 1:1 – 1:10.
Zezula, P., Amato, G., Dohnal, V., and Batko, M. (2006). Similarity Search - The Metric Space Approach, volume 32. Kluwer.
Baioco, G. B., Traina, A. J. M., and Traina Jr., C. (2007). MAMCost: Global and Local Estimates leading to Robust Cost Estimation. In SSDBM, pages 1:1–1:6.
Bedo, M. V. N., Kaster, D. S., Traina, A. J. M., and Traina Jr., C. (2015). CDH: a novel structure to boost k-nearest neighbor queries. In SSDBM, pages 36:1–36:6.
Bustos, B., Navarro, G., and Chávez, E. (2003). Pivot selection techniques for proximity searching in metric spaces. PRL, 24(14):2357–2366.
Ciaccia, P., Patella, M., and Zezula, P. (1997). M-tree: An efficient access method for similarity search in metric spaces. In VLDB, pages 426–435.
Ciaccia, P., Patella, M., and Zezula, P. (1998). A Cost Model for Similarity Queries in Metric Spaces. In PODS, pages 59–68.
Hjaltason, G. R. and Samet, H. (2003). Index-driven similarity search in metric spaces. TDS, 28(1):517–580.
Ioannidis, Y. (2003). The history of histograms (abridged). In VLDB, pages 19–30.
Kaufman, L. and Rousseeuw, P. (1987). Clustering by means of medoids. North-Holland.
König, A. C. and Weikum, G. (2002). A framework for the physical design problem for data synopses. In EDBT, pages 627–645.
Korn, F., Pagel, B., and Faloutsos, C. (2001). On the ‘Dimensionality Curse’ and the ‘Self-Similarity Blessing’. TKDE, 13(1):96–111.
Lokoč, J. (2010). Tree-based indexing methods for similarity search in metric and non-metric spaces. PhD thesis, Charles University, Ovocný trh 3-5, 116 36 Praha.
Lu, Y., Lu, J., Cong, G., Wu, W., and Shahabi, C. (2014). Efficient algorithms and cost models for reverse spatial-keyword kNN search. TDS, 39(2):13:1–13:46.
Navarro, G., Paredes, R., Reyes, N., and Bustos, C. (2017). An empirical evaluation of intrinsic dimension estimators. IS, 64:206–218.
Skopal, T., Pokorný, J., and Snásel, V. (2004). PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases. In ADBIS, pages 1 – 16.
Tao, Y., Zhang, J., P., D., and Mamoulis, N. (2004). An efficient model for optimization of kNN in low and medium dimensional spaces. TKDE, 16(10):1169–1184.
Tasan, M. and Özsoyoglu, Z. M. (2004). Improvements in distance-based indexing. In SSDBM, pages 161–170.
Traina Jr., C., Traina, A. J. M., Faloutsos, C., and Seeger, B. (2002). Fast indexing and visualization of metric data sets using slim-trees. TKDE, 14(2):244–260.
Vieira, M. R., Traina Jr., C., Traina, A. J. M., Arantes, A. S., and Faloutsos, C. (2007). Boosting kNN queries estimating suitable query radii. In SSDBM, pages 1:1 – 1:10.
Zezula, P., Amato, G., Dohnal, V., and Batko, M. (2006). Similarity Search - The Metric Space Approach, volume 32. Kluwer.
Publicado
02/10/2017
Como Citar
BEDO, Marcos V. N.; TRAINA, Agma J. M.; TRAINA JR., Caetano.
A Spline-based Cost Model for Metric Trees. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 32. , 2017, Uberlândia/MG.
Anais [...].
Porto Alegre: Sociedade Brasileira de Computação,
2017
.
p. 28-39.
ISSN 2763-8979.
DOI: https://doi.org/10.5753/sbbd.2017.171404.