Compact metric suffix array
Abstract
In this paper we present a compact data structure for the Metric Suffix Array (MSA), a permutation-based index for similarity queries. The proposed data structure uses variable-length integer encoding to store the values in the MSA using less memory, which are decoded during the queries. The compact structure is built directly from the data, where it is not necessary to compute the complete MSA beforehand to obtain its compact representation, which allows the indexing of larger volumes of data. In experiments with high-dimensional features extracted from 20 million images, our compact representation used 33% of the original method's total memory, increasing the query execution time by a factor of 3.4.
Keywords:
compact data structure, permutation-based index, similarity query, metric suffix array, algorithms
References
Anh, V. N. and Moffat, A. (2005). Inverted index compression using word-aligned binary codes. Inf. Retr., 8(1):151–166. doi:10.1023/B:INRT.0000048490.99518.5c.
Bolettieri, P., Esuli, A., Falchi, F., Lucchese, C., Perego, Piccioli, and Rabitti (2009). CoPhIR: a test collection for content-based image retrieval. CoRR, abs/0905.4627.
Chávez, E., Figueroa, K., and Navarro, G. (2008). Effective proximity retrieval by ordering permutations. IEEE Trans. Pattern Anal. Mach. Intell., 30(9):1647–1658.
Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory, 21(2):194–203. doi:10.1109/TIT.1975.1055349.
Fagin, R., Kumar, R., and Sivakumar, D. (2003). Comparing top k lists. SIAM J. Discret. Math., 17(1):134–160. doi:10.1137/S0895480102412856.
Jégou, H., Tavenard, R., Douze, M., and Amsaleg, L. (2011). Searching in one billion vectors: Re-rank with source coding. In IEEE Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP), pages 861–864, Prague. doi:10.1109/ICASSP.2011.5946540.
Lew, M. S., Sebe, N., Djeraba, C., and Jain, R. C. (2006). Content-based multimedia information retrieval: State of the art and challenges. ACM Trans. Multim. Comput. Commun. Appl., 2(1):1–19. doi:10.1145/1126004.1126005.
Manber, U. and Myers, E. W. (1993). Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948. doi:10.1137/0222058.
Mohamed, H. and Marchand-Maillet, S. (2013). Metric suffix array for large-scale similarity search. In ACM WSDM 2013 Workshop on Large Scale and Distributed Systems for Information Retrieval, Rome, IT, pages 1–6.
Navarro, G. (2016). Compact Data Structures - A Practical Approach. Cambridge University Press.
Bolettieri, P., Esuli, A., Falchi, F., Lucchese, C., Perego, Piccioli, and Rabitti (2009). CoPhIR: a test collection for content-based image retrieval. CoRR, abs/0905.4627.
Chávez, E., Figueroa, K., and Navarro, G. (2008). Effective proximity retrieval by ordering permutations. IEEE Trans. Pattern Anal. Mach. Intell., 30(9):1647–1658.
Elias, P. (1975). Universal codeword sets and representations of the integers. IEEE Trans. Inf. Theory, 21(2):194–203. doi:10.1109/TIT.1975.1055349.
Fagin, R., Kumar, R., and Sivakumar, D. (2003). Comparing top k lists. SIAM J. Discret. Math., 17(1):134–160. doi:10.1137/S0895480102412856.
Jégou, H., Tavenard, R., Douze, M., and Amsaleg, L. (2011). Searching in one billion vectors: Re-rank with source coding. In IEEE Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP), pages 861–864, Prague. doi:10.1109/ICASSP.2011.5946540.
Lew, M. S., Sebe, N., Djeraba, C., and Jain, R. C. (2006). Content-based multimedia information retrieval: State of the art and challenges. ACM Trans. Multim. Comput. Commun. Appl., 2(1):1–19. doi:10.1145/1126004.1126005.
Manber, U. and Myers, E. W. (1993). Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948. doi:10.1137/0222058.
Mohamed, H. and Marchand-Maillet, S. (2013). Metric suffix array for large-scale similarity search. In ACM WSDM 2013 Workshop on Large Scale and Distributed Systems for Information Retrieval, Rome, IT, pages 1–6.
Navarro, G. (2016). Compact Data Structures - A Practical Approach. Cambridge University Press.
Published
2023-09-25
How to Cite
ROSA, Frederico R.; LOUZA, Felipe A.; RAZENTE, Humberto L..
Compact metric suffix array. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 38. , 2023, Belo Horizonte/MG.
Anais [...].
Porto Alegre: Sociedade Brasileira de Computação,
2023
.
p. 414-419.
ISSN 2763-8979.
DOI: https://doi.org/10.5753/sbbd.2023.233440.
