Similarity Search in CassandraDB
Abstract
The need for scalability in the Big Data era has lead to flourishing of distributed NOSQL systems. However, none of the widely used solutions available support similarity searches, another important feature in such scenarios. Here we propose to overcome this limitation by extending Cassandra with similarity search capabilities. Our approach is based on using locality sensitive hashing to spread data to nodes, to target nodes for queries, and for efficiently recovering similar data from stable storage at the nodes. We overview how to implement the approach, our test plan, and initial results.
Keywords:
NOSQL, CassandraDB, Similarity Search, Locale-Aware Hashing
References
Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2008). Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1–4:26.
Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the Thiry-fourth Annual ACM Symposium on Theory of Computing, STOC ’02, pages 380–388, New York, NY, USA. ACM.
Decandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Hastorun, D., Decandia, G., and Vogels, W. (2007). Dynamo: Amazon’s highly available Key-Value store. SOSP.
Gao, J. (2007). Efficient support for similarity searches in dht-based peer-to-peer systems. In IEEE International Conference on Communications (ICC’07.
Hua, Y., Xiao, B., Veeravalli, B., and Feng, D. (2012). Locality-sensitive bloom filter for approximate membership query. IEEE Transactions on Computers, 61(6):817–830.
Lakshman, A. and Malik, P. (2010). Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40.
O’neil, P. E., Cheng, E., Gawlick, D., and O’neil, E. J. (1996). The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica, 33:351–385.
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM ’01, pages 149–160, New York, NY, USA. ACM.
Villaça, R. S. (2013). Hamming DHT e HCube: Arquiteturas distribuídas para busca por similaridade. PhD thesis, Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação.
Villaca, R. S., de Paula, L. B., Pasquini, R., and Magalhaes, M. F. (2013). Hamming dht: Taming the similarity search. In Consumer Communications and Networking Conference (CCNC), 2013 IEEE, pages 7–12.
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2008). Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1–4:26.
Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the Thiry-fourth Annual ACM Symposium on Theory of Computing, STOC ’02, pages 380–388, New York, NY, USA. ACM.
Decandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Hastorun, D., Decandia, G., and Vogels, W. (2007). Dynamo: Amazon’s highly available Key-Value store. SOSP.
Gao, J. (2007). Efficient support for similarity searches in dht-based peer-to-peer systems. In IEEE International Conference on Communications (ICC’07.
Hua, Y., Xiao, B., Veeravalli, B., and Feng, D. (2012). Locality-sensitive bloom filter for approximate membership query. IEEE Transactions on Computers, 61(6):817–830.
Lakshman, A. and Malik, P. (2010). Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40.
O’neil, P. E., Cheng, E., Gawlick, D., and O’neil, E. J. (1996). The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica, 33:351–385.
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., and Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for internet applications. In Proc. of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM ’01, pages 149–160, New York, NY, USA. ACM.
Villaça, R. S. (2013). Hamming DHT e HCube: Arquiteturas distribuídas para busca por similaridade. PhD thesis, Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação.
Villaca, R. S., de Paula, L. B., Pasquini, R., and Magalhaes, M. F. (2013). Hamming dht: Taming the similarity search. In Consumer Communications and Networking Conference (CCNC), 2013 IEEE, pages 7–12.
Published
2016-10-04
How to Cite
MOURÃO, Antonio; PASQUINI, Rafael; VILLAÇA, Rodolfo; CAMARGOS, Lasaro.
Similarity Search in CassandraDB. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 31. , 2016, Salvador/BA.
Anais [...].
Porto Alegre: Sociedade Brasileira de Computação,
2016
.
p. 133-138.
ISSN 2763-8979.
DOI: https://doi.org/10.5753/sbbd.2016.24317.
