Semantic retrieval of soundscapes using vector databases
Abstract
Semantic retrieval of soundscapes is emerging as a crucial component of ecosystem monitoring. Because acoustic monitoring runs continuously over time, it produces vast volumes of audio recordings, and these recordings typically lack labels. Several supervised machine learning approaches have been proposed to recognize and classify animal species from their vocalizations. However, few studies have implemented semantic retrieval of soundscapes by combining pre-trained deep learning models with vector databases (e.g., VectorDB). In this study, we develop a vector database for querying and retrieving soundscapes containing similar anuran calls.
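The retrieval pipeline described above can be sketched minimally: embed each recording with a pretrained audio model, store the embeddings in a vector database, and answer queries by nearest-neighbor search. The sketch below is illustrative only, not the paper's implementation; `embed_clip` is a hypothetical stand-in for a pretrained embedding model, and brute-force cosine similarity stands in for an approximate nearest-neighbor index such as HNSW.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_clip(clip_id: int, dim: int = 128) -> np.ndarray:
    """Hypothetical stand-in for a pretrained audio embedding model
    (e.g., a CNN over log-mel spectrograms)."""
    return rng.standard_normal(dim)

# Build a small in-memory "vector database": one embedding per clip,
# L2-normalized so that dot product equals cosine similarity.
ids = list(range(100))
db = np.stack([embed_clip(i) for i in ids])
db /= np.linalg.norm(db, axis=1, keepdims=True)

def query(q: np.ndarray, k: int = 5) -> list[int]:
    """Return ids of the k most similar clips by cosine similarity.
    A production system would use an ANN index (e.g., HNSW) instead
    of this brute-force scan."""
    q = q / np.linalg.norm(q)
    sims = db @ q                  # cosine similarity to every stored clip
    top = np.argsort(-sims)[:k]    # indices of the k highest scores
    return [ids[i] for i in top]

# Querying with a stored clip's own embedding returns that clip first.
result = query(db[42])
print(result[0])  # 42
```

In practice the brute-force scan is the only part that changes when moving to a real vector database: the embed-then-search structure stays the same, while the index trades exactness for sublinear query time over millions of recordings.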
References
L. Barrington, A. Chan, D. Turnbull, and G. Lanckriet. Audio information retrieval using semantic similarity. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP ’07, volume 2, pages II–725–II–728, 2007. DOI: 10.1109/ICASSP.2007.366338.
M. J. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M. A. Roch, S. Gannot, and C.-A. Deledalle. Machine learning in acoustics: Theory and applications. The Journal of the Acoustical Society of America, 146:3590–3628, 2019. DOI: 10.1121/1.5133944.
J. Bjorck, B. H. Rappazzo, D. Chen, R. Bernstein, P. H. Wrege, and C. P. Gomes. Automatic Detection and Compression for Passive Acoustic Monitoring of the African Forest Elephant. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 476–484, 2019. DOI: 10.1609/aaai.v33i01.3301476.
D. V. Devalraju and P. Rajan. Multiview embeddings for soundscape classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:1197–1206, 2022. DOI: 10.1109/TASLP.2022.3153272.
L. Fanioudakis and I. Potamitis. Deep Networks tag the location of bird vocalisations on audio spectrograms. arXiv preprint, 2017. DOI: 10.48550/arXiv.1711.04347.
E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis, X. Favory, J. Pons, and X. Serra. General-purpose tagging of freesound audio with audioset labels: Task description, dataset, and baseline. arXiv preprint, 2018.
B. Ghani, T. Denton, S. Kahl, and H. Klinck. Global birdsong embeddings enable superior transfer learning for bioacoustic classification. Scientific Reports, 13, 2023. DOI: 10.1038/s41598-023-49989-z.
M. Hagiwara, B. Hoffman, J.-Y. Liu, M. Cusimano, F. Effenberger, and K. Zacarian. Beans: The benchmark of animal sounds. In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023. DOI: 10.1109/ICASSP49357.2023.10096686.
S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson. CNN Architectures for Large-Scale Audio Classification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135, 2017. DOI: 10.1109/ICASSP.2017.7952132.
A. Jati and D. Emmanouilidou. Supervised deep hashing for efficient audio event retrieval. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4497–4501, 2020. DOI: 10.1109/ICASSP40776.2020.9053766.
L. Jin, Z. Li, and J. Tang. Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals. IEEE Transactions on Neural Networks and Learning Systems, 2023. DOI: 10.1109/TNNLS.2020.2997020.
Jina AI. jina-ai/vectordb: A Python vector database you just need, no more, no less. 2023. URL [link].
A. S. Koepke, A.-M. Oncescu, J. F. Henriques, Z. Akata, and S. Albanie. Audio retrieval with natural language queries: A benchmark study. IEEE Transactions on Multimedia, 25:2675–2685, 2023. DOI: 10.1109/TMM.2022.3149712.
A. Kumar, M. Khadkevich, and C. Fügen. Knowledge transfer from weakly labeled audio using convolutional neural network for sound events and scenes. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 326–330, 2018. DOI: 10.1109/ICASSP.2018.8462200.
Y. Lin, X. Chen, R. Takashima, and T. Takiguchi. Zero-shot sound event classification using a sound attribute vector with global and local feature learning. In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023. DOI: 10.1109/ICASSP49357.2023.10096367.
Y. A. Malkov and D. A. Yashunin. Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 824–836, 2020. DOI: 10.1109/TPAMI.2018.2889473.
L. Meihan, D. Yongxing, B. Yan, and D. Ling-Yu. Deep product quantization module for efficient image retrieval. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4382–4386, 2020. DOI: 10.1109/ICASSP40776.2020.9054175.
S. Nakamura, K. Hiyane, F. Asano, T. Nishiura, and T. Yamada. Acoustical sound database in real environments for sound scene understanding and hands-free speech recognition. In LREC, pages 965–968, 2000.
F. Petersen, H. Kuehne, C. Borgelt, and O. Deussen. Differentiable top-k classification learning. In Proceedings of the 39th International Conference on Machine Learning, 2022.
K. J. Piczak. ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd ACM International Conference on Multimedia, page 1015–1018. Association for Computing Machinery, 2015. DOI: 10.1145/2733373.2806390.
B. C. Pijanowski, L. J. Villanueva-Rivera, S. L. Dumyahn, A. Farina, B. L. Krause, B. M. Napoletano, S. H. Gage, and N. Pieretti. Soundscape Ecology: The Science of Sound in the Landscape. BioScience, 61:203–216, 2011. DOI: 10.1525/bio.2011.61.3.6.
K. Presannakumar and A. Mohamed. Deep learning based source identification of environmental audio signals using optimized convolutional neural networks. Applied Soft Computing, 2023. DOI: 10.1016/j.asoc.2023.110423.
S. J. S. Quaderi, S. A. Labonno, S. Mostafa, and S. Akhter. Identify the beehive sound using deep learning. arXiv.org, 2022. DOI: 10.48550/arXiv.2209.01374.
T. Sainburg, M. Thielk, and T. Q. Gentner. Finding, visualizing, and quantifying latent structure across diverse animal vocal repertoires. PLoS computational biology, 16(10), 2020.
R. M. Schafer. The Soundscape: Our Sonic Environment and the Tuning of the World. Destiny Books, Rochester, VT, 1993. ISBN 978-0-89281-455-8.
M. Slaney. Semantic-audio retrieval. In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 4, pages IV–4108–IV–4111, 2002. DOI: 10.1109/ICASSP.2002.5745561.
M. Tan and Q. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 6105–6114, 2019.
C. Wang, H. Yang, and C. Meinel. Deep semantic mapping for cross-modal retrieval. In IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), pages 234–241, 2015. DOI: 10.1109/ICTAI.2015.45.
H. Xu. Cross-Modal Sound-Image Retrieval Based on Deep Collaborative Hashing. In 5th International Conference on Information Science, Computer Technology and Transportation (ISCTT), pages 188–197, 2020. DOI: 10.1109/ISCTT51595.2020.00041.
P. Yadav, P. Sujatha, P. Dhavachelvan, and K. Prasad. Weight based precision oriented metrics for multilingual information retrieval system. In IEEE International Conference on Advanced Communications, Control and Computing Technologies, pages 1114–1119, 2014. DOI: 10.1109/ICACCCT.2014.7019271.
