Semantic retrieval of soundscapes using vector databases
Abstract
Semantic retrieval of soundscapes is emerging as a crucial component of ecosystem monitoring. However, because monitoring is continuous over time, the volume of collected audio recordings is very large, which poses considerable challenges. Beyond the sheer volume of data, the recordings also lack labels. Several approaches based on supervised machine learning already exist for recognizing and classifying animal species from their vocalizations. However, few studies implement semantic retrieval of soundscapes using pre-trained deep learning models and vector databases (for example, VectorDB). In this study, we developed a vector database to query and retrieve similar acoustic landscapes containing anuran vocalizations.
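The retrieval idea described in the abstract can be illustrated with a minimal sketch: each recording is mapped to an embedding vector, the vectors are stored in an index, and a query recording is answered by returning its nearest neighbours. The code below is only an illustration under stated assumptions, not the implementation used in this study. `ToyVectorIndex` and `fake_embedding` are hypothetical names, the embeddings are seeded random vectors standing in for the output of a pre-trained model, and the search is an exhaustive cosine-similarity scan rather than the VectorDB API or an approximate index.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorIndex:
    """Hypothetical in-memory index with exhaustive (exact) search."""

    def __init__(self):
        self._items = []  # list of (recording_id, embedding) pairs

    def add(self, recording_id, embedding):
        self._items.append((recording_id, list(embedding)))

    def search(self, query, k=3):
        # Score every stored embedding against the query, best first.
        scored = [
            (cosine_similarity(query, emb), rid) for rid, emb in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(rid, score) for score, rid in scored[:k]]


def fake_embedding(seed, dim=8):
    """Placeholder for a pre-trained audio model: a seeded random vector."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


# Index five placeholder recordings.
index = ToyVectorIndex()
for i in range(5):
    index.add(f"recording_{i}", fake_embedding(i))

# Query with the embedding of an already indexed recording.
query = fake_embedding(2)
results = index.search(query, k=3)
for recording_id, score in results:
    print(recording_id, round(score, 3))
```

Because the query reuses the embedding of `recording_2`, that recording is ranked first, which is a quick sanity check for the index. An exhaustive scan costs time linear in the number of stored recordings, so a large collection would more plausibly rely on an approximate nearest-neighbour structure instead.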