Improving Soundscape Retrieval for Bioacoustic Monitoring: An Analysis of Fusion Techniques with Pre-trained Embeddings

  • Andrés D. Peralta UFAM
  • Eulanda Miranda dos Santos UFAM
  • Marcelo Gordo UFAM
  • Jie Xie Nanjing Normal University
  • Juan G. Colonna UFAM / Victoria University of Wellington

Abstract


The retrieval of similar soundscapes is essential for bioacoustic and ecoacoustic monitoring, yet it remains challenging due to the large volume of unlabeled data, environmental noise, and the complexity of acoustic scenes. To overcome the limitations of traditional feature-based methods, this study proposes an efficient system that combines embeddings extracted from a pretrained deep learning model with a noise reduction technique and feature fusion strategies, and stores the resulting vectors in a vector database to enable similarity-based retrieval. We evaluated the system using bird, amphibian, and mammal recordings across four experimental methodologies, including a use case focused on endangered species. Results show that embedding vectors consistently outperform traditional Mel-frequency cepstral coefficient (MFCC) features in capturing acoustic similarity, and that approximate nearest-neighbor search (HNSW, Hierarchical Navigable Small World graphs) significantly improves both retrieval precision and query efficiency. Additionally, the system effectively retrieves recordings of the critically endangered species Crax alberti and maps their geographic distribution, highlighting its potential for conservation planning and early-warning monitoring.

Keywords: Bioacoustics, Deep Learning, Pre-trained models, Vector database
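
To make the retrieval idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes one plausible fusion strategy (concatenating two L2-normalized feature views per recording) and uses exact cosine-similarity search with NumPy as a stand-in for the HNSW index and vector database mentioned above. The arrays `view_a` and `view_b`, their dimensions, and all function names are hypothetical, and the data is synthetic.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def fuse_by_concatenation(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Assumed fusion variant: normalize each view, concatenate, renormalize."""
    return l2_normalize(np.hstack([l2_normalize(a), l2_normalize(b)]))


def top_k_cosine(index: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Exact search: indices of the k rows of `index` most similar to `query`."""
    scores = index @ query
    return np.argsort(-scores)[:k]


# Synthetic stand-ins for two per-recording feature views (not real data).
rng = np.random.default_rng(0)
view_a = rng.normal(size=(100, 32))  # hypothetical: pretrained-model embeddings
view_b = rng.normal(size=(100, 13))  # hypothetical: a second feature view

index = fuse_by_concatenation(view_a, view_b)

# Query with recording 7; the recording itself should rank first.
neighbors = top_k_cosine(index, index[7], k=5)
print(neighbors)
```

On a large archive, the exact scan in `top_k_cosine` would typically be replaced by an approximate index such as HNSW, which trades a small amount of recall for much faster queries.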

Published
10/11/2025
PERALTA, Andrés D.; SANTOS, Eulanda Miranda dos; GORDO, Marcelo; XIE, Jie; COLONNA, Juan G. Improving Soundscape Retrieval for Bioacoustic Monitoring: An Analysis of Fusion Techniques with Pre-trained Embeddings. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 31., 2025, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 312-320. DOI: https://doi.org/10.5753/webmedia.2025.15196.
