Evaluating Transformer-Based Architectures for Simultaneous Audio Speech Transcription and Background Audio Captioning

  • João Vitor R. da Silva (UFES)
  • Francisco de Assis Boldt (IFES)
  • Luis A. Souza Jr (UFES)
  • Mariella Berger (IFES)
  • Anselmo Frizera (UFES)
  • Alberto F. De Souza (UFES)
  • Thiago Oliveira-Santos (UFES)
  • Claudine Badue (UFES)

Abstract

This study evaluates models based on the Transformer neural network architecture for simultaneous speech transcription and background audio captioning in scenarios with mixed audio signals. Using Whisper for speech and Prompteus for environmental sounds, the models were tested on the Clotho Voice dataset, which combines Portuguese speech (Common Voice 5.1) with environmental sounds (Clotho 2.1). The results, obtained with the WER and FENSE metrics, show that each model performs well within its area of specialization but degrades when the signals overlap. Whisper remains robust to moderate noise, whereas Prompteus struggles when speech is dominant. These findings highlight the need for hybrid approaches to enable reliable, context-aware audio processing in complex environments.
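
As a rough illustration of the evaluation setup summarized above, the sketch below mixes a speech clip with a background clip at a fixed signal-to-noise ratio, transcribes the mixture with Whisper, and scores the transcript with WER. This is a minimal sketch, not the authors' code: the file names, the 5 dB SNR, the "small" Whisper checkpoint, and the use of the soundfile, jiwer, and openai-whisper packages are assumptions, and the background-captioning side (Prompteus scored with FENSE) is omitted because its interface is not described in this abstract.

    # Minimal sketch of the speech side of the evaluation described above.
    # Assumptions (not taken from the paper): the openai-whisper, jiwer, numpy and
    # soundfile packages; the file names, the 5 dB SNR and the "small" checkpoint
    # are illustrative; both clips are assumed mono and at the same sample rate.
    import numpy as np
    import soundfile as sf
    import whisper
    import jiwer

    def mix_at_snr(speech: np.ndarray, background: np.ndarray, snr_db: float) -> np.ndarray:
        """Overlay a background clip on a speech clip at a target SNR (in dB)."""
        background = np.resize(background, speech.shape)      # loop/trim to the speech length
        p_speech = np.mean(speech ** 2)
        p_background = np.mean(background ** 2) + 1e-12
        gain = np.sqrt(p_speech / (p_background * 10 ** (snr_db / 10)))
        mixed = speech + gain * background
        return mixed / max(np.max(np.abs(mixed)), 1e-12)      # normalize to avoid clipping

    # Hypothetical inputs: a Common Voice utterance and a Clotho background recording.
    speech, sr = sf.read("common_voice_pt_sample.wav")
    background, _ = sf.read("clotho_background_sample.wav")
    sf.write("mixed.wav", mix_at_snr(speech, background, snr_db=5.0), sr)

    # Transcribe the mixture with Whisper and score the transcript with WER.
    model = whisper.load_model("small")
    hypothesis = model.transcribe("mixed.wav", language="pt")["text"]
    reference = "transcrição de referência do Common Voice"   # placeholder reference text
    print("WER:", jiwer.wer(reference, hypothesis))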

References

Ardila, R., Branson, M., Davis, K., Kohler, M., Meyer, J., Henretty, M., Morais, R., Saunders, L., Tyers, F., and Weber, G. (2020). Common voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222.

Crocco, M., Cristani, M., Trucco, A., and Murino, V. (2016). Audio surveillance: A systematic review. ACM Computing Surveys (CSUR), 48(4):1–46.

Czyżewski, A., Skarżyński, H., Kostek, B., and Geremek, A. (1998). Multimedia technology for hearing impaired people. In 1998 IEEE Second Workshop on Multimedia Signal Processing (Cat. No.98EX175), pages 181–186.

Drossos, K., Adavanne, S., and Virtanen, T. (2017). Automated audio captioning with recurrent neural networks. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 374–378.

Drossos, K., Lipping, S., and Virtanen, T. (2020). Clotho: an audio captioning dataset. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 736–740.

Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. (2017). Audio Set: An ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780.

Ikawa, S. and Kashino, K. (2019). Neural audio captioning based on conditional sequence-to-sequence model. In Workshop on Detection and Classification of Acoustic Scenes and Events.

Kadlčík, M., Hájek, A., Kieslich, J., and Winiecki, R. (2023). A whisper transformer for audio captioning trained with synthetic captions and transfer learning.

Kim, C. D., Kim, B., Lee, H., and Kim, G. (2019). AudioCaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119–132, Minneapolis, Minnesota. Association for Computational Linguistics.

Lane, N., Georgiev, P., and Qendro, L. (2015). DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing.

Mei, X., Liu, X., Huang, Q., Plumbley, M. D., and Wang, W. (2021). Audio captioning transformer.

Mei, X., Meng, C., Liu, H., Kong, Q., Ko, T., Zhao, C., Plumbley, M., Zou, Y., and Wang, W. (2023). Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:3339–3354.

Morris, A. C., Maier, V., and Green, P. D. (2004). From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition. In Interspeech, pages 2765–2768.

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR.

Stöter, F.-R., Uhlich, S., Liutkus, A., and Mitsufuji, Y. (2019). Open-Unmix - a reference implementation for music source separation. Journal of Open Source Software, 4:1667.

Wang, Y., Liu, Z., and Huang, J. (2000). Multimedia content analysis - using both audio and visual clues. IEEE Signal Processing Magazine, 17:12–36.

Wu, M., Dinkel, H., and Yu, K. (2019). Audio caption: Listen and tell. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 830–834. IEEE.

Zhou, Z., Zhang, Z., Xu, X., Xie, Z., Wu, M., and Zhu, K. Q. (2022). Can audio captions be evaluated with image caption metrics? In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 981–985.

Published
20/07/2025
SILVA, João Vitor R. da; BOLDT, Francisco de Assis; SOUZA JR, Luis A.; BERGER, Mariella; FRIZERA, Anselmo; SOUZA, Alberto F. De; OLIVEIRA-SANTOS, Thiago; BADUE, Claudine. Evaluating Transformer-Based Architectures for Simultaneous Audio Speech Transcription and Background Audio Captioning. In: SEMINÁRIO INTEGRADO DE SOFTWARE E HARDWARE (SEMISH), 52., 2025, Maceió/AL. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 633-644. ISSN 2595-6205. DOI: https://doi.org/10.5753/semish.2025.9474.