Medical Dialogue Audio Transcription: Dataset and Benchmarking of ASR Models

  • Aline E. Gassenn Universidade de São Paulo (USP)
  • Luís G. M. Andrade Universidade Estadual Paulista (UNESP)
  • Douglas Teodoro Universidade de Genebra
  • José F. Rodrigues-Jr Universidade de São Paulo (USP)

Abstract


The development of Automatic Speech Recognition (ASR) technologies for healthcare applications is hindered by the limited availability of publicly accessible speech corpora that reflect both natural medical dialogues and the acoustic conditions typically found in clinical environments. In this study, we present the creation and characterization of MedDialogue-Audio, a new synthetic English-language corpus designed to address this gap. The dataset was derived from the MedDialog-EN transcription set and enriched through a multi-stage processing pipeline that involved text normalization with a large language model, speech synthesis, and the controlled addition of both white noise and hospital ambient sounds. We provide descriptive statistics for the corpus, which comprises more than 10,000 dialogues, as well as benchmarking results from leading ASR models. The experiments assess transcription performance across varying signal-to-noise ratios and establish baseline metrics to support future research in this field.

Keywords: Audio, Automatic Speech Recognition, Medical Dataset, Text-to-Speech
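The controlled noise addition at varying signal-to-noise ratios described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions — mono float waveforms at a common sample rate, NumPy arrays, and a hypothetical `mix_at_snr` helper — not the authors' actual pipeline.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise signal into clean speech at a target SNR in decibels.

    Both inputs are 1-D float arrays at the same sample rate; the noise
    is tiled and truncated to match the speech length before mixing.
    """
    # Repeat the noise if it is shorter than the speech, then truncate.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    # Average power (mean squared amplitude) of each signal.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)

    # Choose a gain so that p_speech / (gain**2 * p_noise) == 10**(snr_db / 10),
    # i.e. the mixed signal attains the requested SNR exactly.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

The same routine covers both noise conditions in the paper (white noise and hospital ambient recordings); only the `noise` array changes, while `snr_db` sweeps the evaluation range.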

Published
29/09/2025
GASSENN, Aline E.; ANDRADE, Luís G. M.; TEODORO, Douglas; RODRIGUES-JR, José F. Medical Dialogue Audio Transcription: Dataset and Benchmarking of ASR Models. In: DATASET SHOWCASE WORKSHOP (DSW), 7., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 71-82. DOI: https://doi.org/10.5753/dsw.2025.248010.