Speech Recognition Models in Assisting Medical History

  • Yanna Torres Gonçalves Universidade Federal do Ceará (UFC)
  • João Victor B. Alves Universidade Federal do Ceará (UFC)
  • Breno Alef Dourado Sá Universidade Federal do Ceará (UFC)
  • Lázaro Natanael da Silva Universidade Federal do Ceará (UFC)
  • José A. Fernandes de Macedo Universidade Federal do Ceará (UFC)
  • Ticiana L. Coelho da Silva Universidade Federal do Ceará (UFC)

Abstract


This paper addresses a challenge reported by health professionals: up to 50% of a medical consultation is spent recording the patient's medical history. To streamline this process, we propose using Automatic Speech Recognition (ASR) models to convert spoken language into text. We assess the effectiveness of pre-trained ASR models for medical history transcription in Brazilian Portuguese, and we incorporate language models to enhance the ASR output, aiming to improve the accuracy and semantic fidelity of the transcriptions. Our results show that integrating a 5-gram language model with Wav2Vec2 PT significantly reduces transcription errors while also maintaining superior performance in capturing textual nuances and similarity.
Keywords: Medical History, Automatic Speech Recognition, Language Model
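
As an illustration of how "transcription errors" can be counted, the sketch below computes a word error rate (WER) between a reference transcript and an ASR hypothesis. This is a minimal sketch under stated assumptions: the abstract does not name its error metric, so WER is assumed here as a common choice, and the function name `word_error_rate` and the example sentences are hypothetical, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one deleted word ("de") and one substitution
# ("há" -> "a") against an 8-word reference.
reference = "paciente relata dor de cabeça há três dias"
hypothesis = "paciente relata dor cabeça a três dias"
print(word_error_rate(reference, hypothesis))
```

Under this assumed metric, a language model applied during decoding helps when it steers the decoder toward the reference wording, which lowers the edit distance in the numerator.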

Published
14/10/2024
GONÇALVES, Yanna Torres; ALVES, João Victor B.; SÁ, Breno Alef Dourado; SILVA, Lázaro Natanael da; MACEDO, José A. Fernandes de; COELHO DA SILVA, Ticiana L. Speech Recognition Models in Assisting Medical History. In: SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 485-497. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2024.240270.