Audio MFCC-gram Transformers for respiratory insufficiency detection in COVID-19

  • Marcelo Matheus Gauy USP
  • Marcelo Finger USP


This work explores speech as a biomarker and investigates the detection of respiratory insufficiency (RI) by analyzing speech samples. Previous work [Casanova et al. 2021] constructed a dataset of utterances from COVID-19 patients with respiratory insufficiency and analyzed it with a convolutional neural network, achieving an accuracy of 87.04% and validating the hypothesis that RI can be detected through speech. Here, we study how Transformer neural network architectures can improve performance on RI detection. This approach enables the construction of an acoustic model. By choosing an appropriate pretraining technique, we generate a self-supervised acoustic model, leading to improved performance (96.53%) of Transformers for RI detection.


Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

Baevski, A., Schneider, S., and Auli, M. (2019). vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453.

Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477.

Botelho, M. C., Trancoso, I., Abad, A., and Paiva, T. (2019). Speech as a biomarker for obstructive sleep apnea detection. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5851–5855. IEEE.

Brigham, E. O. and Morrow, R. (1967). The fast fourier transform. IEEE spectrum, 4(12):63–70.

Casanova, E., Gris, L., Camargo, A., Silva, D., Gazzola, M., Sabino, E., Levin, A., Candido Jr, A., Aluisio, S., and Finger, M. (2021). Deep learning against covid-19: Respiratory insufficiency detection in brazilian portuguese speech. To appear in ACL2021.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Gonçalves, S. C. L. (2019). Projeto alip (amostra linguística do interior paulista) e banco de dados iboruna: 10 anos de contribuição com a descrição do português brasileiro. Estudos Linguísticos (São Paulo. 1978), 48(1):276–297.

Gong, Y., Chung, Y.-A., and Glass, J. (2021). Ast: Audio spectrogram transformer. arXiv preprint arXiv:2104.01778.

Laguarta, J., Hueto, F., and Subirana, B. (2020). Covid-19 artificial intelligence diagnosis using only cough recordings. IEEE Open Journal of Engineering in Medicine and Biology, 1:275–281.

Liu, A. T., Li, S.-W., and Lee, H.-y. (2020a). Tera: Self-supervised learning of transformer encoder representation for speech. arXiv preprint arXiv:2007.06028.

Liu, A. T., Yang, S.-w., Chi, P.-H., Hsu, P.-c., and Lee, H.-y. (2020b). Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6419–6423. IEEE.

Mendes, R. B. (2013). Projeto sp2010: Amostra da fala paulistana. Acesso em, 1(12):2013.

Nevler, N., Ash, S., Irwin, D. J., Liberman, M., and Grossman, M. (2019). Validated automatic speech biomarkers in primary progressive aphasia. Annals of Clinical and Translational Neurology, 6(1):4–14.

Oliveira Jr, M. et al. (2016). Nurc digital um protocolo para a digitalização, anotação, arquivamento e disseminação do material do projeto da norma urbana linguística culta (nurc). CHIMERA: Revista de Corpus de Lenguas Romances y Estudios Lingüísticos, 3(2):149–174.

Pham, N.-Q., Nguyen, T.-S., Niehues, J., Müller, M., Stüker, S., and Waibel, A. (2019). Very deep self-attention networks for end-to-end speech recognition. arXiv preprint arXiv:1904.13377.

Pinkas, G., Karny, Y., Malachi, A., Barkai, G., Bachar, G., and Aharonson, V. (2020). Sars-cov-2 detection from voice. IEEE Open Journal of Engineering in Medicine and Biology, 1:268–274.

Raso, T. and Mello, H. (2012). The c-oral-brasil i: reference corpus for informal spoken brazilian portuguese. In International Conference on Computational Processing of the Portuguese Language, pages 362–367. Springer.

Robin, J., Harrison, J. E., Kaufman, L. D., Rudzicz, F., Simpson, W., and Yancheva, M. (2020). Evaluation of speech-based digital biomarkers: Review and recommendations. Digital Biomarkers, 4(3):99–108.

Schneider, S., Baevski, A., Collobert, R., and Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862.

Song, X., Wang, G., Wu, Z., Huang, Y., Su, D., Yu, D., and Meng, H. (2019). Speechxlnet: Unsupervised acoustic model pretraining for self-attention networks. arXiv preprint arXiv:1910.10387.

Taylor, W. L. (1953). “cloze procedure”: A new tool for measuring readability. Journalism quarterly, 30(4):415–433.

Tobin, M. J., Laghi, F., and Jubran, A. (2020). Why covid-19 silent hypoxemia is baffling to physicians. American journal of respiratory and critical care medicine, 202(3):356–360.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30:5998–6008.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. (2019). Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32.

GAUY, Marcelo Matheus; FINGER, Marcelo. Audio MFCC-gram Transformers for respiratory insufficiency detection in COVID-19. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 13., 2021, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 143-152. DOI: