Evaluation of Automatic Speech Recognition Systems
Abstract
Automatic Speech Recognition (ASR) is an essential task for many applications, such as automatic caption generation for videos, voice search, voice commands for smart homes, and chatbots. Given the increasing popularity of these applications and the advances in deep learning models for transcribing speech into text, this work evaluates the performance of commercial ASR solutions based on deep learning models, namely Facebook Wit.ai, Microsoft Azure Speech, and Google Cloud Speech-to-Text. The results demonstrate that the evaluated solutions differ only slightly; however, Microsoft Azure Speech outperformed the other analyzed APIs.
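To illustrate the kind of comparison described above, the sketch below scores hypothetical transcripts from the three APIs against a reference transcript using word error rate (WER), a standard ASR evaluation metric. This is a minimal sketch for illustration only: the metric choice, the example sentences, and all function names are assumptions not taken from this excerpt.

# Minimal sketch (assumed setup): comparing ASR hypotheses against a reference
# transcript with word error rate (WER) computed via word-level edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between the first i reference words and
    # the first j hypothesis words (dynamic programming).
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    reference = "turn on the living room lights"
    # Hypothetical transcripts returned by the three APIs under comparison.
    hypotheses = {
        "Wit.ai": "turn on the living room light",
        "Azure Speech": "turn on the living room lights",
        "Cloud Speech-to-Text": "turn on living room lights",
    }
    for api, hyp in hypotheses.items():
        print(f"{api}: WER = {word_error_rate(reference, hyp):.2f}")

A lower WER indicates a transcript closer to the reference; averaging WER over a test set is a common way to rank systems in evaluations of this kind.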
