Evaluation of Automatic Speech Recognition Systems
Abstract
Automatic Speech Recognition (ASR) is an essential task for many applications, such as automatic video captioning, voice search, voice commands for smart homes, and chatbots. Given the growing popularity of these applications and the advances in deep learning models for speech-to-text transcription, this work evaluates the performance of commercial ASR solutions based on deep learning models, namely Facebook Wit.ai, Microsoft Azure Speech, and Google Cloud Speech-to-Text. The results show that the evaluated solutions differ only slightly in performance; nevertheless, Microsoft Azure Speech outperformed the other APIs analyzed.
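The abstract does not state which metric is used to compare the APIs; word error rate (WER) is the standard measure for ASR output, so the sketch below illustrates, under that assumption only, how a transcript returned by one of the APIs could be scored against a reference transcript. The function name and the example sentences are hypothetical and not taken from the paper.

```python
# Minimal sketch: word error rate (WER) over whitespace-tokenized transcripts.
# WER = (substitutions + deletions + insertions) / number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical usage: score an API transcript against its reference.
print(wer("o reconhecimento automático de fala é essencial",
          "o reconhecimento automático da fala essencial"))
```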