Development of a speech recognition model for Brazilian Portuguese with little data using Wav2vec 2.0

Lucas Rafael Stefanel Gris; Edresson Casanova; Frederico Santos de Oliveira; Anderson da Silva Soares; Arnaldo Candido-Junior

doi:10.5753/bresci.2021.15798

Lucas Rafael Stefanel Gris UTFPR http://orcid.org/0000-0002-2099-5004
Edresson Casanova USP https://orcid.org/0000-0003-0160-7173
Frederico Santos de Oliveira UFMT https://orcid.org/0000-0002-5885-6747
Anderson da Silva Soares UFG http://orcid.org/0000-0002-2967-6077
Arnaldo Candido-Junior UTFPR https://orcid.org/0000-0002-5647-0891

Abstract

Deep learning techniques have proven highly effective across a wide range of tasks, particularly in the development of speech recognition systems. Despite progress in the field, building such systems remains difficult, especially for languages with little openly available data, such as Brazilian Portuguese. Given this limitation, Wav2vec 2.0, an architecture that does not require a large amount of labeled audio, is an interesting alternative. This work therefore evaluates the development of a speech recognizer built from a small amount of freely available data by fine-tuning a Wav2vec 2.0 model pre-trained on many languages. It shows that a speech recognition system for Brazilian Portuguese can be built using only 1 hour of transcribed speech. The fine-tuned model achieves a word error rate (WER) of only 34% on the Common Voice dataset.

Keywords: automatic speech recognition, deep learning, Brazilian Portuguese
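
The abstract reports its result as a WER (word error rate) but does not spell out how the metric is computed. A minimal sketch follows, assuming the standard definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The function name `word_error_rate` and the example sentences are hypothetical, no text normalization is applied, and this is not necessarily the evaluation script used by the authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no normalization
    (casing, punctuation); real evaluations usually normalize first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one of five reference words is dropped.
    reference = "o modelo reconhece a fala"
    hypothesis = "o modelo reconhece fala"
    print(word_error_rate(reference, hypothesis))  # 0.2
```

Under this definition, a WER of 34% means that roughly one word-level edit is needed for every three reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.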
