Bringing NURC/SP to Digital Life: the Role of Open-source Automatic Speech Recognition Models

Lucas Rafael Stefanel Gris; Arnaldo Candido Junior; Vinícius G. dos Santos; Bruno A. Papa Dias; Marli Quadros Leite; Flaviane Romani Fernandes Svartman; Sandra Aluísio

doi:10.5753/eniac.2022.227305

Lucas Rafael Stefanel Gris UFG
Arnaldo Candido Junior UNESP
Vinícius G. dos Santos USP
Bruno A. Papa Dias USP
Marli Quadros Leite USP
Flaviane Romani Fernandes Svartman USP
Sandra Aluísio USP

Abstract

The NURC Project, which started in 1969 to study the cultured urban linguistic norm spoken in five Brazilian capitals, compiled a large corpus for each capital. The digitized NURC/SP comprises 375 inquiries totaling 334 hours of recordings made in the city of São Paulo. Although 47 inquiries have transcripts, those transcripts were not aligned with the audio, and the remaining 328 inquiries were not transcribed. This article presents an evaluation and error analysis of three automatic speech recognition models trained with spontaneous speech in Portuguese and one model trained with prepared speech. Using the word error rate (WER) and character error rate (CER) metrics on a manually aligned sample of NURC/SP, the evaluation allowed us to choose the best model to automatically transcribe 284 hours of recordings.

Keywords: NURC/SP corpus, automatic speech recognition evaluation, Portuguese language, spontaneous speech
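
The abstract names WER and CER as the metrics used to compare models. As a minimal sketch of how these metrics are commonly defined (edit distance between reference and hypothesis, divided by the reference length, over words for WER and characters for CER), the Python below may help. It is an illustration only, not the authors' evaluation code: the function names and the example sentences are hypothetical, and any text normalization applied before scoring in the paper is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example pair (not taken from NURC/SP).
reference = "a gente foi para a praia"
hypothesis = "a gente foi pra praia"

print(f"WER = {wer(reference, hypothesis):.3f}")  # 2 word errors / 6 words = 0.333
print(f"CER = {cer(reference, hypothesis):.3f}")  # 3 char errors / 24 chars = 0.125
```

In this sketch spaces count as characters for CER, and both metrics can exceed 1.0 when the hypothesis is much longer than the reference; other conventions exist, so scores are only comparable when computed the same way.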

