Evaluation of automatic speech recognition models applied to identifying the quality of read-aloud short narratives
Abstract
Advances in automatic speech recognition (ASR) have enabled innovative solutions in the field of Computers in Education, especially for literacy assessment. Applying ASR to children's speech, however, remains challenging, and few studies analyze new technologies in this application domain. This work compares two ASR technologies in the context of children's speech for the automatic assessment of reading fluency: a supervised approach and a self-supervised approach. The evaluation used 59 audio recordings of children reading aloud. Wav2Vec2 combined with a language model achieved a substantially lower word error rate than the other models.
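For readers unfamiliar with the metric, the sketch below shows the standard definition of word error rate (WER): the word-level edit distance between a reference transcript and the ASR output (substitutions, deletions, and insertions), divided by the number of reference words. It illustrates the general metric only; the function name, the lowercasing and whitespace tokenization, and the example sentences are assumptions made here, not the tooling or normalization used by the authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is lowercased and split on whitespace; real
    evaluations often apply further normalization (punctuation, numerals).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the recognizer drops one of five reference words.
wer = word_error_rate("o gato subiu no telhado", "o gato subiu telhado")
print(f"WER = {wer:.2f}")
```

A lower value is better, and because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.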
Keywords:
speech recognition, children's reading, Kaldi, wav2vec2, assessment, fluency
Published
16 November 2022
How to Cite
FERREIRA, André Luiz Vasconcelos; SILVA, Cristiano Nascimento; DE ASSIS, Elias Cyrino; DE SOUZA, Jairo Francisco. Avaliação de modelos para reconhecimento automático de fala aplicados para identificação da qualidade de leituras em voz alta de narrativas breves. In: SIMPÓSIO BRASILEIRO DE INFORMÁTICA NA EDUCAÇÃO (SBIE), 33., 2022, Manaus. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 895-907. DOI: https://doi.org/10.5753/sbie.2022.224744.