Evaluation of models for automatic speech recognition applied to identify the quality of short narratives reading aloud

  • André Luiz Vasconcelos Ferreira Federal University of Juiz de Fora
  • Cristiano Nascimento Silva Federal University of Juiz de Fora
  • Elias Cyrino de Assis Federal University of Juiz de Fora
  • Jairo Francisco de Souza Federal University of Juiz de Fora https://orcid.org/0000-0002-0911-7980

Abstract


Advances in automatic speech recognition (ASR) have enabled innovative solutions in Informatics in Education, especially in literacy assessment. Applying ASR to children's speech remains challenging, however, and few studies analyze recent technologies in this domain. This article compares two ASR approaches, one supervised and one self-supervised, on children's speech for the automatic assessment of reading fluency. The evaluation used 59 audio recordings of children reading aloud. Wav2Vec2 combined with a language model achieved a substantially lower word error rate than the other models evaluated.
Keywords: speech recognition, children's reading, kaldi, wav2vec2, assessment, fluency
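
The abstract compares models by word error rate (WER). As a minimal sketch, assuming the standard definition (substitutions, deletions, and insertions from a word-level edit distance, divided by the number of reference words), the metric could be computed as shown here. The function name `word_error_rate` and the sample sentences are illustrative assumptions, not taken from the paper, and the paper's own text normalization and scoring tools may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; whitespace tokenization is an
    assumption, and no case or punctuation normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative strings only: one reference word is missing
    # from the recognizer output.
    expected_text = "o gato subiu no telhado"
    recognized = "o gato subiu telhado"
    print(f"WER = {word_error_rate(expected_text, recognized):.2f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference. In a reading assessment setting the reference would typically be the passage the child was asked to read or a human transcription of the recording; which of the two the paper uses is not stated in the abstract.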

Published
2022-11-16
FERREIRA, André Luiz Vasconcelos; SILVA, Cristiano Nascimento; DE ASSIS, Elias Cyrino; DE SOUZA, Jairo Francisco. Evaluation of models for automatic speech recognition applied to identify the quality of short narratives reading aloud. In: BRAZILIAN SYMPOSIUM ON COMPUTERS IN EDUCATION (SBIE), 33. , 2022, Manaus. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022 . p. 895-907. DOI: https://doi.org/10.5753/sbie.2022.224744.