Automatic Speech Recognition for Children: A Systematic Review of Models, Toolkits, and Adaptations
Abstract
Automatic Speech Recognition (ASR) has emerged as a transformative technology in education, yet its application to children’s speech remains underexplored. This paper presents a systematic review of ASR for children, focusing on its effectiveness in educational contexts. Following the guidelines of Kitchenham and Charters (2007), we analyzed academic articles published between 2019 and 2024, sourced from the ACM Digital Library, IEEE Xplore, and Scopus. A total of 16 articles were selected based on predefined inclusion criteria, including relevance to children’s speech and educational applications. The review identifies Deep Neural Networks (DNNs) and adversarial learning models as the most effective approaches for recognizing children’s speech. Key findings highlight the potential of ASR to enhance language learning and development in children, particularly in low-resource contexts. However, challenges such as data scarcity and the need for adaptation to diverse linguistic environments remain significant barriers. This study contributes to the ongoing discussion on innovative educational technologies by providing a comprehensive analysis of current trends and future directions in ASR for children.
Keywords:
Automatic Speech Recognition, Children’s Speech, Educational Technology, Language Development, Systematic Review
References
Abion, C., Lumapag, N. C., Ramirez, J. C., Resulto, C., and Lucas, C. R. (2023). Comparison of data augmentation techniques on Filipino ASR for children’s speech. In 2023 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), pages 60–65. IEEE.
Aljedaani, W., Krasniqi, R., Aljedaani, S., Mkaouer, M. W., Ludi, S., and Al-Raddah, K. (2023). If online learning works for you, what about deaf students? Emerging challenges of online learning for deaf and hearing-impaired students during COVID-19: A literature review. Universal Access in the Information Society, 22(3):1027–1046.
Alyoussef, I. (2021). E-learning system use during emergency: An empirical study during the COVID-19 pandemic. Frontiers in Education, 6:677753. Frontiers Media SA.
Assmann, P. F., Nearey, T. M., and Bharadwaj, S. V. (2013). Developmental patterns in children’s speech: Patterns of spectral change in vowels. Vowel inherent spectral change, pages 199–230.
Bhardwaj, V., Ben Othman, M. T., Kukreja, V., Belkhier, Y., Bajaj, M., Goud, B. S., Rehman, A. U., Shafiq, M., and Hamam, H. (2022). Automatic speech recognition (ASR) systems for children: A systematic literature review. Applied Sciences, 12(9):4419.
Chermak, G. D. and Schneiderman, C. R. (1986). Speech timing variability of children and adults. Journal of Phonetics, 13(4):477–480.
Claus, F., Gamboa-Rosales, H., Petrick, R., Hain, H.-U., and Hoffmann, R. (2013). A survey about databases of children’s speech. In INTERSPEECH.
Coughler, C., Quinn de Launay, K. L., Purcell, D. W., Oram Cardy, J., and Beal, D. S. (2022). Pediatric responses to fundamental and formant frequency altered auditory feedback: A scoping review. Frontiers in Human Neuroscience, 16:858863.
Dorado, B. and Villanueva, A. (2023). Development of low-latency and real-time Filipino children automatic speech recognition system using deep neural network. In ISDFS 2023 - 11th International Symposium on Digital Forensics and Security.
Duan, R. (2023). Joint learning feature and model adaptation for unsupervised acoustic modelling of child speech.
Duan, R. and Chen, N. F. (2020). Unsupervised feature adaptation using adversarial multi-task training for automatic evaluation of children’s speech. In Proceedings of INTERSPEECH, pages 3037–3041.
Duan, R. and Chen, N. F. (2021). Senone-aware adversarial multi-task training for unsupervised child to adult speech adaptation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7758–7762.
Eskenazi, M., Mostow, J., and Graff, D. (1997). The CMU Kids Corpus. Linguistic Data Consortium, 11.
Garg, R., Cui, H., Seligson, S., Zhang, B., Porcheron, M., Clark, L., Cowan, B. R., and Beneteau, E. (2022). The last decade of hci research on children and voice-based conversational agents. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pages 1–19.
Leonard, R. G. and Doddington, G. (1993). TIDIGITS speech corpus. Texas Instruments, Inc.
Gelin, L., Pellegrini, T., Pinquier, J., and Daniel, M. (2021). Simulating reading mistakes for child speech transformer-based phone recognition. In Proceedings of INTERSPEECH, pages 1918–1922.
Gerosa, M., Giuliani, D., and Brugnara, F. (2007). Analyzing children’s speech: An acoustic study of consonants and consonant-vowel transition. In INTERSPEECH.
Getman, Y., Phan, N., Al-Ghezi, R., Voskoboinik, E., Singh, M., Grosz, T., Kurimo, M., Salvi, G., Svendsen, T., Strombergsson, S., Smolander, A., and Ylinen, S. (2023). Developing an AI-assisted low-resource spoken language learning app for children. IEEE Access, 11:86025–86037.
Gold, B., Morgan, N., and Ellis, D. (2011). Speech and audio signal processing: processing and perception of speech and music. John Wiley & Sons.
Hair, A., Ballard, K. J., Ahmed, B., and Gutierrez-Osuna, R. (2019). Evaluating automatic speech recognition for child speech therapy applications. In Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility, pages 578–580.
Hair, A., Ballard, K. J., Markoulli, C., Monroe, P., Mckechnie, J., Ahmed, B., and Gutierrez-Osuna, R. (2021a). A longitudinal evaluation of tablet-based child speech therapy with Apraxia World. ACM Transactions on Accessible Computing (TACCESS), 14(1):1–26.
Hair, A., Zhao, G., Ahmed, B., Ballard, K. J., and Gutierrez-Osuna, R. (2021b). Assessing posterior-based mispronunciation detection on field-collected recordings from child speech therapy sessions. In Proceedings of INTERSPEECH, pages 181–185.
Husni, H. and Jamaludin, Z. (2009). Dyslexic children’s reading pattern as input for ASR: Data, analysis, and pronunciation model. Journal of Information and Communication Technology, 8:1–13.
Jain, R., Yiwere, M. Y., Bigioi, D., Corcoran, P., and Cucu, H. (2022). A text-to-speech pipeline, evaluation methodology, and initial fine-tuning results for child speech synthesis. IEEE Access, 10:47628–47642.
Kathania, H. K., Kadiri, S. R., Alku, P., and Kurimo, M. (2021). Spectral modification for recognition of children’s speech under mismatched conditions. In Dobnik, S. and Øvrelid, L., editors, Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), pages 94–100, Reykjavik, Iceland (Online). Linköping University Electronic Press, Sweden.
Kitchenham, B. and Charters, S. (2007). Guidelines for performing systematic literature reviews in software engineering. Technical report, Software Engineering Group, Keele University and Department of Computer Science, University of Durham.
Kothalkar, P. V., Datla, S., Dutta, S., Hansen, J. H., Seven, Y., Irvin, D., and Buzhardt, J. (2021). Measuring frequency of child-directed wh-question words for alternate preschool locations using speech recognition and location tracking technologies. In Companion Publication of the 2021 International Conference on Multimodal Interaction, pages 414–418.
Li, J., Deng, L., Haeb-Umbach, R., and Gong, Y. (2015). Robust Automatic Speech Recognition: A Bridge to Practical Applications. Academic Press.
Li, J. et al. (2022). Recent advances in end-to-end automatic speech recognition. APSIPA Transactions on Signal and Information Processing, 11(1).
Liberman, A. M. (1989). Reading is hard just because listening is easy. Brain and reading, pages 197–205.
Lileikyte, R., Irvin, D., and Hansen, J. H. (2022). Assessing child communication engagement and statistical speech patterns for American English via speech recognition in naturalistic active learning spaces. Speech Communication, 140:98–108.
Lyakso, E., Frolova, O., Dmitrieva, E., Grigorev, A., Kaya, H., Salah, A. A., and Karpov, A. (2015). EmoChildRu: Emotional child Russian speech corpus. In Speech and Computer: 17th International Conference, SPECOM 2015, Athens, Greece, September 20-24, 2015, Proceedings 17, pages 144–152. Springer.
Misra, A., Loukina, A., Beigman Klebanov, B., Gyawali, B., and Zechner, K. (2021). A good start is half the battle won: Unsupervised pre-training for low resource children’s speech recognition for an interactive reading companion. Lecture Notes in Computer Science, 12748 LNAI:306–317.
Nagano, T., Fukuda, T., Suzuki, M., and Kurata, G. (2019). Data augmentation based on vowel stretch for improving children’s speech recognition. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 502–508.
Neri, A., Cucchiarini, C., and Strik, H. (2003). Automatic speech recognition for second language learning: How and why it actually works. Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS), 5:1157–1160.
Ngo, T. T.-N., Chen, H. H.-J., and Lai, K. K.-W. (2024). The effectiveness of automatic speech recognition in ESL/EFL pronunciation: A meta-analysis. ReCALL, 36(1):4–21.
Potamianos, A. and Narayanan, S. (2003). Robust recognition of children’s speech. IEEE Transactions on Speech and Audio Processing, 11(6):603–616.
Russell, M. (2006). The PF-STAR British English children’s speech corpus. The Speech Ark Limited.
Shahin, M., Ahmed, B., and Epps, J. (2022). Speaker- and age-invariant training for child acoustic modeling using adversarial multi-task learning. arXiv preprint arXiv:2210.10231.
Shinohara, Y. (2016). Adversarial multi-task learning of deep neural networks for robust speech recognition. In Interspeech, pages 2369–2372. San Francisco, CA, USA.
Southwell, R., Pugh, S., Perkoff, E. M., Clevenger, C., Bush, J. B., Lieber, R., Ward, W., Foltz, P., and D’Mello, S. (2022). Challenges and feasibility of automatic speech recognition for modeling student collaborative discourse in classrooms. International Educational Data Mining Society.
Taniya, Bhardwaj, V., and Kadyan, V. (2020). Deep neural network trained Punjabi children speech recognition system using Kaldi toolkit. In 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA), pages 374–378.
UNICEF (2017). The state of the world’s children 2017: Children in a digital world. UNICEF.
Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N. E. Y., Heymann, J., Wiesner, M., Chen, N., et al. (2018). ESPnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015.
Wilks, T., Gerber, R. J., and Erdie-Lalena, C. (2010). Developmental milestones: cognitive development. Pediatrics in Review, 31(9):364–367.
Yeung, G., Fan, R., and Alwan, A. (2021). Fundamental frequency feature normalization and data augmentation for child speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6993–6997.
Published
2025-07-20
How to Cite
PARADEDA, Raul Benites; FURTADO, Karine Bezerra; MOSCOSO, Giulia de Oliveira. Automatic Speech Recognition for Children: A Systematic Review of Models, Toolkits, and Adaptations. In: WORKSHOP ON COMPUTING EDUCATION (WEI), 33., 2025, Maceió/AL. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 49-62. ISSN 2595-6175. DOI: https://doi.org/10.5753/wei.2025.7081.
