From Voices to Data: Tools for Creating a Multitask Atypical Speech Corpus through Citizen Science
Abstract
Speech data from people with atypical speech patterns remain underrepresented in research, which limits the development of effective assistive technologies for communication support. The SofiaFala Ecoa project addresses this gap by designing and deploying an accessible, citizen-science-driven platform for the collaborative collection of a multitask atypical speech corpus. The initiative combines a redesigned mobile application, a web-based recording portal, and interdisciplinary outreach activities to engage participants with speech disorders, their families, and healthcare professionals. The system captures speech across multiple tasks, such as reading, repetition, and spontaneous speech, and supports multimodal data integration. Our objectives are threefold: (1) to establish a scalable and inclusive infrastructure for gathering speech data; (2) to expand the availability of publicly shareable corpora of atypical speech; and (3) to foster community participation in research on assistive speech technologies. The contributions of SofiaFala Ecoa include a pilot corpus covering multiple speech tasks, tools and protocols for ethical and secure data collection, and a demonstration of feasibility through initial engagement metrics. By combining accessibility, inclusivity, and technological innovation, SofiaFala Ecoa lays the groundwork for AI-based speech technologies that better reflect the needs of people with speech disorders.
Keywords:
Atypical speech corpus, Data collection tools, Assistive technology
Published
10/11/2025
How to Cite
GIOIA, Caio Oliveira Di; LEMBOR, Victor Hugo S.; FARES, Samira; MACEDO, Alessandra Alaniz. From Voices to Data: Tools for Creating a Multitask Atypical Speech Corpus through Citizen Science. In: CONCURSO DE TRABALHOS DE INICIAÇÃO CIENTÍFICA - SIMPÓSIO BRASILEIRO DE SISTEMAS MULTIMÍDIA E WEB (WEBMEDIA), 31., 2025, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 73-76. ISSN 2596-1683. DOI: https://doi.org/10.5753/webmedia_estendido.2025.16280.
