Acoustic Analysis of Prosodic Features in Natural versus Synthesized Speech Samples from YourTTS and SYNTACC Models

  • Julio Cesar Galdino (USP)
  • Gustavo Evangelista Araújo (USP)
  • Arnaldo Candido Junior (UNESP)
  • Miguel Oliveira Jr. (UFAL)
  • Moacir Antonelli Ponti (USP)
  • Sandra Maria Aluísio (USP)

Abstract

This study presents an acoustic analysis of prosodic features in natural and synthesized speech samples, using two state-of-the-art speech synthesis models: YourTTS and SYNTACC. Using spontaneous speech data, the study compares the durations of intonational units and syllables produced by these models with those of natural speech. The findings reveal that both models generate intonational units and syllables that are significantly shorter and less variable in duration than those of natural speech. These results highlight the differences in syllable duration and speech rate between synthesized and natural speech, and they underscore the need for more refined prosodic metrics to assess the quality of synthesized speech accurately.
Keywords: Acoustic Analysis of Prosodic Features, Speech Synthesis Models Evaluation, Portuguese language, Spontaneous Speech

Published
17/11/2024
GALDINO, Julio Cesar; ARAÚJO, Gustavo Evangelista; CANDIDO JUNIOR, Arnaldo; OLIVEIRA JR., Miguel; PONTI, Moacir Antonelli; ALUÍSIO, Sandra Maria. Acoustic Analysis of Prosodic Features in Natural versus Synthesized Speech Samples from YourTTS and SYNTACC Models. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 21., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 304-315. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2024.245092.
