EyetrackingMOS: Proposal for an online evaluation method for speech synthesis models
Abstract
Evaluating Text-To-Speech (TTS) systems is challenging, as the increasing quality of synthesis makes it difficult to discriminate models’ ability to reproduce prosodic attributes, especially for Brazilian Portuguese. Offline evaluation metrics do not capture our genuine reactions to audio stimuli. Therefore, we propose an online evaluation method using eye-tracking. Our experiments with 76 annotators show a reasonable correlation between EyetrackingMOS and MOS, as well as a reduction in the total evaluation time. We believe this metric provides precise and potentially fast information to complement existing evaluation methods.
Keywords:
Speech Synthesis Models Evaluation, Portuguese language, spontaneous speech, eyetracking
References
ALMEIDA, R. A. S. d., OLIVEIRA JR., M., and COZIJN, R. (2021). Paradigma do Mundo Visual: Método de Rastreamento Ocular, chapter 5. Blucher Open Access.
Batista, N. A. R. (2019). Estudo sobre identificação automática de sotaques regionais brasileiros baseada em modelagens estatísticas e técnicas de aprendizado de máquina. Master’s thesis, Unicamp.
Cagliari, L. C. (1992). Prosódia: algumas funções dos supra-segmentos. Cadernos de estudos linguísticos, 23:137–151.
Casanova, E., Shulby, C., Gölge, E., Müller, N. M., de Oliveira, F. S., Junior, A. C., da Silva Soares, A., Aluisio, S. M., and Ponti, M. A. (2021). Sc-glowtts: an efficient zero-shot multi-speaker text-to-speech model.
Casanova, E., Weber, J., Shulby, C. D., Junior, A. C., Gölge, E., and Ponti, M. A. (2022). Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone. In International Conference on Machine Learning, pages 2709–2720. PMLR.
Caseli, H. M. and Nunes, M. G. V., editors (2024). Processamento de Linguagem Natural: Conceitos, Técnicas e Aplicações em Português. BPLN, 2 edition.
Choi, Y., Jung, Y., Suh, Y., and Kim, H. (2022). Learning to maximize speech quality directly using mos prediction for neural text-to-speech. IEEE Access, 10:52621–52629.
Cooper, E., Huang, W.-C., Tsao, Y., Wang, H.-M., Toda, T., and Yamagishi, J. (2024). A review on subjective and objective evaluation of synthetic speech. Acoustical Science and Technology, 45(4):161–183.
Hoogeboom, E., Van Den Berg, R., and Welling, M. (2019). Emerging convolutions for generative normalizing flows. In International conference on machine learning, pages 2771–2780. PMLR.
ITU - R (2017). ITU-T Rec. P.10/G.100 (11/2017): Vocabulary for performance, quality of service and quality of experience. Recommendation P.10/G.100, International Telecommunication Union. [link]
ITU - T (1996). Methods for subjective determination of transmission quality. Recommendation P.800, International Telecommunication Union.
Jia, Y., Zhang, Y., Weiss, R. J., Wang, Q., Shen, J., Ren, F., Chen, Z., Nguyen, P., Pang, R., Moreno, I. L., and Wu, Y. (2019). Transfer learning from speaker verification to multispeaker text-to-speech synthesis.
Ju, Z., Wang, Y., Shen, K., Tan, X., Xin, D., Yang, D., Liu, Y., Leng, Y., Song, K., Tang, S., Wu, Z., Qin, T., Li, X.-Y., Ye, W., Zhang, S., Bian, J., He, L., Li, J., and Zhao, S. (2024). Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models.
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved variational inference with inverse autoregressive flow. Advances in neural information processing systems, 29.
Le Maguer, S., King, S., and Harte, N. (2024). The limits of the mean opinion score for speech synthesis evaluation. Computer Speech Language, 84:101577.
Ling, L., Fernandes Tavares, T., Barbosa, P., and Batista, N. (2018). Detecção automática de sotaques regionais brasileiros: A importância da validação cross-datasets.
Loizou, P. C. (2011). Speech Quality Assessment, pages 623–654. Springer Berlin Heidelberg, Berlin, Heidelberg.
Mitchell, D. C. (2004). On-line methods in language processing: introduction and historical review. In Carreiras, M. and Clifton Jr., C., editors, The On-line Study of Sentence Comprehension: Eyetracking, ERP and Beyond, pages 15–32. Psychology Press, New York.
Mota, J. A., Ribeiro, S. S. C., and de Oliveira, J. M. (2023). Atlas Linguístico Do Brasil: Comentários às Cartas Linguísticas I-V 3. Universidade Estadual de Londrina. Editora.
Nguyen, T.-N., Pham, N.-Q., and Waibel, A. (2023). Syntacc: Synthesizing multi-accent speech by weight factorization. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
Ren, Y., Hu, C., Tao, X., Zhao, Z., Zhang, X., Li, Q., Lei, L., Zhou, S., Liu, J., and Liu, S. (2021). Fastspeech 2: Fast and high-quality end-to-end text to speech. In International Conference on Learning Representations.
Ren, Y., Zhao, Z., Tan, X., Yi, J., Cheng, Y.-L., Yang, J., Qin, T., and Liu, T.-Y. (2022). Naturalspeech: End-to-end text to speech synthesis with human-level quality. In Advances in Neural Information Processing Systems.
Ribeiro, F., Florêncio, D., Zhang, C., and Seltzer, M. (2011). Crowdmos: An approach for crowdsourcing mean opinion score studies. In 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 2416–2419. IEEE.
Sellam, T., Bapna, A., Camp, J., Mackinnon, D., Parikh, A. P., and Riesa, J. (2023). Squeak: Measuring speech naturalness in many languages. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
Shen, K., Ju, Z., Tan, X., Liu, Y., Leng, Y., He, L., Qin, T., Zhao, S., and Bian, J. (2023). Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers.
Tan, X., Chen, J., Liu, H., Cong, J., Zhang, C., Liu, Y., Wang, X., Leng, Y., Yi, Y., He, L., Soong, F., Qin, T., Zhao, S., and Liu, T.-Y. (2022). Naturalspeech: End-to-end text to speech synthesis with human-level quality.
Ynoguti, C. A. (1999). Reconhecimento de Fala Contínua Utilizando Modelos Ocultos de Markov. PhD thesis, Unicamp.
Batista, N. A. R. (2019). Estudo sobre identificação automática de sotaques regionais brasileiros baseada em modelagens estatísticas e técnicas de aprendizado de máquina. Master’s thesis, Unicamp.
Cagliari, L. C. (1992). Prosódia: algumas funções dos supra-segmentos. Cadernos de estudos linguísticos, 23:137–151.
Casanova, E., Shulby, C., Gölge, E., Müller, N. M., de Oliveira, F. S., Junior, A. C., da Silva Soares, A., Aluisio, S. M., and Ponti, M. A. (2021). Sc-glowtts: an efficient zero-shot multi-speaker text-to-speech model.
Casanova, E., Weber, J., Shulby, C. D., Junior, A. C., Gölge, E., and Ponti, M. A. (2022). Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone. In International Conference on Machine Learning, pages 2709–2720. PMLR.
Caseli, H. M. and Nunes, M. G. V., editors (2024). Processamento de Linguagem Natural: Conceitos, Técnicas e Aplicações em Português. BPLN, 2 edition.
Choi, Y., Jung, Y., Suh, Y., and Kim, H. (2022). Learning to maximize speech quality directly using mos prediction for neural text-to-speech. IEEE Access, 10:52621–52629.
Cooper, E., Huang, W.-C., Tsao, Y., Wang, H.-M., Toda, T., and Yamagishi, J. (2024). A review on subjective and objective evaluation of synthetic speech. Acoustical Science and Technology, 45(4):161–183.
Hoogeboom, E., Van Den Berg, R., and Welling, M. (2019). Emerging convolutions for generative normalizing flows. In International conference on machine learning, pages 2771–2780. PMLR.
ITU - R (2017). ITU-T Rec. P.10/G.100 (11/2017): Vocabulary for performance, quality of service and quality of experience. Recommendation P.10/G.100, International Telecommunication Union. [link]
ITU - T (1996). Methods for subjective determination of transmission quality. Recommendation P.800, International Telecommunication Union.
Jia, Y., Zhang, Y., Weiss, R. J., Wang, Q., Shen, J., Ren, F., Chen, Z., Nguyen, P., Pang, R., Moreno, I. L., and Wu, Y. (2019). Transfer learning from speaker verification to multispeaker text-to-speech synthesis.
Ju, Z., Wang, Y., Shen, K., Tan, X., Xin, D., Yang, D., Liu, Y., Leng, Y., Song, K., Tang, S., Wu, Z., Qin, T., Li, X.-Y., Ye, W., Zhang, S., Bian, J., He, L., Li, J., and Zhao, S. (2024). Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models.
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. (2016). Improved variational inference with inverse autoregressive flow. Advances in neural information processing systems, 29.
Le Maguer, S., King, S., and Harte, N. (2024). The limits of the mean opinion score for speech synthesis evaluation. Computer Speech Language, 84:101577.
Ling, L., Fernandes Tavares, T., Barbosa, P., and Batista, N. (2018). Detecção automática de sotaques regionais brasileiros: A importância da validação cross-datasets.
Loizou, P. C. (2011). Speech Quality Assessment, pages 623–654. Springer Berlin Heidelberg, Berlin, Heidelberg.
Mitchell, D. C. (2004). On-line methods in language processing: introduction and historical review. In Carreiras, M. and Clifton Jr., C., editors, The On-line Study of Sentence Comprehension: Eyetracking, ERP and Beyond, pages 15–32. Psychology Press, New York.
Mota, J. A., Ribeiro, S. S. C., and de Oliveira, J. M. (2023). Atlas Linguístico Do Brasil: Comentários às Cartas Linguísticas I-V 3. Universidade Estadual de Londrina. Editora.
Nguyen, T.-N., Pham, N.-Q., and Waibel, A. (2023). Syntacc: Synthesizing multi-accent speech by weight factorization. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
Ren, Y., Hu, C., Tao, X., Zhao, Z., Zhang, X., Li, Q., Lei, L., Zhou, S., Liu, J., and Liu, S. (2021). Fastspeech 2: Fast and high-quality end-to-end text to speech. In International Conference on Learning Representations.
Ren, Y., Zhao, Z., Tan, X., Yi, J., Cheng, Y.-L., Yang, J., Qin, T., and Liu, T.-Y. (2022). Naturalspeech: End-to-end text to speech synthesis with human-level quality. In Advances in Neural Information Processing Systems.
Ribeiro, F., Florêncio, D., Zhang, C., and Seltzer, M. (2011). Crowdmos: An approach for crowdsourcing mean opinion score studies. In 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 2416–2419. IEEE.
Sellam, T., Bapna, A., Camp, J., Mackinnon, D., Parikh, A. P., and Riesa, J. (2023). Squeak: Measuring speech naturalness in many languages. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
Shen, K., Ju, Z., Tan, X., Liu, Y., Leng, Y., He, L., Qin, T., Zhao, S., and Bian, J. (2023). Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers.
Tan, X., Chen, J., Liu, H., Cong, J., Zhang, C., Liu, Y., Wang, X., Leng, Y., Yi, Y., He, L., Soong, F., Qin, T., Zhao, S., and Liu, T.-Y. (2022). Naturalspeech: End-to-end text to speech synthesis with human-level quality.
Ynoguti, C. A. (1999). Reconhecimento de Fala Contínua Utilizando Modelos Ocultos de Markov. PhD thesis, Unicamp.
Published
2024-11-17
How to Cite
ARAÚJO, Gustavo E. et al.
EyetrackingMOS: Proposal for an online evaluation method for speech synthesis models. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 15. , 2024, Belém/PA.
Anais [...].
Porto Alegre: Sociedade Brasileira de Computação,
2024
.
p. 87-96.
DOI: https://doi.org/10.5753/stil.2024.245424.
