EyetrackingMOS: Proposal for an online evaluation method for speech synthesis models

Gustavo E. Araújo; Julio C. Galdino; Rodrigo de F. Lima; Leonardo Ishida; Gustavo W. Lopes; Miguel Oliveira Jr.; Arnaldo Cândido Jr.; Sandra M. Aluísio; Moacir A. Ponti

doi:10.5753/stil.2024.245424

Gustavo E. Araújo USP http://orcid.org/0009-0004-5661-4789
Julio C. Galdino USP https://orcid.org/0000-0001-6378-4648
Rodrigo de F. Lima USP
Leonardo Ishida USP
Gustavo W. Lopes USP
Miguel Oliveira Jr. UFAL https://orcid.org/0000-0002-0866-0535
Arnaldo Cândido Jr. UNESP https://orcid.org/0000-0002-5647-0891
Sandra M. Aluísio USP https://orcid.org/0000-0001-5108-2630
Moacir A. Ponti USP https://orcid.org/0000-0003-2059-9463

Abstract

Evaluating Text-To-Speech (TTS) systems is challenging: as synthesis quality rises, it becomes harder to discriminate how well models reproduce prosodic attributes, especially for Brazilian Portuguese. Offline evaluation metrics do not capture evaluators' genuine reactions to audio stimuli. We therefore propose an online evaluation method based on eye tracking. Experiments with 76 annotators indicate a reasonable correlation between EyetrackingMOS and MOS, as well as a reduction in total evaluation time. We therefore believe this metric provides accurate and potentially fast information to complement existing evaluation methods.

Keywords: Evaluation of speech synthesis models, Portuguese language, spontaneous speech, eye tracking

