Assessing the Naturalness of Text-to-Speech Models for Brazilian Portuguese NPC Dialogue
Abstract
Believable Non-Player Character (NPC) voices are critical for creating immersive experiences in Virtual Reality (VR). Traditional voice acting, however, is costly and does not scale to extensive dialogue and localization, particularly for languages such as Brazilian Portuguese. While Text-to-Speech (TTS) offers a viable alternative, the perceptual naturalness of modern models has not been systematically evaluated on domain-specific NPC dialogue in Portuguese. This paper addresses that gap through a subjective evaluation of five state-of-the-art TTS models: XTTS, Parler TTS, F5-TTS, Fish Speech, and Orpheus. Two test sets were created by translating dialogues from the LIGHT and CHiME-6 datasets into Brazilian Portuguese. The naturalness of the synthesized audio was assessed in a Mean Opinion Score (MOS) user study with 13 native speakers. Results indicate moderate overall naturalness (average MOS 3.29), with Parler TTS and XTTS emerging as the top performers (MOS 3.53). These findings provide a benchmark for developers creating localized VR content and show that truly expressive NPC voices remain a significant research challenge.
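As background for the reported numbers, MOS studies of this kind ask listeners to rate each clip on a 1–5 naturalness scale and then average the ratings per model, usually alongside a confidence interval. The Python sketch below illustrates only that aggregation step; the function name, data layout, and example ratings are hypothetical and not taken from the paper's actual evaluation pipeline.

```python
import math
from collections import defaultdict

def mos_with_ci(ratings, z=1.96):
    """Aggregate (model, score) pairs into per-model MOS with a 95% CI."""
    by_model = defaultdict(list)
    for model, score in ratings:
        by_model[model].append(score)
    results = {}
    for model, scores in by_model.items():
        n = len(scores)
        mean = sum(scores) / n
        # Sample variance; a normal-approximation interval is the usual choice
        # when reporting MOS over many listener ratings.
        var = sum((s - mean) ** 2 for s in scores) / (n - 1) if n > 1 else 0.0
        half_width = z * math.sqrt(var / n)
        results[model] = (round(mean, 2), round(half_width, 2))
    return results

# Fabricated example ratings on the 1-5 naturalness scale:
print(mos_with_ci([("XTTS", 4), ("XTTS", 3), ("Parler TTS", 4), ("Parler TTS", 3)]))
```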
Keywords:
Virtual Reality, Text-to-Speech, NPC, Speech Synthesis, Brazilian Portuguese
References
Aylett, R. and Louchart, S. Towards a narrative theory of virtual reality. Virtual Reality, 7:2–9, 2003.
Barker, J., Watanabe, S., Vincent, E., and Trmal, J. The Fifth ‘CHiME’ Speech Separation and Recognition Challenge: Dataset, Task and Baselines. In Proc. Interspeech 2018, 1561–1565, 2018.
Bhosale, S., Yang, H., Kanojia, D., Deng, J., and Zhu, X. AV-GS: Learning material and geometry aware priors for novel view acoustic synthesis. arXiv preprint arXiv:2406.08920, 2024.
Brogni, A., Slater, M., Steed, A., et al. More breaks less presence. In Presence 2003: The 6th Annual International Workshop on Presence, 1–4, 2003.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
Carvalho, V. M. and Rodrigues, M. A. F. Investigating and comparing the perceptions of voice interaction in digital games: Opportunities for health and wellness applications. In 2023 IEEE 11th International Conference on Serious Games and Applications for Health (SeGAH), 1–8, 2023.
Casanova, E., Davis, K., Gölge, E., Göknar, G., Gulea, I., Hart, L., Aljafari, A., Meyer, J., Morais, R., Olayemi, S., et al. XTTS: A massively multilingual zero-shot text-to-speech model. arXiv preprint arXiv:2406.04904, 2024.
Chen, Y., Niu, Z., Ma, Z., Deng, K., Wang, C., Zhao, J., Yu, K., and Chen, X. F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching. arXiv preprint arXiv:2410.06885, 2024.
Collins, K. Game Sound: An Introduction to the History, Theory, and Practice of Video Game Music and Sound Design. MIT Press, 2008.
Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Synnaeve, G., Adi, Y., and Défossez, A. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36:47704–47720, 2023.
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
Dubey, A., Jauhri, A., Pandey, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Fei, Z., Fan, M., Yu, C., and Huang, J. Flux that plays music. arXiv preprint arXiv:2409.00587, 2024.
Gao, Y., Dai, Y., Zhang, G., Guo, H., Mostajeran, F., Zheng, B., and Yu, T. Trust in virtual agents: Exploring the role of stylization and voice. IEEE Transactions on Visualization and Computer Graphics, 2025.
Urbanek, J., Fan, A., et al. Learning to speak and act in a fantasy text adventure game. arXiv preprint arXiv:1903.03094, 2019.
Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. Perceiver: General perception with iterative attention. In International Conference on Machine Learning, 4651–4664, 2021.
Kobayashi, M., Ueno, K., and Ise, S. The effects of spatialized sounds on the sense of presence in auditory virtual environments: A psychological and physiological study. Presence: Teleoperators and Virtual Environments, 24(2):163–174, 2015.
Kumar, R., Seetharaman, P., Luebs, A., Kumar, I., and Kumar, K. High-fidelity audio compression with improved RVQGAN. Advances in Neural Information Processing Systems, 36:27980–27993, 2023.
Kurucz, P., Serafin, S., Chitale, V., Klein, E., and Baghaei, N. Investigating the impact of avatar action sounds on the plausibility illusion in virtual reality. In 2025 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), 636–642, 2025.
Liao, S., Lan, S., and Zachariah, A. G. EVA-GAN: Enhanced various audio generation via scalable generative adversarial networks. arXiv preprint arXiv:2402.00892, 2024.
Liao, S., Wang, Y., Li, T., Cheng, Y., Zhang, R., Zhou, R., and Xing, Y. Fish-Speech: Leveraging large language models for advanced multilingual text-to-speech synthesis. arXiv preprint arXiv:2411.01156, 2024.
Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
Lyth, D. and King, S. Natural language guidance of high-fidelity text-to-speech with synthetic annotations. arXiv preprint arXiv:2402.01912, 2024.
Maempel, H.-J. and Horn, M. Audiovisual perception of real and virtual rooms. Journal of Virtual Reality and Broadcasting, 14, 2017.
Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4195–4205, 2023.
Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y. FastSpeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems, 32, 2019.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695, 2022.
Oliveira, F. S., Casanova, E., Junior, A. C., Gris, L. R. S., Soares, A. S., and Galvão Filho, A. R. Evaluation of speech representations for MOS prediction. In International Conference on Text, Speech, and Dialogue, 270–282, 2023.
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783, 2018.
Shirali-Shahreza, S. How should we define voice naturalness? In Proceedings of the Sixteenth International Conference on Advances in Computer-Human Interactions (ACHI), 235–239, 2023.
Siuzdak, H., Grötschla, F., and Lanzendörfer, L. A. SNAC: Multi-scale neural audio codec. arXiv preprint arXiv:2410.14411, 2024.
Slater, M., Banakou, D., Beacco, A., Gallego, J., Macia-Varela, F., and Oliva, R. A separate reality: An update on place illusion and plausibility in virtual reality. Frontiers in Virtual Reality, 3:914392, 2022.
Slater, M., Usoh, M., and Steed, A. Depth of presence in virtual environments. Presence: Teleoperators & Virtual Environments, 3(2):130–144, 1994.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2256–2265, 2015.
Southwick, H. The tongue can paint what the eyes cannot see: The voice actor and world-building in videogames. Voice and Speech Review, 16(1):33–43, 2022.
Suarjaya, I. A new algorithm for data compression optimization. arXiv preprint arXiv:1209.1045, 2012.
Canopy Labs Team. Orpheus TTS, 2025.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
Västfjäll, D. The subjective sense of presence, emotion recognition, and experienced emotions in auditory virtual environments. CyberPsychology & Behavior, 6(2):181–188, 2003.
Watanabe, S., Mandel, M., Barker, J., Vincent, E., Arora, A., Chang, X., Khudanpur, S., Manohar, V., Povey, D., Raj, D., et al. CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings. In CHiME 2020: 6th International Workshop on Speech Processing in Everyday Environments, 2020.
Weiss, R. J., Xiao, Y., Clark, R., Stanton, D., Skerry-Ryan, R., Shor, J., Wang, Y., Battenberg, E., and Saurous, R. Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. In International Conference on Machine Learning, 4700–4709, 2018.
Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I. S., and Xie, S. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16133–16142, 2023.
Zeghidour, N., Luebs, A., Omran, A., Skoglund, J., and Tagliasacchi, M. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
Published
30/09/2025
How to Cite
FERRO FILHO, Alexandre Costa; SOUSA, Rafael Teixeira; ROSA, Augusto Seben da; MENDES, Leticia Lima; PINTO, Paula Leandra Loeblein; GALVÃO FILHO, Arlindo Rodrigues. Assessing the Naturalness of Text-to-Speech Models for Brazilian Portuguese NPC Dialogue. In: SIMPÓSIO DE REALIDADE VIRTUAL E AUMENTADA (SVR), 27., 2025, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 423-428.
