A Review of End-to-End Architectures for Speech Synthesis (Uma Revisão de Arquiteturas Ponta a Ponta para Sintetização de Voz)

  • Lucy Anne Evangelista USP
  • Patrícia do Nascimento USP
  • Carlos Eduardo Elmadjian USP
  • Alfredo Vel Lejbman USP

Abstract


This work presents a comparative literature review of five speech synthesis architectures (Char2Wav, ClariNet, Tacotron, Tacotron 2, and DeepVoice 3), systematizing information on the resources and capabilities of each. The comparison also covers the frameworks (TensorFlow, PyTorch, etc.) used to implement these architectures. It concludes by suggesting several points of information that should be treated as relevant when comparing the available architectures.

References

Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Ping, W., Peng, K., and Chen, J. (2018). Clarinet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281.

Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., and Miller, J. (2017). Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654.

Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al. (2018). Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE.

Sotelo, J., Mehri, S., Kumar, K., Santos, J. F., Kastner, K., Courville, A., and Bengio, Y. (2017). Char2wav: End-to-end speech synthesis.

Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.
Published
2020-08-19
EVANGELISTA, Lucy Anne; DO NASCIMENTO, Patrícia; ELMADJIAN, Carlos Eduardo; VEL LEJBMAN, Alfredo. Uma Revisão de Arquiteturas Ponta a Ponta para Sintetização de Voz. In: REGIONAL SCHOOL OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING, 1., 2020, São Paulo. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 17-21.