A Review of End-to-End Architectures for Voice Synthesis
Abstract
This work presents a comparative bibliographic study of end-to-end voice synthesis architectures (Char2Wav, ClariNet, Tacotron, Tacotron 2, and DeepVoice 3), systematizing information about the resources and capacity of each architecture. The comparison also covers the frameworks (TensorFlow, PyTorch, etc.) used to implement them. The study closes by suggesting points of information that should be treated as relevant when comparing the available architectures.
References
Ping, W., Peng, K., and Chen, J. (2018). Clarinet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281.
Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., and Miller, J. (2017). Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654.
Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al. (2018). Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE.
Sotelo, J., Mehri, S., Kumar, K., Santos, J. F., Kastner, K., Courville, A., and Bengio, Y. (2017). Char2wav: End-to-end speech synthesis.
Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.
