A Review of End-to-End Architectures for Speech Synthesis

Lucy Anne Evangelista; Patrícia do Nascimento; Carlos Eduardo Elmadjian; Alfredo Vel Lejbman

Lucy Anne Evangelista USP
Patrícia do Nascimento USP
Carlos Eduardo Elmadjian USP
Alfredo Vel Lejbman USP

Abstract

This article presents a comparative literature review of speech synthesis architectures (Char2Wav, ClariNet, Tacotron, Tacotron 2, and DeepVoice 3), systematizing information on the features and capabilities of each. The comparison also covers the frameworks (TensorFlow, PyTorch, etc.) used to implement these architectures. Finally, it suggests a set of informational points that should be treated as relevant when comparing the available architectures.

References

Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.

Ping, W., Peng, K., and Chen, J. (2018). Clarinet: Parallel wave generation in end-to-end text-to-speech. arXiv preprint arXiv:1807.07281.

Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., and Miller, J. (2017). Deep voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654.

Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al. (2018). Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE.

Sotelo, J., Mehri, S., Kumar, K., Santos, J. F., Kastner, K., Courville, A., and Bengio, Y. (2017). Char2wav: End-to-end speech synthesis.

Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al. (2017). Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.