Phonetic segmentation for Brazilian Portuguese based on a self-supervised model and forced-alignment
Abstract
A phonetic segmentation system for Brazilian Portuguese was developed using Wav2Vec2, a self-supervised learning framework. The study explores the application of Wav2Vec2 to the automatic determination of phonetic boundaries in speech signals. By leveraging the rich acoustic representations learned by Wav2Vec2, we aim to improve the accuracy of phonetic segmentation. The system's performance was compared with the Montreal Forced Aligner (MFA), showing notable effectiveness under various speech conditions, including neutral and expressive voices. Our methodology involves pre-processing of the phonetic transcriptions, use of the model for alignment, and post-processing to determine precise phonetic boundaries. The results indicate significant improvements in phonetic boundary detection, especially in challenging contexts such as expressive speech.
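The alignment and post-processing steps described above can be illustrated with TorchAudio's CTC forced-alignment utilities (Hwang et al., 2023). The sketch below is a minimal, hypothetical example and not the authors' implementation: the multilingual MMS_FA Wav2Vec2 checkpoint (Pratap et al., 2024), the input file "utterance.wav", and the character-level transcript for the word "casa" are assumptions made purely for illustration; the paper's actual model and Brazilian Portuguese phone inventory may differ.

# Sketch only: CTC forced alignment with a Wav2Vec2-style model via TorchAudio.
# The MMS_FA bundle, file name, and transcript are illustrative assumptions.
import torch
import torchaudio
import torchaudio.functional as F

bundle = torchaudio.pipelines.MMS_FA          # multilingual Wav2Vec2 aligner bundle
model = bundle.get_model(with_star=False)
dictionary = bundle.get_dict(star=None)       # label string -> CTC label index

waveform, sr = torchaudio.load("utterance.wav")            # hypothetical input file
waveform = F.resample(waveform, sr, bundle.sample_rate)

# Transcript already normalized to the model's symbol set (here: "casa" -> k a z a).
transcript = ["k", "a", "z", "a"]
targets = torch.tensor([[dictionary[p] for p in transcript]], dtype=torch.int32)

with torch.inference_mode():
    emission, _ = model(waveform)             # (1, frames, vocab) emission matrix
    log_probs = torch.log_softmax(emission, dim=-1)   # ensure normalized log-probabilities

# Viterbi forced alignment over the CTC lattice, then merge blank/repeated frames
# into one span (start frame, end frame, score) per target label.
aligned, scores = F.forced_align(log_probs, targets, blank=0)
spans = F.merge_tokens(aligned[0], scores[0].exp(), blank=0)

# Convert frame indices to seconds (post-processing into phonetic boundaries).
sec_per_frame = waveform.size(1) / log_probs.size(1) / bundle.sample_rate
for label, span in zip(transcript, spans):
    print(f"{label}\t{span.start * sec_per_frame:.3f}s\t{span.end * sec_per_frame:.3f}s")

The per-label spans give boundary times in seconds for each symbol of the transcript, which is the kind of output the post-processing step in the abstract refers to.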
Keywords:
Phonetic speech segmentation, Forced alignment, Self-supervised learning
References
Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S. (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST Speech Disc 1-1.1. NASA STI/Recon Technical Report N, 93:27403.
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376.
Hwang, J., Hira, M., Chen, C., Zhang, X., Ni, Z., Sun, G., Ma, P., Huang, R., Pratap, V., Zhang, Y., et al. (2023). TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–9. IEEE.
Kürzinger, L., Winkelbauer, D., Li, L., Watzel, T., and Rigoll, G. (2020). CTC-segmentation of large corpora for German end-to-end speech recognition. In International Conference on Speech and Computer, pages 267–278. Springer.
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., and Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Interspeech, volume 2017, pages 498–502.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-Zarandi, M., et al. (2024). Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97):1–52.
Schneider, S., Baevski, A., Collobert, R., and Auli, M. (2019). wav2vec: Unsupervised Pre-Training for Speech Recognition. In Proc. Interspeech 2019, pages 3465–3469.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Zeyer, A., Schlüter, R., and Ney, H. (2021). Why does CTC result in peaky behavior? arXiv preprint arXiv:2105.14849.
Published
17/11/2024
How to Cite
BUARQUE, Eduardo S. e S.; GOMES, Joel F. F.; TAVARES, Ubiratan da S.; ULIANI NETO, Mário; RUNSTEIN, Fernando O.; VIOLATO, Ricardo P. V.; LIMA, Marcus. Phonetic segmentation for Brazilian Portuguese based on a self-supervised model and forced-alignment. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 21., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 293-303. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2024.245057.