Phonetic segmentation for Brazilian Portuguese based on a self-supervised model and forced-alignment
Abstract
A phonetic segmentation system for Brazilian Portuguese was developed using Wav2Vec2, a self-supervised learning framework for speech representations. The study explores the application of Wav2Vec2 to the automatic detection of phonetic boundaries in speech signals: by leveraging the rich acoustic representations the model learns, we aim to improve the accuracy of phonetic segmentation. The system was compared with the Montreal Forced Aligner (MFA) and proved effective across varied speech conditions, including neutral and expressive voices. Our methodology involves preprocessing the phonetic transcriptions, aligning them to the audio with the model, and post-processing the alignment to determine precise phonetic boundaries. Results indicate significant advances in phonetic boundary detection, especially in challenging contexts such as expressive speech.
Keywords:
Phonetic segmentation, forced-alignment, self-supervised learning
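The three-step pipeline named in the abstract (preprocess the transcription, align it to the audio with a CTC-based Wav2Vec2 model, post-process frame indices into time stamps) can be sketched with TorchAudio's forced-alignment API (Hwang et al., 2023). This is a minimal illustration, not the authors' implementation: it assumes the multilingual MMS_FA checkpoint (Pratap et al., 2024) in place of the paper's Brazilian Portuguese model, and the file name "utterance.wav" and the transcript are placeholders.

import torch
import torchaudio
import torchaudio.functional as F

# Load a CTC acoustic model and its token dictionary. MMS_FA is an assumed
# stand-in for the paper's Brazilian Portuguese Wav2Vec2 model.
bundle = torchaudio.pipelines.MMS_FA
model = bundle.get_model(with_star=False)
dictionary = bundle.get_dict(star=None)          # token -> index
idx_to_token = {v: k for k, v in dictionary.items()}

# Hypothetical input: one utterance and its (romanized) transcription.
waveform, sr = torchaudio.load("utterance.wav")
waveform = F.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emission, _ = model(waveform)                # (1, frames, vocab) log-probabilities

# Pre-processing: turn the transcription into a sequence of token indices.
transcript = "bom dia"
tokens = [dictionary[c] for c in transcript.replace(" ", "")]
targets = torch.tensor([tokens], dtype=torch.int32)

# Alignment: Viterbi-style forced alignment over the CTC emission matrix.
aligned, scores = F.forced_align(emission, targets, blank=0)

# Post-processing: merge repeated frame labels and blanks into token spans,
# then convert frame indices into seconds.
spans = F.merge_tokens(aligned[0], scores[0].exp(), blank=0)
frame_to_sec = waveform.size(1) / emission.size(1) / bundle.sample_rate
for span in spans:
    print(f"{idx_to_token[span.token]}"
          f"\t{span.start * frame_to_sec:.3f}s"
          f"\t{span.end * frame_to_sec:.3f}s")

In the actual system, the input would be a phonetic rather than orthographic transcription, and the per-token spans would be emitted as phone boundary time stamps.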
References
Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460.
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., and Pallett, D. S. (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93:27403.
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376.
Hwang, J., Hira, M., Chen, C., Zhang, X., Ni, Z., Sun, G., Ma, P., Huang, R., Pratap, V., Zhang, Y., et al. (2023). TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–9. IEEE.
Kürzinger, L., Winkelbauer, D., Li, L., Watzel, T., and Rigoll, G. (2020). CTC-segmentation of large corpora for German end-to-end speech recognition. In International Conference on Speech and Computer, pages 267–278. Springer.
McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., and Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Interspeech, volume 2017, pages 498–502.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-Zarandi, M., et al. (2024). Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research, 25(97):1–52.
Schneider, S., Baevski, A., Collobert, R., and Auli, M. (2019). wav2vec: Unsupervised pre-training for speech recognition. In Proc. Interspeech 2019, pages 3465–3469.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Zeyer, A., Schlüter, R., and Ney, H. (2021). Why does CTC result in peaky behavior? arXiv preprint arXiv:2105.14849.
Published
2024-11-17
How to Cite
BUARQUE, Eduardo S. e S.; GOMES, Joel F. F.; TAVARES, Ubiratan da S.; ULIANI NETO, Mário; RUNSTEIN, Fernando O.; VIOLATO, Ricardo P. V.; LIMA, Marcus. Phonetic segmentation for Brazilian Portuguese based on a self-supervised model and forced-alignment. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 21., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 293-303. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2024.245057.
