Punctuation Restoration in Translated pt-BR Texts from Audio Transcriptions
Abstract
This work proposes a model for automatically restoring punctuation in colloquial Brazilian Portuguese texts derived from audio transcriptions, with the aim of improving readability and usability in NLP tasks. The methodology has two main stages: training and inference. The approach uses the IWSLT (2014–2016) corpus, which contains translated TED talk transcriptions, and a hybrid model that combines a Bi-LSTM, self-attention, and a CRF. Preprocessing includes punctuation mapping, vocabulary construction, GloVe embeddings, and a highway network. Four evaluation scenarios were applied, showing that combining the three years of data yields the best results, with progressive improvements in precision, recall, and F1-score.
References
Caseli, H. M. and Nunes, M. G. V., editors (2023). Processamento de Linguagem Natural: Conceitos, Técnicas e Aplicações em Português. BPLN.
Chordia, V. (2021). Punktuator: A multilingual punctuation restoration system for spoken and written text. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 312–320. Association for Computational Linguistics.
de Lima, T. B., Rodrigues, L., Macario, V., Freitas, E., and Mello, R. F. (2023). Automatic punctuation verification of school students’ essay in Portuguese. In Encontro Nacional de Inteligência Artificial e Computacional (ENIAC), pages 58–70. SBC.
de Lima, T. B., Rolim, V., Nascimento, A. C., Miranda, P., Macario, V., Rodrigues, L., Freitas, E., Gašević, D., and Mello, R. F. (2024). Towards explainable automatic punctuation restoration for Portuguese using transformers. Expert Systems with Applications, 257:125097.
Gris, L. R. S., Marcacini, R., Junior, A. C., Casanova, E., Soares, A., and Aluísio, S. M. (2023). Evaluating OpenAI’s Whisper ASR for punctuation prediction and topic modeling of life histories of the Museum of the Person.
Guerreiro, N. M., Rei, R., and Batista, F. (2021). Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts. Expert Systems with Applications, 186:115740.
Lima, T. B. D., Miranda, P., Mello, R. F., Wenceslau, M., Bittencourt, I. I., Cordeiro, T. D., and José, J. (2022). Sequence labeling algorithms for punctuation restoration in Brazilian Portuguese texts. In 2022 11th Brazilian Conference (BRACIS), pages 616–630.
Moura, B. C. D., de S. Sales, A. G., de S. Linhares, J. E. B., Barbosa, F. M. D., and Neto, A. A. (2025). Avaliação in-domain e cross-domain em restauração de pontuação utilizando processamento de linguagem natural. Anais do Computer on the Beach, 16:45–52.
Olive, J., Christianson, C., and McCary, J. (2011). Handbook of natural language processing and machine translation: DARPA global autonomous language exploitation. Springer Science & Business Media.
Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP-2014), volume 12, pages 1532–1543.
Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015). Highway networks. arXiv preprint arXiv:1505.00387.
Published
2025-09-29
How to Cite
SALES, Angel G. de S.; MOURA, Brenda C. D.; LINHARES, José E. B. de S.; BARBOSA, Fabiann M. D. Punctuation Restoration in Translated pt-BR Texts from Audio Transcriptions. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 374-384. DOI: https://doi.org/10.5753/stil.2025.37839.
