Disfluency Detection and Removal in Speech Transcriptions via Large Language Models
Abstract
The field of Automatic Speech Recognition (ASR) has expanded significantly across the technological landscape owing to its extensive use in sectors such as education, healthcare, and customer service. Many modern applications depend on analyzing spoken content through Speech-to-Text (STT) conversion models. However, the transcriptions these systems produce often contain undesirable elements, such as word repetitions and the prolongation of certain sounds, known as disfluencies or linguistic crutches. These elements can degrade the quality of automatic content analysis performed by Natural Language Processing (NLP) models, including those for named entity recognition, emotion detection, and sentiment analysis. This study therefore evaluates the feasibility of identifying and removing linguistic disfluencies with Large Language Models (LLMs), such as GPT-4, LLaMA, Claude, and Gemini, through prompt engineering techniques. The approach was tested on a corpus of debate transcriptions with manually annotated disfluency occurrences, yielding promising results.
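The prompt-engineering setup described above can be sketched as follows. This is a minimal illustration, not the study's actual protocol: the prompt wording, the `build_prompt` helper, and the rule-based baseline are hypothetical, included only to show what the LLM is asked to do and why simple rules fall short of it.

```python
import re

# Hypothetical prompt template illustrating the prompt-engineering approach;
# the exact instructions used in the study may differ.
PROMPT_TEMPLATE = (
    "You are an editor of speech transcriptions. Remove disfluencies "
    "(filler words, word repetitions, prolonged sounds) from the text "
    "below without changing its meaning.\n\n"
    "Transcription: {text}"
)

def build_prompt(transcription: str) -> str:
    """Fill the template with a raw transcription before sending it to an LLM."""
    return PROMPT_TEMPLATE.format(text=transcription)

# A simple rule-based baseline for comparison: strip common English fillers
# and collapse immediate word repetitions. Context-dependent disfluencies
# (restarts, self-corrections) escape such rules, motivating the use of LLMs.
FILLERS = re.compile(r"\b(um+|uh+|erm+)\b[,\s]*", flags=re.IGNORECASE)
REPEATS = re.compile(r"\b(\w+)(\s+\1\b)+", flags=re.IGNORECASE)

def rule_based_clean(transcription: str) -> str:
    text = FILLERS.sub("", transcription)      # drop filler words
    text = REPEATS.sub(r"\1", text)            # collapse "I I" -> "I"
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace
```

For example, `rule_based_clean("um, I I think think that's uh fine")` yields `"I think that's fine"`, but the baseline cannot handle restarts such as "we went, I mean, they went", which is where prompted LLMs are expected to help.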
References
Bach, N., & Huang, F. (2019). Noisy BiLSTM-Based Models for Disfluency Detection. In Interspeech (pp. 4230-4234). DOI: 10.21437/Interspeech.2019-1336
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. DOI: 10.48550/arXiv.2006.11477
Bassi, S., Duregon, G., Jalagam, S., & Roth, D. (2023). End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining. DOI: 10.48550/arXiv.2309.04516
Corley, M., & Stewart, O. W. (2008). Hesitation disfluencies in spontaneous speech: The meaning of um. Language and Linguistics Compass, 2(4), 589-602. DOI: 10.1111/j.1749-818X.2008.00068.x
Ferguson, J., Durrett, G., & Klein, D. (2015). Disfluency detection with a semi-Markov model and prosodic features. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 257-262). DOI: 10.3115/v1/N15-1029
Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. DOI: 10.48550/arXiv.2106.07447
Meta. (2024). Introducing LLaMA 3: Advancements in Large Language Models.
OpenAI, Achiam, J., Adler, S., et al. (2024). GPT-4 Technical Report. DOI: 10.48550/arXiv.2303.08774
OpenAI. (2024). OpenAI Tokenizer.
Romana, A., Koishida, K., & Provost, E. M. (2023). Automatic Disfluency Detection from Untranscribed Speech. DOI: 10.48550/arXiv.2311.00867
Snover, M., Dorr, B., & Schwartz, R. (2004). A lexically-driven algorithm for disfluency detection. In Proceedings of HLT-NAACL 2004: Short Papers (pp. 157-160).
Gemini Team, Georgiev, P., Lei, V. I., et al. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. DOI: 10.48550/arXiv.2403.05530
Zayats, V., Ostendorf, M., & Hajishirzi, H. (2016). Disfluency detection using a bidirectional LSTM. DOI: 10.48550/arXiv.1604.03209