Disfluency Detection and Removal in Speech Transcriptions via Large Language Models
Abstract
The field of Automatic Speech Recognition (ASR) has expanded significantly across the technological landscape owing to its extensive use in sectors such as education, healthcare, and customer service. Many modern applications depend on analyzing spoken content through Speech-to-Text (STT) conversion models. However, the transcriptions these systems produce often contain undesirable elements, such as word repetitions and the prolongation of certain sounds, known as disfluencies or linguistic crutches. These elements can degrade the quality of automatic content analysis performed by Natural Language Processing (NLP) models, including those for named entity recognition, emotion detection, and sentiment analysis. This study therefore evaluates the feasibility of identifying and removing linguistic disfluencies with Large Language Models (LLMs), such as GPT-4, LLaMA, Claude, and Gemini, through prompt engineering techniques. The approach was tested on a corpus of debate transcriptions with manually annotated disfluency occurrences, yielding promising results.
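The prompt-engineering setup described above can be sketched as follows. This is a minimal illustration, not the study's actual protocol: the prompt wording, the `build_prompt` helper, and the rule-based baseline are hypothetical, included only to show what the LLM is asked to do and why simple rules fall short of it.

```python
import re

# Hypothetical prompt template illustrating the prompt-engineering approach;
# the exact instructions used in the study may differ.
PROMPT_TEMPLATE = (
    "You are an editor of speech transcriptions. Remove disfluencies "
    "(filler words, word repetitions, prolonged sounds) from the text "
    "below without changing its meaning.\n\n"
    "Transcription: {text}"
)

def build_prompt(transcription: str) -> str:
    """Fill the template with a raw transcription before sending it to an LLM."""
    return PROMPT_TEMPLATE.format(text=transcription)

# A simple rule-based baseline for comparison: strip common English fillers
# and collapse immediate word repetitions. Context-dependent disfluencies
# (restarts, self-corrections) escape such rules, motivating the use of LLMs.
FILLERS = re.compile(r"\b(um+|uh+|erm+)\b[,\s]*", flags=re.IGNORECASE)
REPEATS = re.compile(r"\b(\w+)(\s+\1\b)+", flags=re.IGNORECASE)

def rule_based_clean(transcription: str) -> str:
    text = FILLERS.sub("", transcription)      # drop filler words
    text = REPEATS.sub(r"\1", text)            # collapse "I I" -> "I"
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace
```

For example, `rule_based_clean("um, I I think think that's uh fine")` yields `"I think that's fine"`, but the baseline cannot handle restarts such as "we went, I mean, they went", which is where prompted LLMs are expected to help.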
References
Bach, N., & Huang, F. (2019). Noisy BiLSTM-Based Models for Disfluency Detection. In Interspeech (pp. 4230-4234). DOI: 10.21437/Interspeech.2019-1336
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. DOI: 10.48550/arXiv.2006.11477
Bassi, S., Duregon, G., Jalagam, S., & Roth, D. (2023). End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining. DOI: 10.48550/arXiv.2309.04516
Corley, M., & Stewart, O. W. (2008). Hesitation disfluencies in spontaneous speech: The meaning of um. Language and Linguistics Compass, 2(4), 589-602. DOI: 10.1111/j.1749-818X.2008.00068.x
Ferguson, J., Durrett, G., & Klein, D. (2015). Disfluency detection with a semi-Markov model and prosodic features. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 257-262). DOI: 10.3115/v1/N15-1029
Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. DOI: 10.48550/arXiv.2106.07447
Meta. (2024). Introducing LLaMA 3: Advancements in Large Language Models.
OpenAI, Achiam, J., Adler, S., et al. (2024). GPT-4 Technical Report. DOI: 10.48550/arXiv.2303.08774
OpenAI. (2024). OpenAI Tokenizer.
Romana, A., Koishida, K., & Provost, E. M. (2023). Automatic Disfluency Detection from Untranscribed Speech. DOI: 10.48550/arXiv.2311.00867
Snover, M., Dorr, B., & Schwartz, R. (2004). A lexically-driven algorithm for disfluency detection. In Proceedings of HLT-NAACL 2004: Short Papers (pp. 157-160).
Gemini Team, Georgiev, P., Lei, V. I., et al. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. DOI: 10.48550/arXiv.2403.05530
Zayats, V., Ostendorf, M., & Hajishirzi, H. (2016). Disfluency detection using a bidirectional LSTM. DOI: 10.48550/arXiv.1604.03209