Synthetic AI Data Pipeline for Domain-Specific Speech-to-Text Solutions
Abstract
In this article, we propose a pipeline for fine-tuning domain-specific Speech-to-Text (STT) models using synthetic data generated by Artificial Intelligence (AI). Our methodology eliminates the need for manually labelled audio data, which is expensive and difficult to obtain, by generating domain-specific data with a Large Language Model (LLM) combined with multiple Text-to-Speech (TTS) solutions. We applied our pipeline to the radiology domain and compared the results against approaches with different levels of domain-specific data availability, ranging from the total absence of domain-specific data to the use of only high-quality domain-specific data (ground truth). Our approach improved the accuracy of the baseline by 40.19% and 10.63% for the WhisperX Tiny and Small models, respectively; although these results fall short of those obtained with the ground truth, they show that good results can be achieved with minimal cost and effort. Finally, the analysis of the results offers insight into the effort required to achieve good results depending on the availability of real data.
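To make the pipeline concrete, the sketch below illustrates the two data-generation stages the abstract describes: an LLM produces domain-specific sentences, and a TTS system synthesizes matching audio, yielding (audio, transcript) pairs for STT fine-tuning. This is a minimal illustration, not the authors' implementation: it assumes the `openai` Python client and the Coqui `TTS` library, and the model names, prompt wording, and file paths are placeholders; the paper uses multiple TTS solutions, while a single voice is used here for brevity.

```python
# Minimal sketch of the synthetic-data pipeline: LLM text -> TTS audio -> training pairs.
# Assumptions: openai client (pip install openai), Coqui TTS (pip install TTS);
# model choices and the prompt are illustrative, not the paper's configuration.
from pathlib import Path

from openai import OpenAI  # OpenAI-compatible LLM client
from TTS.api import TTS    # Coqui TTS

OUT_DIR = Path("synthetic_radiology")
OUT_DIR.mkdir(exist_ok=True)

# Stage 1: prompt an LLM for domain-specific sentences (radiology, in our case).
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # assumed model; any instruction-tuned LLM works
    messages=[{
        "role": "user",
        "content": "Write 20 short sentences a radiologist might dictate "
                   "in a report, one per line, without numbering.",
    }],
)
sentences = [s.strip() for s in response.choices[0].message.content.splitlines()
             if s.strip()]

# Stage 2: synthesize each sentence with a TTS voice; using several TTS
# models/voices here would add the acoustic variety the pipeline relies on.
tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")
manifest = []
for i, text in enumerate(sentences):
    wav_path = OUT_DIR / f"utt_{i:04d}.wav"
    tts.tts_to_file(text=text, file_path=str(wav_path))
    manifest.append({"audio": str(wav_path), "text": text})

# Stage 3 (not shown): feed the (audio, transcript) pairs in `manifest` to a
# standard Whisper fine-tuning loop, e.g. Hugging Face Seq2SeqTrainer with
# WhisperForConditionalGeneration, in place of manually labelled recordings.
print(f"Generated {len(manifest)} synthetic training pairs in {OUT_DIR}/")
```

The key design point is that only the text generation needs domain knowledge; the audio side is domain-agnostic, which is what lets the pipeline sidestep manual audio labelling entirely.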