Analysis of transcription tools for Brazilian Portuguese with focus on disfluency detection


Advances in technology and easier access to it have increased the demand for applications that interact through voice recognition, and multimedia content has become a valuable source for computational analysis. Vocal representations are extracted for a variety of purposes in areas such as convenience, accessibility, security and sentiment analysis. The main challenge of speech recognition lies in the variability of speakers, environments and devices, and in the presence of disfluencies in speech. These factors affect transcription tools, which are essential when the user interacts by voice and a text must be produced from that interaction. In particular, detecting disfluencies can help to identify aspects of the speaker's emotional state. This work presents an analysis of transcription tools with a focus on disfluency detection, covering the most commonly used evaluation metrics and the databases used in evaluations for Brazilian Portuguese. An experiment was conducted to evaluate the performance of three tools (IBM Watson, Google Speech and Vosk). Google Speech achieved the best performance, with an average Word Error Rate of 9.69% for fluent sentences and 17.15% for disfluent sentences, followed by IBM Watson with 11.86% and 24.44%, and Vosk with 14.39% and 22.56%, respectively.

Keywords: disfluencies, speech recognition, spoken dialogue, natural language processing, rich transcription
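For readers unfamiliar with the Word Error Rate reported in the abstract, the following is a minimal sketch of its standard definition: the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference. The function name, the lowercasing and whitespace tokenization, and the example sentences are assumptions made for illustration; the abstract does not state how the authors normalized text or which implementation they used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is lowercased and split on whitespace; the paper's
    own normalization steps are not described in the abstract.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the processed reference
    # prefix and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the transcript inserts "é" and drops "por".
reference = "eu quero um café por favor"
hypothesis = "eu quero é um café favor"
print(f"WER = {word_error_rate(reference, hypothesis):.2%}")
```

Because insertions count as errors, this ratio can exceed 100% when the hypothesis is much longer than the reference. An average over a test set, as reported above, could be taken per sentence or over pooled word counts; which of the two the authors used is not stated here.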


LUNA, Alana S.; MACHADO-LIMA, Ariane; NUNES, Fátima L. S. Analysis of transcription tools for Brazilian Portuguese with focus on disfluency detection. In: SIMPÓSIO BRASILEIRO SOBRE FATORES HUMANOS EM SISTEMAS COMPUTACIONAIS (IHC), 21., 2022, Diamantina. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022.