Vocal Recovery of Tracheostomized Patients Using Audio Processing and Machine Learning in a Zero-Shot Scenario
Abstract
This work proposes a method for voice reconstruction in individuals who have undergone laryngectomy, integrating advanced audio processing and machine learning techniques. The approach aims to restore features such as timbre, intonation, and prosody, which are often lost when using an electronic larynx, whose output is constrained to a constant fundamental frequency (F0). To address the lack of public datasets containing the voices of tracheostomized patients, a synthetic dataset was created to simulate the acoustic properties of these devices. The developed pipeline comprises three stages: (i) speech analysis, involving the extraction of linguistic content and style; (ii) mapping, which combines this information with the mel-spectrogram through techniques such as conditional modulation and diffusion networks, with a particular focus on Flow Matching; and (iii) reconstruction and synthesis, using high-fidelity vocoders. Experiments compared two preprocessing methods, timbre shifting and F0 fixation, evaluated in four training and testing combinations. Results show that the F0→F0 configuration outperformed the alternative in three of the four analyzed metrics (MCD of 444.04, LSD of 0.47, and PSNR of 42.27), suggesting that F0 fixation favors voice reconstruction that more closely matches the original signal. These findings highlight the potential of the proposed approach to improve the naturalness and intelligibility of synthesized speech for laryngectomized patients.
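The spectral metrics mentioned above can be computed directly on magnitude (or mel) spectrograms. The sketch below is illustrative only, assuming a common formulation of log-spectral distance (frame-wise RMS of log-power differences) and of PSNR applied to a spectrogram treated as an image; the exact definitions used in the experiments may differ (e.g., scaling constants or the reference peak value).

```python
import numpy as np

def log_spectral_distance(ref_mag, est_mag, eps=1e-8):
    """LSD: RMS difference of log-power spectra per frame, averaged over frames.

    ref_mag, est_mag: 2-D arrays of shape (frames, bins) with magnitudes.
    """
    log_ref = np.log10(np.maximum(ref_mag, eps) ** 2)
    log_est = np.log10(np.maximum(est_mag, eps) ** 2)
    per_frame = np.sqrt(np.mean((log_ref - log_est) ** 2, axis=1))
    return float(np.mean(per_frame))

def psnr(ref, est):
    """PSNR in dB, using the reference's peak magnitude as the signal peak."""
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    peak = np.max(np.abs(ref))
    return float(10.0 * np.log10(peak ** 2 / mse))
```

A higher PSNR and a lower LSD both indicate a reconstruction closer to the reference spectrogram, which is how the F0→F0 configuration is compared against the alternatives above.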