Enhancing Robustness in Audio Deepfake Detection for VR Applications Using Data Augmentation and Mixup

Gustavo dos Reis Oliveira; Rafaello Virgilli; Lucas Alcântara Souza; Lucas Stefanel Gris; Evellyn Nicole Machado Rosa; Isadora Stéfany Rezende Remigio Mesquita; Daniel Tunnermann; Arlindo Rodrigues Galvão Filho

Gustavo dos Reis Oliveira, AKCIT, Universidade Federal de Goiás, https://orcid.org/0009-0001-2301-7310
Rafaello Virgilli, AKCIT, Universidade Federal de Goiás, https://orcid.org/0009-0002-5040-5869
Lucas Alcântara Souza, AKCIT, Universidade Federal de Goiás, https://orcid.org/0009-0005-5412-0192
Lucas Stefanel Gris, AKCIT, Universidade Federal de Goiás, https://orcid.org/0000-0002-2099-5004
Evellyn Nicole Machado Rosa, AKCIT, Universidade Federal de Goiás, https://orcid.org/0009-0004-8078-1376
Isadora Stéfany Rezende Remigio Mesquita, AKCIT, Universidade Federal de Goiás, https://orcid.org/0009-0008-1469-2497
Daniel Tunnermann, AKCIT, Universidade Federal de Goiás, https://orcid.org/0009-0009-5541-7069
Arlindo Rodrigues Galvão Filho, AKCIT, Universidade Federal de Goiás, https://orcid.org/0000-0003-2151-8039

Abstract

The rapid advancement of virtual reality (VR) technology has heightened the need for robust and reliable deepfake audio detection to ensure the authenticity and integrity of virtual interactions. Although current state-of-the-art models show promising results, they are often overconfident, which can lead to poor generalization and reduced effectiveness against novel or slightly altered deepfake attacks. In this work, we investigate data augmentation and Mixup as ways to increase the diversity of training data and improve the generalization of deepfake audio detection models. Mixup creates new training examples by combining pairs of existing examples, promoting smoother and more robust decision boundaries, while data augmentation creates new training examples by altering a sample with a given probability. Our results demonstrate that applying these techniques to the Wav2vec 2.0 model significantly improves its generalization ability, leading to more reliable deepfake detection in VR environments.
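The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Beta parameter alpha=0.2, and the choice of Gaussian noise as the augmentation are all assumptions made for illustration, since the abstract does not state which augmentations or hyperparameters were used. It relies only on the Python standard library; a real pipeline would operate on batched tensors.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with one Beta-sampled weight.

    alpha is a hypothetical default, not a value taken from the paper.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


def augment_with_probability(wave, p=0.5, noise_std=0.005, rng=random):
    """With probability p, add Gaussian noise; otherwise return an unchanged copy.

    Gaussian noise stands in for whatever augmentation is actually applied.
    """
    if rng.random() < p:
        return [s + rng.gauss(0.0, noise_std) for s in wave]
    return list(wave)


# Toy waveforms (four samples each) with one-hot labels [bona fide, spoofed].
rng = random.Random(0)
bona_fide = [0.1, -0.2, 0.3, 0.0]
spoofed = [0.5, 0.4, -0.1, 0.2]

x_mix, y_mix, lam = mixup_pair(bona_fide, [1.0, 0.0], spoofed, [0.0, 1.0], rng=rng)
noisy = augment_with_probability(bona_fide, p=1.0, rng=rng)
untouched = augment_with_probability(bona_fide, p=0.0, rng=rng)
```

Because the mixed label is a soft target such as [lam, 1 - lam] rather than a hard 0 or 1, the classifier is discouraged from producing extreme confidence on any single example, which is the intuition behind the smoother decision boundaries mentioned above.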

Keywords: Deepfake Detection, Audio Classification, Machine Learning, Feature Abstraction, Mixup
