Text-Dependent Speech Biometrics - Evaluation of pre-trained ECAPA-TDNN and Wav2vec models with the BioCPqD and RedDots databases
Abstract
This work addresses the challenge of text-dependent voice biometrics by evaluating different databases and classification models. We use pre-trained models based on the ECAPA-TDNN and Wav2vec architectures and apply them to the BioCPqD and RedDots databases. The results show that the error rates are quite low for both databases. It can also be observed that the Wav2vec model performed far worse than the ECAPA-TDNN model.
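For illustration, the sketch below (not the authors' code) shows how verification trials could be scored with a publicly available pre-trained ECAPA-TDNN model from SpeechBrain and how an equal error rate (EER) could be computed from the resulting scores. The model identifier, file names, and trial list are assumptions made for this example only; the paper's actual evaluation setup may differ.

    # Minimal sketch: score text-dependent verification trials with a
    # pre-trained ECAPA-TDNN model (SpeechBrain) and compute the EER.
    # The checkpoint below is the public VoxCeleb-trained model, assumed
    # here for illustration; the trial wav files are hypothetical.
    import numpy as np
    from speechbrain.inference.speaker import SpeakerRecognition

    verifier = SpeakerRecognition.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb",
        savedir="pretrained_models/spkrec-ecapa-voxceleb",
    )

    def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
        """EER: operating point where false acceptance equals false rejection.
        labels: 1 for target trials (same speaker and same phrase), else 0."""
        thresholds = np.sort(scores)
        far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
        frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
        idx = int(np.argmin(np.abs(far - frr)))
        return float((far[idx] + frr[idx]) / 2.0)

    # Hypothetical trial list: (enrollment wav, test wav, is_target).
    trials = [
        ("spk001_phrase01_enroll.wav", "spk001_phrase01_test.wav", 1),
        ("spk001_phrase01_enroll.wav", "spk002_phrase01_test.wav", 0),
    ]
    scores, labels = [], []
    for enroll_wav, test_wav, is_target in trials:
        score, _decision = verifier.verify_files(enroll_wav, test_wav)
        scores.append(float(score))  # cosine similarity between embeddings
        labels.append(is_target)

    print(f"EER: {equal_error_rate(np.array(scores), np.array(labels)):.2%}")

Note that in a text-dependent protocol a trial counts as a target only when both the speaker and the spoken pass-phrase match the enrollment, which is how the hypothetical trial list above is labeled.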
Keywords:
voice biometrics, speaker recognition, text-dependent speaker verification
References
Chen, S., Wang, C., Chen, Z., Wu, Y., Liu, S., Chen, Z., Li, J., Kanda, N., Yoshioka, T., Xiao, X., Wu, J., Zhou, L., Ren, S., Qian, Y., Qian, Y., Wu, J., Zeng, M., and Wei, F. (2021). WavLM: Large-scale self-supervised pre-training for full stack speech processing. arXiv preprint.
Chowdhury, F. A. R. R., Wang, Q., Moreno, I. L., and Wan, L. (2018). Attention-based models for text-dependent speaker verification. arXiv preprint arXiv:1710.10470.
Desplanques, B., Thienpondt, J., and Demuynck, K. (2020). ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proc. Interspeech 2020.
Jahangir, R., Teh, Y. W., Nweke, H. F., Mujtaba, G., Al-Garadi, M. A., and Ali, I. (2021). Speaker identification through artificial intelligence techniques: A comprehensive review and research challenges. Expert Systems with Applications, 171.
Jain, A. K., Flynn, P., and Ross, A. A. (2008). Handbook of Biometrics. Springer.
Jakubec, M., Jarina, R., Lieskovska, E., and Kasak, P. (2024). Deep speaker embeddings for speaker verification: Review and experimental comparison. Engineering Applications of Artificial Intelligence, 127:107232.
Larcher, A., Lee, K. A., Ma, B., and Li, H. (2012). RSR2015: Database for text-dependent speaker verification using multiple pass-phrases. In Annual Conference of the International Speech Communication Association (Interspeech 2012), Portland, United States.
Lee, K. A., Larcher, A., Wang, G., Kenny, P., Brummer, N., van Leeuwen, D., Aronowitz, H., Kockmann, M., Vaquero, C., Ma, B., Li, H., Stafylakis, T., Alam, J., Swart, A., and Perez, J. (2015). The RedDots data collection for speaker recognition. In Proc. Interspeech 2015, Dresden, Germany.
Li, L., Liu, R., Kang, J., Fan, Y., Cui, H., Cai, Y., Vipperla, R., Zheng, T. F., and Wang, D. (2022). CN-Celeb: Multi-genre speaker recognition. Speech Communication.
Nagrani, A., Chung, J. S., Xie, W., and Zisserman, A. (2019). VoxCeleb: Large-scale speaker verification in the wild. Computer Speech and Language.
Qin, X., Bu, H., and Li, M. (2019). HI-MIA: A far-field text-dependent speaker verification database and the baselines.
Ravanelli, M., Parcollet, T., Moumen, A., de Langen, S., Subakan, C., Plantinga, P., Wang, Y., Mousavi, P., Libera, L. D., Ploujnikov, A., Paissan, F., Borra, D., Zaiem, S., Zhao, Z., Zhang, S., Karakasidis, G., Yeh, S.-L., Champion, P., Rouhe, A., Braun, R., Mai, F., Zuluaga-Gomez, J., Mousavi, S. M., Nautsch, A., Liu, X., Sagar, S., Duret, J., Mdhaffar, S., Laperriere, G., Rouvier, M., Mori, R. D., and Esteve, Y. (2024). Open-source conversational AI with SpeechBrain 1.0.
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5329-5333.
Tu, Y., Lin, W., and Mak, M.-W. (2022). A survey on text-dependent and text-independent speaker verification. IEEE Access, 10:99038–99049.
Violato, R. P. V., Neto, M. U., and Simões, F. O. (2013). BioCPqD: uma base de dados biométricos com amostras de face e voz de indivíduos brasileiros [BioCPqD: a biometric database with face and voice samples from Brazilian individuals]. Cad. CPqD Tecnologia, Campinas, v. 9, n. 2, p. 7-18, jul./dez. 2013.
Published
17/11/2024
How to Cite
R. JR, Alcino Vilela; COLOMBO, Julia C.; BERGAMASCHI, Murilo M.; ULIANI NETO, Mário; RUNSTEIN, Fernando O.; VIOLATO, Ricardo P. V.; LIMA, Marcus. Text-Dependent Speech Biometrics - Evaluation of pre-trained ECAPA-TDNN and Wav2vec models with the BioCPqD and RedDots databases. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 21., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 275-283. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2024.245052.