Abstract
In this paper, we present an efficient method for training speaker recognition models on small or under-resourced datasets. The method requires far less data than state-of-the-art (SOTA) approaches such as the Angular Prototypical and GE2E loss functions, while achieving comparable results. It does so by exploiting knowledge of how a phoneme is reconstructed in the speaker's voice. For this purpose, we built a new dataset of 40 male speakers reading sentences in Portuguese, totaling approximately 3 hours of audio. We compare the three best architectures trained with our method and select the best one, which has a shallow architecture. We then compare this model with a SOTA method for speaker recognition: the Fast ResNet-34 trained on approximately 2,000 hours of audio with the Angular Prototypical and GE2E loss functions. Three experiments were carried out with datasets in different languages. Our model achieved the second-best result in two of these experiments and the best result in the third. This highlights the value of our method, which proved to be a strong competitor to SOTA speaker recognition models while using 500 times less data and a simpler approach.
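The evaluation step shared by the compared systems scores a pair of utterance embeddings by cosine similarity against a threshold (chosen, e.g., at the equal error rate operating point). The sketch below illustrates that scoring step only; the random vectors stand in for model outputs, and the function names and the threshold value of 0.7 are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.7) -> bool:
    """Accept the pair as the same speaker when similarity exceeds the threshold.
    In practice the threshold is tuned on a trial list, e.g. at the EER point."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Synthetic embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
base = rng.normal(size=256)                # reference utterance embedding
same = base + 0.05 * rng.normal(size=256)  # slight perturbation: same speaker
other = rng.normal(size=256)               # independent vector: different speaker

print(same_speaker(base, same))   # high similarity, accepted
print(same_speaker(base, other))  # near-orthogonal, rejected
```

Any encoder that maps utterances of the same speaker close together under this metric, whether trained with GE2E, Angular Prototypical, or a reconstruction objective, plugs into this same scoring step.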
Acknowledgments
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001, and by CNPq (National Council for Scientific and Technological Development) grant 304266/2020-5. We would also like to thank CyberLabs and the Itaipu Technological Park (Parque Tecnológico Itaipu, PTI) for their financial support of this work. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used in part of the experiments presented in this research.
© 2021 Springer Nature Switzerland AG
Cite this paper
Casanova, E. et al. (2021). Speech2Phone: A Novel and Efficient Method for Training Speaker Recognition Models. In: Britto, A., Valdivia Delgado, K. (eds.) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol. 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91698-5
Online ISBN: 978-3-030-91699-2