A Comparison of Deep Learning Architectures for Automatic Gender Recognition from Audio Signals

Alef Iury S. Ferreira; Frederico S. Oliveira; Nádia F. Felipe da Silva; Anderson S. Soares

doi:10.5753/eniac.2021.18297

Alef Iury S. Ferreira UFG
Frederico S. Oliveira UFMT
Nádia F. Felipe da Silva UFG
Anderson S. Soares UFG


Abstract

Gender recognition from speech is a problem in human speech analysis, with applications ranging from personalized product recommendation to forensic science. Identifying the efficiency and costs of the different approaches to this problem is essential. This work investigates and compares the efficiency and costs of different deep learning architectures for gender recognition from speech. The results show that the one-dimensional convolutional model achieves the best results. However, the fully connected model achieved comparable results at a lower cost, both in memory usage and in training time.
