Identification of Emotions in Spoken Language Using Deep Learning
Abstract
Emotions are one of the pillars of human communication, especially of spoken language. In emotional speech, emotions can be identified through attributes inherent to the voice, such as pitch, frequency, and intensity. In this paper, a model based on artificial data augmentation and Deep Learning, specifically a Convolutional Recurrent Neural Network, was proposed to automate this emotion-identification task by training it on the RAVDESS dataset with cross-validation. Evaluated by the accuracy and F1-score metrics, the model reached averages of 76.25% and 76%, respectively, and maxima of 83.33% and 80%, results slightly better than those reported in related work.
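The abstract does not detail the augmentation pipeline; as a minimal sketch, assuming raw waveforms as 1-D numpy arrays, two augmentations commonly used for speech emotion recognition (additive Gaussian noise at a target SNR, and a random time shift) could look like this. Function names and parameters are illustrative, not taken from the paper:

```python
import numpy as np

def add_noise(wave, snr_db=20.0, rng=None):
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def time_shift(wave, max_frac=0.1, rng=None):
    """Circularly shift the waveform by up to max_frac of its length."""
    rng = rng or np.random.default_rng(0)
    limit = int(len(wave) * max_frac)
    shift = rng.integers(-limit, limit + 1)
    return np.roll(wave, shift)

# Synthetic 1-second, 16 kHz tone standing in for a RAVDESS utterance.
wave = np.sin(2 * np.pi * 440 * np.linspace(0, 1, 16000))
augmented = [add_noise(wave), time_shift(wave)]
```

Each augmented copy keeps the original length and label, so the training set can be enlarged several-fold before extracting the spectrogram-like features a convolutional recurrent network consumes.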
References
Bhavan, A. et al. (2019). Bagged support vector machines for emotion recognition from speech. Knowledge-Based Systems, 184: 104886. DOI: 10.1016/j.knosys.2019.104886.
Cummins, N. et al. (2017). An Image-based Deep Spectrum Feature Representation for the Recognition of Emotional Speech. In Proceedings of the 25th ACM International Conference on Multimedia (MM '17), pages 478-484. DOI: 10.1145/3123266.3123371.
Fayek, H. M., Lech, M. and Cavedon, L. (2017). Evaluating deep learning architectures for Speech Emotion Recognition. Neural Networks, 92: 60-68. DOI: 10.1016/j.neunet.2017.02.013.
Han, K., Yu, D. and Tashev, I. (2014). Speech emotion recognition using deep neural network and extreme learning machine. In Interspeech 2014, pages 223-227.
Jalal, M. A. et al. (2019). Learning temporal clusters using capsule routing for speech emotion recognition. In Interspeech 2019, pages 1701-1705. DOI: 10.21437/interspeech.2019-3068.
Koolagudi, S. G. and Rao, K. S. (2012). Emotion recognition from speech: a review. International Journal of Speech Technology, 15: 99-117. DOI: 10.1007/s10772-011-9125-1.
Kwon, O. W. et al. (2003). Emotion recognition by speech signals. In Eurospeech 2003, pages 125-128.
Lee, J. and Tashev, I. (2015). High-level feature representation using recurrent neural network for speech emotion recognition. In Interspeech 2015, pages 1537-1540.
Livingstone, S. R. and Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE, 13(5): e0196391. DOI: 10.1371/journal.pone.0196391.
Mustaqeem et al. (2020). Clustering-Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM. IEEE Access, 8: 79861-79875. DOI: 10.1109/ACCESS.2020.2990405.
Nwe, T. L., Foo, S. W. and Silva, L. C. (2003). Speech emotion recognition using hidden Markov models. Speech Communication, 41(4): 603-623. DOI: 10.1016/S0167-6393(03)00099-2.
Paiva, E. C. (2017). Reconhecimento de emoção através da voz para integração em uma aplicação web. Universidade Federal de Uberlândia.
Mao, Q. et al. (2014). Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Transactions on Multimedia, 16(8): 2203-2213. DOI: 10.1109/TMM.2014.2360798.
Schuller, B., Rigoll, G. and Lang, M. (2003). Hidden Markov model-based speech emotion recognition. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings (ICASSP'03), pages II-1. DOI: 10.1109/ICASSP.2003.1202279.
Zeng, Y., Mao, H., Peng, D. and Yi, Z. (2019). Spectrogram based multi-task audio classification. Multimedia Tools and Applications, 78(3): 3705-3722.
Zhang, Y. et al. (2018). Attention based fully convolutional network for speech emotion recognition. In 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 1771-1775. DOI: 10.23919/APSIPA.2018.8659587.