Dimensional Speech Emotion Recognition from Bimodal Features

  • Larissa Guder (PUCRS)
  • João Paulo Aires (PUCRS)
  • Felipe Meneguzzi (PUCRS / University of Aberdeen)
  • Dalvan Griebler (PUCRS)

Abstract


In the context of human-machine interaction, affective computing aims to enable computers to recognize or express emotions. Speech emotion recognition is an affective computing task that aims to recognize emotions in an audio utterance. The most common way to predict emotions from speech is to assign pre-determined classes in offline mode, which restricts recognition to that fixed set of classes. To avoid this restriction, dimensional emotion recognition uses dimensions such as valence, arousal, and dominance to represent emotions with finer granularity. Existing approaches propose using textual information to improve results for the valence dimension. Although recent efforts have improved dimensional speech emotion recognition, they do not consider real-world scenarios in which the input must be processed quickly. Considering these aspects, we take a first step towards a bimodal approach for dimensional speech emotion recognition in streaming. Our approach combines sentence and audio representations as input to a recurrent neural network that performs speech emotion recognition. Our final architecture achieves a Concordance Correlation Coefficient (CCC) of 0.5915 for arousal, 0.1431 for valence, and 0.5899 for dominance on the IEMOCAP dataset.
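
For concreteness, the sketch below illustrates the kind of pipeline the abstract describes: a per-utterance sentence embedding is fused with frame-level audio embeddings and fed to a recurrent network that regresses valence, arousal, and dominance, scored with the Concordance Correlation Coefficient (CCC). This is a minimal sketch, not the authors' implementation: the embedding sizes, the concatenation fusion, and the GRU are illustrative assumptions.

import torch
import torch.nn as nn

class BimodalDimensionalSER(nn.Module):
    def __init__(self, audio_dim=512, text_dim=384, hidden_dim=128):
        super().__init__()
        # GRU over per-frame audio features fused with the sentence embedding.
        self.rnn = nn.GRU(audio_dim + text_dim, hidden_dim, batch_first=True)
        # A single regression head outputs (valence, arousal, dominance).
        self.head = nn.Linear(hidden_dim, 3)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, time, audio_dim) frame-level audio embeddings
        # text_feats:  (batch, text_dim), one sentence embedding per utterance
        text_rep = text_feats.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        fused = torch.cat([audio_feats, text_rep], dim=-1)
        _, last_hidden = self.rnn(fused)          # (1, batch, hidden_dim)
        return self.head(last_hidden.squeeze(0))  # (batch, 3)

def ccc(pred, gold):
    # Concordance Correlation Coefficient:
    #   CCC = 2*cov(p, g) / (var(p) + var(g) + (mean(p) - mean(g))**2)
    pred_mean, gold_mean = pred.mean(), gold.mean()
    covar = ((pred - pred_mean) * (gold - gold_mean)).mean()
    return 2 * covar / (pred.var(unbiased=False) + gold.var(unbiased=False)
                        + (pred_mean - gold_mean) ** 2)

Unlike Pearson correlation, CCC also penalizes differences in mean and scale between predictions and gold annotations, which makes it the usual score for dimensional emotion regression.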

References

American Psychiatric Association (2022). Diagnostic and Statistical Manual of Mental Disorders: DSM-5-TR. American Psychiatric Association Publishing.

Atmaja, B. and Akagi, M. (2020). Dimensional speech emotion recognition from speech features and word embeddings by using multitask learning. APSIPA Transactions on Signal and Information Processing, 9.

Atmaja, B. and Akagi, M. (2021). Two-stage dimensional emotion recognition by fusing predictions of acoustic and text networks using SVM. Speech Communication, 126:9–21.

Bertero, D., Siddique, F. B., Wu, C.-S., Wan, Y., Chan, R. H. Y., and Fung, P. (2016). Real-time speech emotion and sentiment recognition for interactive dialogue systems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1042–1047, Austin, Texas. Association for Computational Linguistics.

Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., and Narayanan, S. S. (2008). IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42:335–359.

Cramer, A. L., Wu, H.-H., Salamon, J., and Bello, J. P. (2019). Look, listen, and learn more: Design choices for deep audio embeddings. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3852–3856.

de Lope, J. and Graña, M. (2023). An ongoing review of speech emotion recognition. Neurocomputing, 528:1–11.

Dominguez-Morales, J. P., Liu, Q., James, R., Gutierrez-Galan, D., Jimenez-Fernandez, A., Davidson, S., and Furber, S. (2018). Deep spiking neural network model for time-variant signals classification: a real-time speech recognition approach. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 1–8.

Ekman, P. (1999). Basic emotions. In Dalgleish, T. and Power, M. J., editors, Handbook of Cognition and Emotion, pages 45–60. Wiley.

Geetha, A., Mala, T., Priyanka, D., and Uma, E. (2024). Multimodal emotion recognition with deep learning: Advancements, challenges, and future directions. Information Fusion, 105.

Ghriss, A., Yang, B., Rozgic, V., Shriberg, E., and Wang, C. (2022). Sentiment-aware automatic speech recognition pre-training for enhanced speech emotion recognition. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, volume 2022-May, pages 7347–7351.

Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R. J., and Wilson, K. (2017). CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE Press.

Ispas, A.-R., Deschamps-Berger, T., and Devillers, L. (2023). A multi-task, multi-modal approach for predicting categorical and dimensional emotions. In ACM International Conference Proceeding Series, pages 311–317.

Julião, M., Abad, A., and Moniz, H. (2020). Exploring text and audio embeddings for multi-dimension elderly emotion recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, volume 2020-October, pages 2067–2071.

Koh, E. S. and Dubnov, S. (2021). Comparison and analysis of deep audio embeddings for music emotion recognition. CoRR, abs/2104.06517.

Lech, M., Stolar, M., Best, C., and Bolia, R. (2020). Real-time speech emotion recognition using a pre-trained image classification network: Effects of bandwidth reduction and companding. Frontiers in Computer Science, 2.

Leow, C. S., Hayakawa, T., Nishizaki, H., and Kitaoka, N. (2020). Development of a low-latency and real-time automatic speech recognition system. In 2020 IEEE 9th Global Conference on Consumer Electronics (GCCE), pages 925–928.

Lieskovská, E., Jakubec, M., Jarina, R., and Chmulík, M. (2021). A review on speech emotion recognition using deep learning and attention mechanism. Electronics, 10.

Macary, M., Tahon, M., Estève, Y., and Rousseau, A. (2021). On the use of self-supervised pre-trained acoustic and linguistic features for continuous speech emotion recognition. In 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings, pages 373–380.

Mehrabian, A. (1996). Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament. Current Psychology, 14:261–292.

Pham, N. T., Dang, D. N. M., Pham, B. N. H., and Nguyen, S. D. (2023). SERVER: Multi-modal speech emotion recognition using transformer-based and vision-based embeddings. In Proceedings of the 2023 8th International Conference on Intelligent Information Technology, ICIIT ’23, pages 234–238, New York, NY, USA. Association for Computing Machinery.

Russell, J. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39:1161–1178.

Saeki, T., Takamichi, S., and Saruwatari, H. (2021). Low-latency incremental text-to-speech synthesis with distilled context prediction network. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 749–756.

Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (ICLR).

Singh, R., Yadav, H., Sharma, M., Gosain, S., and Shah, R. R. (2019). Automatic speech recognition for real-time systems. In 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), pages 189–198.

Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., and Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, volume 2018-April, pages 5329–5333.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., and Bethard, S., editors, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.

Sogancioglu, G., Verkholyak, O., Kaya, H., Fedotov, D., Cadée, T., Salah, A., and Karpov, A. (2020). Is everything fine, grandma? Acoustic and linguistic modeling for robust elderly speech emotion recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, volume 2020-October, pages 2097–2101.

Srinivasan, S., Huang, Z., and Kirchhoff, K. (2022). Representation learning through cross-modal conditional teacher-student training for speech emotion recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 2022-May, pages 4298–4302.

Stolar, M. N., Lech, M., Bolia, R. S., and Skinner, M. (2017). Real time speech emotion recognition using RGB image classification and transfer learning. In 2017 11th International Conference on Signal Processing and Communication Systems (ICSPCS), pages 1–8.

Sun, L., Lian, Z., Tao, J., Liu, B., and Niu, M. (2020). Multi-modal continuous dimensional emotion recognition using recurrent neural network and self-attention mechanism. In Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-Life Media Challenge and Workshop, MuSe’20, pages 27–34, New York, NY, USA. Association for Computing Machinery.

Testa, B., Xiao, Y., Sharma, H., Gump, A., and Salekin, A. (2023). Privacy against real-time speech emotion detection via acoustic adversarial evasion of machine learning. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 7.

Triantafyllopoulos, A., Wagner, J., Wierstorf, H., Schmitt, M., Reichel, U., Eyben, F., Burkhardt, F., and Schuller, B. (2022). Probing speech emotion recognition transformers for linguistic knowledge. In Proc. Interspeech 2022, volume 2022-September, pages 146–150.

Wang, C., Ren, Y., Zhang, N., Cui, F., and Luo, S. (2022). Speech emotion recognition based on multi-feature and multi-lingual fusion. Multimedia Tools and Applications, 81:4897–4907.

Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., and Zhou, M. (2020). MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA. Curran Associates Inc.

Wundt, W. and Judd, C. (1897). Outlines of Psychology. W. Engelmann.
Published
25/06/2024
GUDER, Larissa; AIRES, João Paulo; MENEGUZZI, Felipe; GRIEBLER, Dalvan. Dimensional Speech Emotion Recognition from Bimodal Features. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 24., 2024, Goiânia/GO. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 579-590. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2024.2779.
