Dimensional Speech Emotion Recognition: a Bimodal Approach
Abstract
In the context of human-machine interaction, affective computing aims to enable computers to recognize and express emotions. Speech Emotion Recognition is an affective-computing task that aims to recognize emotions in an audio utterance. The most common approach predicts emotions from speech as a set of predetermined classes in an offline setting, which restricts recognition to that fixed set of classes. To avoid this restriction, dimensional emotion recognition uses dimensions such as valence, arousal, and dominance, which can represent emotions with finer granularity. Existing approaches propose using textual information to improve results for the valence dimension. Although recent efforts have improved speech emotion recognition for predicting emotion dimensions, they do not consider real-world scenarios in which the input must be processed within a short time. Considering these aspects, this work provides the first step towards a bimodal approach for Dimensional Speech Emotion Recognition in streaming. Our approach combines sentence and audio representations as input to a recurrent neural network that performs speech emotion recognition. We evaluate different methods for creating audio and text representations, as well as automatic speech recognition techniques. Our best results achieve a Concordance Correlation Coefficient (CCC) of 0.5915 for arousal, 0.4165 for valence, and 0.5899 for dominance on the IEMOCAP dataset.
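The abstract describes the bimodal model only at a high level. The sketch below is a minimal illustration, in PyTorch, of one way such a model and the CCC metric could look; the feature dimensions, the GRU cell, and the fusion-by-repetition of the sentence embedding are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class BimodalDimensionalSER(nn.Module):
    """Sketch of a bimodal (audio + text) dimensional SER model, not the authors' exact architecture."""

    def __init__(self, audio_dim=128, text_dim=768, hidden_dim=256):
        super().__init__()
        # GRU over the fused audio/text sequence; the abstract only says
        # "recurrent neural network", so the GRU is an assumed choice.
        self.rnn = nn.GRU(audio_dim + text_dim, hidden_dim, batch_first=True)
        # One continuous output per emotion dimension: arousal, valence, dominance.
        self.head = nn.Linear(hidden_dim, 3)

    def forward(self, audio_feats, sentence_emb):
        # audio_feats: (batch, time, audio_dim); sentence_emb: (batch, text_dim).
        # Repeat the sentence representation along time so it can be concatenated
        # with each audio frame (one possible fusion scheme, assumed here).
        text_seq = sentence_emb.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        fused = torch.cat([audio_feats, text_seq], dim=-1)
        _, last_hidden = self.rnn(fused)           # last_hidden: (1, batch, hidden_dim)
        return self.head(last_hidden.squeeze(0))   # (batch, 3) dimensional predictions


def ccc(pred, gold):
    """Concordance Correlation Coefficient, the evaluation metric reported above."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    cov = ((pred - pred_mean) * (gold - gold_mean)).mean()
    return 2 * cov / (
        pred.var(unbiased=False) + gold.var(unbiased=False) + (pred_mean - gold_mean) ** 2
    )
```

In a streaming setting, the recurrent layer could instead emit a prediction at every frame so that estimates are available before the utterance ends; the utterance-level head above is only the simplest variant.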