Unraveling Emotional Dimensions in Brazilian Portuguese Speech through Deep Learning

Henrique Tibério B. V. Augusto; Vinícius P. Gonçalves; Edna Dias Canedo; Rodolfo Meneguette; Gustavo Pessin; Geraldo Pereira R. Filho

doi:10.5753/kdmile.2024.243865

Henrique Tibério B. V. Augusto (UnB)
Vinícius P. Gonçalves (UnB)
Edna Dias Canedo (UnB)
Rodolfo Meneguette (USP)
Gustavo Pessin (ITV)
Geraldo Pereira R. Filho (UESB)

Abstract

Speech is often our first form of communication and of emotional expression. Speech Emotion Recognition is a complex problem because emotional expression depends on the spoken language, dialect, accent, and cultural background of the speaker. The intensity of an emotion can affect our perception and lead us to misinterpret information, so recognizing it has potential applications in fields such as patient monitoring, security, commercial systems, and entertainment. This work uses Machine Learning and Deep Learning to infer the intensity of emotions in Portuguese speech, employing Domain Fusion with two distinct databases. An Autoencoder was built to extract features, and a supervised model was then trained to classify intensity into four classes: (i) weak; (ii) moderate; (iii) high; and (iv) peak. The results indicate that intensity can be inferred, although the data remain limited even after combining the two datasets. Two experimental scenarios were carried out with analogous architectures, varying the dimensionality of the representative features used as model input. In the performance metrics, the same class (high) showed the lowest variation in F1-Score between the two experiments, which raises questions for further study, while the most distant classes (weak and peak) performed best in both experiments.
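
As a minimal sketch of how a per-class F1-Score might be computed for the four intensity classes, the following Python snippet uses one-vs-rest counts. The function name per_class_f1 and the label sequences are illustrative assumptions; they are not the authors' code, data, or results.

```python
from collections import Counter

# Class names taken from the abstract; everything else here is hypothetical.
CLASSES = ["weak", "moderate", "high", "peak"]


def per_class_f1(y_true, y_pred, classes=CLASSES):
    """Return {class: F1} using one-vs-rest counts for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted, but the true class was different
            fn[t] += 1  # the true class t was missed
    scores = {}
    for c in classes:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


# Toy labels for illustration only, NOT results from the paper.
y_true = ["weak", "weak", "moderate", "moderate", "high", "high", "peak", "peak"]
y_pred = ["weak", "weak", "moderate", "high", "moderate", "high", "peak", "peak"]
scores = per_class_f1(y_true, y_pred)
```

Computing the same dictionary for each experimental scenario and taking the absolute difference per class is one simple way to compare how much each class's F1-Score varies between experiments.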

Keywords: Brazilian Portuguese, deep learning, emotion intensity, machine learning, speech emotion recognition
