Cross-Language Speech Emotion Recognition: English versus German

  • Erick K. Komati IFES
  • Karin S. Komati IFES

Abstract


This study analyzed the generalization ability of a speech emotion recognition (SER) model in a cross-corpus, cross-language scenario: a one-dimensional convolutional neural network was trained on English datasets and tested on a German dataset. Training used the CREMA-D, RAVDESS, SAVEE, and TESS datasets, and EmoDB was used for testing. The model achieved an accuracy of 0.61 on the training data, but its accuracy dropped to 0.30 on the German dataset. This decline highlights the limitations of SER models when faced with linguistic and cultural differences.
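To put the two reported accuracies in perspective, the following minimal sketch computes the absolute and relative accuracy loss between the source (English) and target (German) conditions. The helper names `accuracy` and `cross_corpus_drop` and the toy labels are hypothetical and are not taken from the paper; only the values 0.61 and 0.30 come from the abstract.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of utterances whose predicted emotion matches the label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_corpus_drop(source_acc: float, target_acc: float) -> tuple[float, float]:
    """Return (absolute, relative) accuracy loss from source to target corpus."""
    absolute = source_acc - target_acc
    return absolute, absolute / source_acc


# Hypothetical toy labels, used only to exercise accuracy().
toy_true = ["anger", "sadness", "happiness", "fear"]
toy_pred = ["anger", "sadness", "fear", "fear"]
toy_acc = accuracy(toy_true, toy_pred)  # 3 of 4 correct

# Accuracies reported in the abstract: 0.61 (English) and 0.30 (German).
absolute, relative = cross_corpus_drop(0.61, 0.30)
print(f"absolute drop: {absolute:.2f}, relative drop: {relative:.1%}")
```

Under these numbers the model loses 0.31 accuracy points, roughly half of its source-condition accuracy, when moved to the German corpus.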

Published
2025-10-16
KOMATI, Erick K.; KOMATI, Karin S. Cross-Language Speech Emotion Recognition: English versus German. In: REGIONAL SCHOOL OF INFORMATICS OF ESPÍRITO SANTO (ERI-ES), 10., 2025, Espírito Santo/ES. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 70-79. DOI: https://doi.org/10.5753/eries.2025.16020.