Interaffection of Multiple Datasets with Neural Networks in Speech Emotion Recognition
Abstract
Many works that apply Deep Neural Networks (DNNs) to Speech Emotion Recognition (SER) use a single dataset, or train and evaluate the models separately when multiple datasets are available. Because each dataset is constructed under its own guidelines and SER labels are inherently subjective, obtaining robust and general models is difficult. We investigate how DNNs learn shared representations for different datasets in both multi-task and unified setups. We also analyse how each dataset benefits from the others across different combinations of datasets and popular neural network architectures. We show that the long-standing belief that more data yields more general models does not always hold for SER, as a different combination of datasets and meta-parameters gives the best result for each of the analysed datasets.
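To make the two setups concrete, the sketch below shows one plausible way to share representations across SER datasets; it is a minimal illustration, not the paper's implementation. The log-mel input size, the GRU encoder, the shared dimension, and the dataset names and label counts are all illustrative assumptions: the multi-task model keeps one dataset-specific classification head per corpus on top of a shared encoder, while the unified model maps every corpus onto a single merged emotion label set.

# Minimal sketch (assumed architecture, not the authors' code): shared encoder
# with per-dataset heads (multi-task) vs. a single head over a merged label set
# (unified). Requires PyTorch; all sizes and dataset names are hypothetical.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Utterance-level representation shared by all datasets."""
    def __init__(self, n_mels=40, shared_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, shared_dim, batch_first=True)

    def forward(self, x):            # x: (batch, time, n_mels) log-mel frames
        out, _ = self.rnn(x)
        return out.mean(dim=1)       # mean-pool over time -> (batch, shared_dim)

class MultiTaskSER(nn.Module):
    """Multi-task setup: one shared encoder, one output head per dataset."""
    def __init__(self, dataset_classes, shared_dim=128):
        super().__init__()
        self.encoder = SharedEncoder(shared_dim=shared_dim)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(shared_dim, n) for name, n in dataset_classes.items()}
        )

    def forward(self, x, dataset):
        return self.heads[dataset](self.encoder(x))

class UnifiedSER(nn.Module):
    """Unified setup: all datasets mapped onto one merged emotion label set."""
    def __init__(self, n_shared_labels, shared_dim=128):
        super().__init__()
        self.encoder = SharedEncoder(shared_dim=shared_dim)
        self.head = nn.Linear(shared_dim, n_shared_labels)

    def forward(self, x):
        return self.head(self.encoder(x))

if __name__ == "__main__":
    # Hypothetical corpora and label counts, only to exercise both models.
    multi = MultiTaskSER({"iemocap": 4, "ravdess": 8})
    unified = UnifiedSER(n_shared_labels=4)
    batch = torch.randn(2, 300, 40)  # 2 utterances, 300 frames, 40 mel bands
    print(multi(batch, "iemocap").shape, unified(batch).shape)

In this sketch, the gradients from every dataset update the shared encoder, which is the mechanism by which one corpus can help (or hurt) another; the per-dataset heads isolate the corpus-specific label conventions, whereas the unified head forces a common label space.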