Multimodal Audio Emotion Recognition with Graph-based Consensus Pseudolabeling

  • Gabriel Natal Coutinho (Universidade de São Paulo)
  • Artur de Vlieger Lima (Universidade de São Paulo)
  • Juliano Yugoshi (Universidade de São Paulo)
  • Marcelo Isaias de Moraes Junior (Universidade de São Paulo)
  • Marcos Paulo Silva Gôlo (Universidade de São Paulo)
  • Ricardo Marcondes Marcacini (Universidade de São Paulo)


This paper presents Multimodal Graph-based Consensus Pseudolabeling (MGCP), a novel method for unsupervised emotion recognition in audio. The goal is to determine the emotion of each audio segment as coordinates in the circumplex model of emotions. MGCP combines pre-trained unimodal models for audio and text in a three-step process. First, each audio segment is represented by embeddings from the unimodal models. Second, a similarity graph is constructed for each modality, and the modality-specific graphs are integrated into a single multimodal graph. Third, pseudolabels are generated by measuring the consensus between modalities, and a graph regularization framework estimates the final emotion coordinates. In the experimental evaluation, MGCP surpasses both unimodal and traditional multimodal models, enabling audio emotion recognition without labeled data specific to the target domain.
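The three steps can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every concrete choice in it is an assumption made for illustration: the embeddings and per-modality predictions are random stand-ins for the outputs of pre-trained audio and text models, the graphs are cosine k-nearest-neighbour graphs, the multimodal graph is the sum of the two adjacency matrices, consensus means that the two predictions lie within a tolerance `tol` of each other, and the final step uses a harmonic-function style propagation with clamped pseudolabels in place of the paper's graph regularization framework. The function names `knn_graph`, `consensus_pseudolabels` and `propagate` are hypothetical.

```python
import numpy as np


def knn_graph(emb, k):
    """Symmetric k-nearest-neighbour adjacency built from cosine similarity."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)              # a node is not its own neighbour
    nearest = np.argsort(-sim, axis=1)[:, :k]   # k most similar nodes per row
    adj = np.zeros_like(sim)
    rows = np.repeat(np.arange(len(emb)), k)
    adj[rows, nearest.ravel()] = 1.0
    return np.maximum(adj, adj.T)               # make the graph undirected


def consensus_pseudolabels(pred_audio, pred_text, tol):
    """Keep the mean prediction where both modalities agree within tol (assumed rule)."""
    gap = np.linalg.norm(pred_audio - pred_text, axis=1)
    mask = gap <= tol
    labels = np.where(mask[:, None], (pred_audio + pred_text) / 2.0, 0.0)
    return labels, mask


def propagate(adj, labels, mask, iters=200):
    """Average neighbour coordinates repeatedly, clamping pseudolabeled nodes."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                         # guard against isolated nodes
    f = labels.copy()
    for _ in range(iters):
        f = adj @ f / deg
        f[mask] = labels[mask]
    return f


# Synthetic stand-ins for real model outputs (assumption: 40 audio segments).
rng = np.random.default_rng(0)
n = 40
audio_emb = rng.normal(size=(n, 16))            # step 1: audio embeddings
text_emb = rng.normal(size=(n, 8))              # step 1: transcript embeddings
pred_audio = rng.uniform(-1, 1, size=(n, 2))    # (valence, arousal) from audio
pred_text = pred_audio + rng.normal(scale=0.3, size=(n, 2))

# Step 2: one graph per modality, summed into a multimodal graph.
graph = knn_graph(audio_emb, 5) + knn_graph(text_emb, 5)

# Step 3: consensus pseudolabels, then propagation over the multimodal graph.
labels, mask = consensus_pseudolabels(pred_audio, pred_text, tol=0.3)
coords = propagate(graph, labels, mask)
print(coords.shape, int(mask.sum()))
```

Summing the adjacency matrices gives double weight to edges that both modalities support, which is one simple way to let agreement between modalities shape the propagation. Segments without a pseudolabel start at the origin and take their coordinates from their neighbours.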

Keywords: Audio Emotion Recognition, Pseudolabeling, Graph Learning


COUTINHO, Gabriel Natal; LIMA, Artur de Vlieger; YUGOSHI, Juliano; MORAES JUNIOR, Marcelo Isaias de; GÔLO, Marcos Paulo Silva; MARCACINI, Ricardo Marcondes. Multimodal Audio Emotion Recognition with Graph-based Consensus Pseudolabeling. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 20., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 809-823. ISSN 2763-9061. DOI: