BAH: Beyond Acoustic Handcrafted Features for Speech Emotion Recognition in Portuguese
Abstract
Affective computing integrates human emotions into computing applications. One of its tasks is Speech Emotion Recognition (SER), which identifies emotions from spoken audio. Although emotion is a universal aspect of human experience, each culture and language expresses and understands emotions differently, so SER models are commonly designed for a single language. In this work, we explore VERBO, a Brazilian Portuguese dataset for categorical emotion recognition. Our main objective is to determine the best way to extract acoustic features for training an SER classifier. We compare 18 methods for generating audio representations, grouped into handcrafted features and audio embeddings. TRILL embeddings are the best representation for VERBO: combined with an SVM classifier, they achieve 92% accuracy, which is, to the best of our knowledge, the state of the art for this dataset.
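Although the exact training configuration is not reproduced here, the pipeline can be sketched as: extract one fixed-size TRILL embedding per utterance, then fit an SVM on top. The sketch below assumes the publicly released TRILL module on TensorFlow Hub, mean-pooling of frame-level embeddings, and an RBF-kernel SVM with scikit-learn defaults; the pooling strategy and SVM settings are illustrative assumptions rather than the paper's exact setup, and train_paths, train_labels, test_paths, and test_labels are hypothetical placeholders for a VERBO train/test split.

import numpy as np
import librosa
import tensorflow_hub as hub
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Public TRILL release; it expects mono float audio in [-1, 1] at 16 kHz.
trill = hub.load('https://tfhub.dev/google/nonsemantic-speech-benchmark/trill/3')

def clip_embedding(wav_path):
    # Load and resample the clip, then mean-pool TRILL's frame-level
    # 512-dimensional embeddings into one vector per utterance (an
    # assumed pooling choice, not necessarily the paper's).
    samples, _ = librosa.load(wav_path, sr=16000, mono=True)
    outputs = trill(samples=samples[np.newaxis, :], sample_rate=16000)
    frames = outputs['embedding'].numpy()[0]  # shape: (num_frames, 512)
    return frames.mean(axis=0)                # shape: (512,)

X_train = np.stack([clip_embedding(p) for p in train_paths])
X_test = np.stack([clip_embedding(p) for p in test_paths])

# Standardize the embeddings before the SVM; kernel and C are left at
# scikit-learn defaults here, purely for illustration.
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
clf.fit(X_train, train_labels)
print('test accuracy:', clf.score(X_test, test_labels))

Mean-pooling discards temporal ordering but yields a fixed-size vector regardless of clip length, which is what a standard SVM requires.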
References
Rodrigo Gregory Bastos Germano, Michel Pompeu Tcheou, Felipe da Rocha Henriques, and Sergio Pinto Gomes Junior. 2021. emoUERJ: an emotional speech database in Portuguese. DOI: 10.5281/zenodo.5427549
Kirsten Boehner, Rogério DePaula, Paul Dourish, and Phoebe Sengers. 2005. Affect: From Information to Interaction. In Proceedings of the 4th Decennial Conference on Critical Computing: Between Sense and Sensibility (Aarhus, Denmark) (CC ’05). Association for Computing Machinery, New York, NY, USA, 59–68. DOI: 10.1145/1094562.1094570
Arnaldo Candido Junior, Edresson Casanova, Anderson Soares, Frederico Santos de Oliveira, Lucas Oliveira, Ricardo Corso Fernandes Junior, Daniel Peixoto Pinto da Silva, Fernando Gorgulho Fayet, Bruno Baldissera Carlotto, Lucas Rafael Stefanel Gris, et al. 2022. CORAA ASR: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese. Language Resources and Evaluation (2022), 1–33.
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, and Furu Wei. 2022. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE Journal of Selected Topics in Signal Processing 16, 6 (2022), 1505–1518. DOI: 10.1109/JSTSP.2022.3188113
Roberto Yuri da Silva Franco, Rodrigo Santos do Amor Divino Lima, Rafael do Monte Paixão, Carlos Gustavo Resque dos Santos, and Bianchi Serique Meiguins. 2019. UXmood — A Sentiment Analysis and Information Visualization Tool to Support the Evaluation of Usability and User Experience. Information 10, 12 (2019). DOI: 10.3390/info10120366
Javier de Lope and Manuel Graña. 2023. An ongoing review of speech emotion recognition. Neurocomputing 528 (April 2023), 1–11. DOI: 10.1016/j.neucom.2023.01.002
Paul Ekman. 1999. Basic Emotions. John Wiley and Sons, Ltd, Chapter 3, 45–60. DOI: 10.1002/0470013494.ch3
Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. Opensmile: The Munich Versatile and Fast Open-Source Audio Feature Extractor. In Proceedings of the 18th ACM International Conference on Multimedia (Firenze, Italy) (MM ’10). Association for Computing Machinery, New York, NY, USA, 1459–1462. DOI: 10.1145/1873951.1874246
Geraldo P. Rocha Filho, Rodolfo I. Meneguette, Fábio Lúcio Lopes de Mendonça, Liriam Enamoto, Gustavo Pessin, and Vinícius P. Gonçalves. 2024. Toward an emotion efficient architecture based on the sound spectrum from the voice of Portuguese speakers. Neural Computing and Applications 36, 32 (Aug. 2024), 19939–19950. DOI: 10.1007/s00521-024-10249-4
A.V. Geetha, T. Mala, D. Priyanka, and E. Uma. 2024. Multimodal Emotion Recognition with Deep Learning: Advancements, challenges, and future directions. Information Fusion 105 (March 2024), 102218. DOI: 10.1016/j.inffus.2023.102218
K Ghaayathri Devi, Kolluru Likhitha, J Akshaya, Rfj Gokul, and G Jyothish Lal. 2024. Multi-Lingual Speech Emotion Recognition: Investigating Similarities between English and German Languages. In 2024 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI). 1–10. DOI: 10.1109/ACCAI61061.2024.10601715
Theodoros Giannakopoulos. 2015. pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis. PLOS ONE 10, 12 (Dec. 2015), 1–17. DOI: 10.1371/journal.pone.0144610
Larissa Guder, João Paulo Aires, Felipe Meneguzzi, and Dalvan Griebler. 2024. Dimensional Speech Emotion Recognition from Bimodal Features. In Brazilian Symposium on Computing Applied to Health. Brazilian Computing Society, 12.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. DOI: 10.48550/ARXIV.1512.03385
Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin Wilson. 2017. CNN Architectures for Large-Scale Audio Classification. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (New Orleans, LA, USA). IEEE Press, 131–135. DOI: 10.1109/ICASSP.2017.7952132
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio, Speech and Lang. Proc. 29 (Oct. 2021), 3451–3460. DOI: 10.1109/TASLP.2021.3122291
Neelakshi Joshi, Pedro V. V. Paiva, Murillo Batista, Marcos V. Cruz, and Josué J. G. Ramos. 2022. Improvements in Brazilian Portuguese Speech Emotion Recognition and its extension to Latin Corpora. In 2022 International Joint Conference on Neural Networks (IJCNN). 1–8. DOI: 10.1109/IJCNN55064.2022.9892110
Eva Lieskovská, Maroš Jakubec, Roman Jarina, and Michal Chmulík. 2021. A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism. Electronics 10 (January 2021), 1163. DOI: 10.3390/electronics10101163
Kristina Loderer, Kornelia Gentsch, Melissa C. Duffy, Mingjing Zhu, Xiyao Xie, Jason A. Chavarría, Elisabeth Vogl, Cristina Soriano, Klaus R. Scherer, and Reinhard Pekrun. 2020. Are concepts of achievement-related emotions universal across cultures? A semantic profiling approach. Cognition and Emotion 34 (March 2020), 1480–1488. DOI: 10.1080/02699931.2020.1748577
Albert Mehrabian. 1996. Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in Temperament. Current Psychology 14 (December 1996), 261–292. DOI: 10.1007/BF02686918
Nuzhat Mobassara, Nur Alam, and Nursadul Mamun. 2025. A Comprehensive Review of Speech Emotions Recognition using Machine Learning. In 2025 International Conference on Electrical, Computer and Communication Engineering (ECCE). 1–6. DOI: 10.1109/ECCE64574.2025.11013787
Myriam Munezero, Calkin Suero Montero, Erkki Sutinen, and John Pajunen. 2014. Are They Different? Affect, Feeling, Emotion, Sentiment, and Opinion Detection in Text. IEEE Transactions on Affective Computing 5 (April 2014), 101–111. DOI: 10.1109/TAFFC.2014.2317187
Jacob Peplinski, Joel Shor, Sachin Joglekar, Jake Garrison, and Shwetak Patel. 2021. FRILL: A Non-Semantic Speech Embedding for Mobile Devices. In Interspeech 2021. ISCA, 1204–1208. DOI: 10.21437/interspeech.2021-2070
Rosalind W. Picard. 1997. Affective Computing. MIT Press, Cambridge, MA.
Diego Resende Faria, Abraham Itzhak Weinberg, and Pedro Paulo Ayrosa. 2024. Multimodal Affective Communication Analysis: Fusing Speech Emotion and Text Sentiment Using Machine Learning. Applied Sciences 14, 15 (2024). DOI: 10.3390/app14156631
Kirk Roberts, Michael A. Roach, Joseph Johnson, Josh Guthrie, and Sanda M. Harabagiu. 2012. EmpaTweet: Annotating and Detecting Emotions on Twitter. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12). European Language Resources Association (ELRA), Istanbul, Turkey, 3806–3813. [link]
J.A. Russell. 1980. A circumplex model of affect. Journal of Personality and Social Psychology 39 (December 1980), 1161–1178.
Björn Schuller, Stefan Steidl, Anton Batliner, Julia Hirschberg, Judee K. Burgoon, Alice Baird, Aaron Elkins, Yue Zhang, Eduardo Coutinho, and Keelan Evanini. 2016. The INTERSPEECH 2016 computational paralinguistics challenge: Deception, sincerity and native language. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 08-12-September-2016. 2001–2005. DOI: 10.21437/Interspeech.2016-129
Mayank Sharma. 2022. Multi-Lingual Multi-Task Speech Emotion Recognition Using wav2vec 2.0. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6907–6911. DOI: 10.1109/ICASSP43922.2022.9747417
Joel Shor, Aren Jansen, Ronnie Maor, Oran Lang, Omry Tuval, Félix de Chaumont Quitry, Marco Tagliasacchi, Ira Shavitt, Dotan Emanuel, and Yinnon Haviv. 2020. Towards Learning a Universal Non-Semantic Representation of Speech. In Interspeech. ISCA, 140–144. DOI: 10.21437/interspeech.2020-1242
Joel Shor and Subhashini Venugopalan. 2022. TRILLsson: Distilled Universal Paralinguistic Speech Representations. In Interspeech 2022. ISCA. DOI: 10.21437/interspeech.2022-118
Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In 3rd International Conference on Learning Representations (ICLR). [link]
Youddha Beer Singh and Shivani Goel. 2022. A systematic literature review of speech emotion recognition approaches. Neurocomputing 492 (July 2022), 245–263. DOI: 10.1016/j.neucom.2022.04.028
José R. Torres Neto, Geraldo P.R. Filho, Leandro Y. Mano, and João Ueyama. 2018. VERBO: Voice Emotion Recognition dataBase in Portuguese Language. Journal of Computer Science 14, 11 (Nov 2018), 1420–1430. DOI: 10.3844/jcssp.2018.1420.1430
Samarth Tripathi and Homayoon S. M. Beigi. 2018. Multi-Modal Emotion recognition on IEMOCAP Dataset using Deep Learning. arXiv:1804.05788 [link]
