SpeechVis: Simplifying Speech Emotion Visualization

  • Luan Dopke, PUCRS
  • Arthur Accorsi, PUCRS
  • João Paulo Aires, PUCRS
  • Larissa Guder, PUCRS
  • Isabel Harb Manssour, PUCRS
  • Dalvan Griebler, PUCRS

Abstract

As the amount of online content grows, analyzing and following discussions becomes harder. Grasping relevant information, such as the main topics discussed and the emotions expressed in an audio source (e.g., a podcast), requires people to listen to the entire content to understand its context. This takes time, and individual interpretations of emotion can bias understanding. A visual summary of such information can help people quickly grasp the audio context and analyze the content in terms of speakers, their emotions, and the main topics covered. In this work, we introduce SpeechVis, a visual analytics tool that visually summarizes speech emotions from an audio source. SpeechVis extracts several kinds of information from the audio, such as the transcription, speakers, main topics, and emotions, and provides visualizations and statistics about the topics discussed and each speaker's emotions. We used multiple off-the-shelf machine learning models to extract this information and developed several visual representations designed to facilitate audio analysis. To evaluate SpeechVis, we selected two use cases and analyzed them to demonstrate how its visualizations can provide valuable insights and facilitate audio interpretation.
Keywords: Visual Analytics, Speech Visualization, Emotion Classification, Signal Processing, Machine Learning
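
As an illustration of the kind of processing pipeline described in the abstract, the sketch below combines the off-the-shelf tools cited in the references: WhisperX for transcription, pyannote.audio for speaker diarization, and a wav2vec 2.0-based emotion classifier. This is a minimal, hypothetical example assuming those specific model identifiers, the Hugging Face transformers audio-classification pipeline, and placeholder file paths and tokens; it is not the authors' actual implementation, and the exact models, parameters, and topic-extraction steps used by SpeechVis may differ.

# Hypothetical sketch: transcription (WhisperX), speaker diarization
# (pyannote.audio), and per-turn speech emotion recognition (wav2vec 2.0-based).
# Model names, tokens, and file paths are assumptions, not the authors' setup.
import whisperx
from pyannote.audio import Pipeline
from transformers import pipeline as hf_pipeline

AUDIO_FILE = "episode.wav"   # hypothetical input file
DEVICE = "cuda"              # or "cpu"
SAMPLE_RATE = 16000          # whisperx.load_audio resamples to 16 kHz mono

# 1. Transcription (WhisperX).
asr_model = whisperx.load_model("large-v2", DEVICE)
audio = whisperx.load_audio(AUDIO_FILE)
transcript = asr_model.transcribe(audio, batch_size=16)

# 2. Speaker diarization (pyannote.audio); requires a Hugging Face access token.
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN"
)
diarization = diarizer(AUDIO_FILE)

# 3. Emotion classification per speaker turn. The model id is the one cited in
#    the references (Katyal 2024); loading it through the generic
#    audio-classification pipeline is an assumption.
emotion_clf = hf_pipeline(
    "audio-classification",
    model="harshit345/xlsr-wav2vec-speech-emotion-recognition",
    device=0 if DEVICE == "cuda" else -1,
)

turns = []
for segment, _, speaker in diarization.itertracks(yield_label=True):
    start = int(segment.start * SAMPLE_RATE)
    end = int(segment.end * SAMPLE_RATE)
    scores = emotion_clf(audio[start:end])   # list of {label, score}, best first
    turns.append({
        "speaker": speaker,
        "start": segment.start,
        "end": segment.end,
        "emotion": scores[0]["label"],
    })

print(transcript["segments"][:2])  # transcribed text with timestamps
print(turns[:5])                   # who spoke when, and the predicted emotion

The per-turn emotion labels and timestamped transcript segments produced by a pipeline like this are the kind of raw data that visual summaries such as those in SpeechVis (built with D3, per the references) can aggregate per speaker and per topic.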

References

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv:2006.11477 [cs.CL]

Max Bain, Jaesung Huh, Tengda Han, and Andrew Zisserman. 2023. WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. arXiv:2303.00747 [cs.SD]

Pablo Barros and Stefan Wermter. 2016. Developing crossmodal expression recognition based on a deep neural model. Adaptive Behavior 24, 5 (Oct. 2016), 373–396. DOI: 10.1177/1059712316664017

Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. 2011. D3: Data-Driven Documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis) (2011). [link]

Hervé Bredin, Ruiqing Yin, Juan Manuel Coria, Gregory Gelly, Pavel Korshunov, Marvin Lavechin, Diego Fustes, Hadrien Titeux, Wassim Bouaziz, and Marie-Philippe Gill. 2019. pyannote.audio: neural building blocks for speaker diarization. arXiv:1911.01255 [eess.AS]

Paul Buitelaar, Ian D. Wood, Sapna Negi, Mihael Arcan, John P. McCrae, Andrejs Abele, Cecile Robin, Vladimir Andryushechkin, Housam Ziad, Hesam Sagha, Maximilian Schmitt, Bjorn W. Schuller, J. Fernando Sanchez-Rada, Carlos A. Iglesias, Carlos Navarro, Andreas Giefer, Nicolaus Heise, Vincenzo Masucci, Francesco A. Danza, Ciro Caterino, Pavel Smrz, Michal Hradis, Filip Povolny, Marek Klimes, Pavel Matejka, and Giovanni Tummarello. 2018. MixedEmotions: An Open-Source Toolbox for Multimodal Emotion Analysis. IEEE Transactions on Multimedia 20, 9 (Sept. 2018), 2454–2465. DOI: 10.1109/tmm.2018.2798287

Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42, 4 (Dec. 2008), 335–359.

Tim Dalgleish and Mick Power (Eds.). 1999. Handbook of Cognition and Emotion. John Wiley & Sons, Chichester, England. DOI: 10.1002/0470013494

Catherine D’Ignazio. 2017. Creative data literacy: Bridging the gap between the data-haves and data-have nots. Information Design Journal 23, 1 (2017), 6–18.

Christian Hildebrand, Fotis Efthymiou, Francesc Busquet, William H. Hampton, Donna L. Hoffman, and Thomas P. Novak. 2020. Voice analytics in business research: Conceptual foundations, acoustic feature extraction, and applications. Journal of Business Research 121 (2020), 364–374. DOI: 10.1016/j.jbusres.2020.09.020

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. DOI: 10.5281/zenodo.1212303

Harshit Katyal. 2024. harshit345/xlsr-wav2vec-speech-emotion-recognition. [link]

Renato Kempter, Valentina Sintsova, Claudiu Musat, and Pearl Pu. 2014. EmotionWatch: Visualizing Fine-Grained Emotions in Event-Related Tweets. Proceedings of the International AAAI Conference on Web and Social Media 8, 1 (May 2014), 236–245. DOI: 10.1609/icwsm.v8i1.14556

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv:1910.13461 [cs.CL]

Kristina Loderer, Kornelia Gentsch, Melissa C. Duffy, Mingjing Zhu, Xiyao Xie, Jason A. Chavarría, Elisabeth Vogl, Cristina Soriano, Klaus R. Scherer, and Reinhard Pekrun. 2020. Are concepts of achievement-related emotions universal across cultures? A semantic profiling approach. Cognition and Emotion 34, 7 (April 2020), 1480–1488. DOI: 10.1080/02699931.2020.1748577

Kevin Maher, Zeyuan Huang, Jiancheng Song, Xiaoming Deng, Yu-Kun Lai, Cuixia Ma, Hao Wang, Yong-Jin Liu, and Hongan Wang. 2022. E-ffective: A Visual Analytic System for Exploring the Emotion and Effectiveness of Inspirational Speeches. IEEE Transactions on Visualization and Computer Graphics 28, 1 (Jan. 2022), 508–517. DOI: 10.1109/tvcg.2021.3114789

Albert Mehrabian. 1996. Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in Temperament. Current Psychology 14 (Dec. 1996), 261–292. DOI: 10.1007/BF02686918

Saif Mohammad and Peter Turney. 2010. Emotions Evoked by Common Words and Phrases: Using Mechanical Turk to Create an Emotion Lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, Diana Inkpen and Carlo Strapparava (Eds.). Association for Computational Linguistics, Los Angeles, CA, 26–34. [link]

Andrew Cameron Morris, Viktoria Maier, and Phil Green. 2004. From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition. In Interspeech. ISCA, 2765–2768. DOI: 10.21437/Interspeech.2004-668

Chris North and Ben Shneiderman. 2000. Snap-together visualization: a user interface for coordinating visualizations via relational schemata. In Proceedings of the Working Conference on Advanced Visual Interfaces (Palermo, Italy) (AVI ’00). Association for Computing Machinery, New York, NY, USA, 128–135. DOI: 10.1145/345513.345282

Caluã de Lacerda Pataca, Matthew Watkins, Roshan Peiris, Sooyeon Lee, and Matt Huenerfauth. 2023. Visualization of Speech Prosody and Emotion in Captions: Accessibility for Deaf and Hard-of-Hearing Users. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 831, 15 pages. DOI: 10.1145/3544548.3581511

Rosalind W Picard. 1997. Affective Computing. MIT Press, Cambridge, MA.

Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. 2017. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion 37 (2017), 98–125. DOI: 10.1016/j.inffus.2017.02.003

James A. Russell. 1980. A circumplex model of affect. Journal of Personality and Social Psychology 39, 6 (Dec. 1980), 1161–1178. DOI: 10.1037/h0077714

Jane Simpson, Sarah Carter, Susan H. Anthony, and Paul G. Overton. 2006. Is Disgust a Homogeneous Emotion? Motivation and Emotion 30, 1 (March 2006), 31–41. DOI: 10.1007/s11031-006-9005-1

Valentina Sintsova, Claudiu Musat, and Pearl Pu. 2013. Fine-Grained Emotion Recognition in Olympic Tweets Based on Human Computation. In Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Alexandra Balahur, Erik van der Goot, and Andres Montoyo (Eds.). Association for Computational Linguistics, Atlanta, Georgia, 12–20. [link]

Johannes Wagner, Andreas Triantafyllopoulos, Hagen Wierstorf, Maximilian Schmitt, Felix Burkhardt, Florian Eyben, and Björn W. Schuller. 2022. Model for Dimensional Speech Emotion Recognition based on Wav2vec 2.0. DOI: 10.5281/zenodo.6221127

Johannes Wagner, Andreas Triantafyllopoulos, Hagen Wierstorf, Maximilian Schmitt, Felix Burkhardt, Florian Eyben, and Björn W Schuller. 2023. Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 9 (2023), 10745–10759.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45, 4 (Feb. 2013), 1191–1207. DOI: 10.3758/s13428-012-0314-x

W.M. Wundt and C.H. Judd. 1897. Outlines of Psychology. W. Engelmann. [link]

Irena Yanushevskaya, Christer Gobl, and Ailbhe Ní Chasaide. 2013. Voice quality in affect cueing: does loudness matter? Frontiers in Psychology 4 (2013). DOI: 10.3389/fpsyg.2013.00335

Haipeng Zeng, Xingbo Wang, Aoyu Wu, Yong Wang, Quan Li, Alex Endert, and Huamin Qu. 2020. EmoCo: Visual Analysis of Emotion Coherence in Presentation Videos. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2020), 927–937. DOI: 10.1109/TVCG.2019.2934656

Jian Zhao, Liang Gou, Fei Wang, and Michelle Zhou. 2014. PEARL: An interactive visual analytic tool for understanding personal emotion style derived from social media. In 2014 IEEE Conference on Visual Analytics Science and Technology (VAST). IEEE, 203–212. DOI: 10.1109/vast.2014.7042496
Published
10/11/2025
DOPKE, Luan; ACCORSI, Arthur; AIRES, João Paulo; GUDER, Larissa; MANSSOUR, Isabel Harb; GRIEBLER, Dalvan. SpeechVis: Simplifying Speech Emotion Visualization. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 31., 2025, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 428-436. DOI: https://doi.org/10.5753/webmedia.2025.16115.
