Beyond Audio Signals: Generative Model-Based Speaker Diarization in Portuguese

  • Antônio Oss Boll USP
  • Letícia Maria Puttlitz USP
  • Heloísa Oss Boll UFRGS
  • Rodrigo Mor Malossi UFRGS

Abstract


Speaker diarization, the task of automatically identifying the different speakers in audio and video, is frequently performed using probabilistic models and deep learning techniques. However, existing methods usually rely on direct analysis of the audio signal, which poses challenges for languages that lack established diarization methodologies, such as Portuguese. In this article, we propose a new approach to speaker diarization that leverages generative models for automatic speaker identification in Portuguese. Our pipeline uses three models: a transcription model converts the audio to text, a first generative model refines the transcription, and a second generative model performs the diarization itself. By capturing and analyzing speakers' stylistic patterns in the transcribed text, the method simplifies the diarization process and achieves high accuracy without depending on direct signal analysis. This approach not only increases the effectiveness of speaker identification but also extends the usefulness of generative models to new domains. It opens a new perspective for diarization research, particularly for building accurate audio and video systems for under-researched languages.
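The abstract describes a three-stage, text-based pipeline but does not name the models or the format of the diarized output. The sketch below only illustrates how the stages compose, under stated assumptions: each stage is passed in as a plain callable (so no particular speech-recognition or generative-model API is implied), and the diarization model is assumed to emit one turn per line in the hypothetical form `Speaker <label>: <utterance>`. The names `diarize_from_text`, `parse_turns`, and `Turn` are illustrative and do not come from the paper.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    speaker: str
    text: str


# Hypothetical output format assumed for the diarization model:
# one turn per line, e.g. "Speaker 1: Bom dia, tudo bem?"
TURN_PATTERN = re.compile(r"^\s*(Speaker\s+\S+?)\s*:\s*(.+?)\s*$", re.IGNORECASE)


def parse_turns(labeled_text: str) -> List[Turn]:
    """Parse speaker-labeled text into turns, skipping lines that do not match."""
    turns: List[Turn] = []
    for line in labeled_text.splitlines():
        match = TURN_PATTERN.match(line)
        if match:
            turns.append(Turn(speaker=match.group(1), text=match.group(2)))
    return turns


def diarize_from_text(
    audio_path: str,
    transcribe: Callable[[str], str],      # stage 1: audio -> raw transcript
    refine: Callable[[str], str],          # stage 2: generative clean-up of the transcript
    label_speakers: Callable[[str], str],  # stage 3: generative speaker labeling
) -> List[Turn]:
    """Chain the three stages; speaker turns are inferred from text only."""
    raw_transcript = transcribe(audio_path)
    clean_transcript = refine(raw_transcript)
    labeled = label_speakers(clean_transcript)
    return parse_turns(labeled)


if __name__ == "__main__":
    # Stub stages stand in for real models (assumption, for demonstration only).
    turns = diarize_from_text(
        "entrevista.wav",
        transcribe=lambda path: "bom dia tudo bem tudo otimo e voce",
        refine=lambda text: "Bom dia, tudo bem? Tudo ótimo, e você?",
        label_speakers=lambda text: "Speaker 1: Bom dia, tudo bem?\nSpeaker 2: Tudo ótimo, e você?",
    )
    for turn in turns:
        print(f"{turn.speaker}: {turn.text}")
```

Injecting the stages as callables keeps the sketch independent of any specific model, and a parser that skips malformed lines is one simple way to tolerate free-form generative output.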
Published
17/11/2024
BOLL, Antônio Oss; PUTTLITZ, Letícia Maria; BOLL, Heloísa Oss; MALOSSI, Rodrigo Mor. Beyond Audio Signals: Generative Model-Based Speaker Diarization in Portuguese. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 13., 2024, Belém/PA. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 247-259. ISSN 2643-6264.