From audio to information: Learning topics from audio transcripts

  • João Pedro Rodrigues, Pontifícia Universidade Católica do Paraná
  • Emerson Paraiso, Pontifícia Universidade Católica do Paraná


This work analyzes the technical feasibility of working with audio transcriptions from YouTube and presents a method covering data acquisition, pre-processing, and post-processing for this type of data. Topics are modeled with the Latent Dirichlet Allocation (LDA) algorithm. An approach is also presented for dynamically determining the ideal number of topics in a given corpus. The experiments used a database of 250 audio transcriptions and yielded a model with coherence in the range of 40%.
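
One way to picture the topic-count selection described above is to fit LDA for several candidate topic counts and keep the count with the highest average topic coherence. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the inference procedure or the coherence measure, so collapsed Gibbs sampling and the UMass coherence score are stand-ins, and the toy corpus, function names, and hyperparameters (`alpha`, `beta`, `iters`, `top_n`) are all hypothetical.

```python
import math
import random
from collections import defaultdict


def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    # Random initial topic assignment for every token.
    for di, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[di][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                # Remove the token's current assignment, then resample it.
                z = assign[di][wi]
                doc_topic[di][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[di][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[di][wi] = z
                doc_topic[di][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def umass_coherence(top_words, docs):
    """Average UMass coherence over ordered pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score, pairs = 0.0, 0
    for m in range(1, len(top_words)):
        for l in range(m):
            co = doc_freq(top_words[m], top_words[l])
            score += math.log((co + 1) / doc_freq(top_words[l]))
            pairs += 1
    return score / pairs if pairs else 0.0


def select_num_topics(docs, candidates, top_n=5):
    """Return the candidate topic count with the best mean coherence."""
    scores = {}
    for k in candidates:
        coherences = []
        for counts in fit_lda(docs, k):
            ranked = sorted(
                (w for w in counts if counts[w] > 0),
                key=lambda w: (-counts[w], w),
            )
            coherences.append(umass_coherence(ranked[:top_n], docs))
        scores[k] = sum(coherences) / len(coherences)
    return max(scores, key=scores.get), scores


# Hypothetical tokenized transcripts (already pre-processed).
docs = [
    ["goal", "match", "team", "league", "goal"],
    ["team", "coach", "match", "league"],
    ["recipe", "oven", "flour", "sugar"],
    ["sugar", "flour", "cake", "oven", "recipe"],
    ["match", "goal", "coach", "team"],
    ["cake", "recipe", "sugar", "oven"],
]
best_k, scores = select_num_topics(docs, candidates=[2, 3, 4])
print(best_k, scores)
```

UMass scores are log-ratios and are not directly comparable to a coherence value reported as a percentage, which suggests a different, normalized measure in the original experiments. On a real corpus a library implementation of LDA and coherence would likely replace this hand-written sampler.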

Keywords: audio transcription, data mining, machine learning, topic modeling


RODRIGUES, João Pedro; PARAISO, Emerson. From audio to information: Learning topics from audio transcripts. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 8., 2020, Online Event. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 121-128. ISSN 2763-8944. DOI: