From audio to information: Learning topics from audio transcripts

  • João Pedro Rodrigues, Pontifícia Universidade Católica do Paraná
  • Emerson Paraiso, Pontifícia Universidade Católica do Paraná


This work analyzes the technical feasibility of working with audio transcriptions from YouTube and presents a method covering data acquisition, pre-processing, and post-processing for this type of data. Topics are modeled with the Latent Dirichlet Allocation (LDA) algorithm, and an approach is presented to dynamically determine the ideal number of topics in a given corpus. The experiments used a database of 250 audio transcriptions and yielded a model with coherence in the range of 40%.
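
As a rough illustration of choosing the number of topics by coherence, the sketch below fits LDA for several candidate topic counts and keeps the count with the highest score. It is not the authors' implementation: the collapsed Gibbs sampler, the UMass-style coherence measure (which may differ from the measure behind the reported 40%), the stop-word list, the toy transcripts, and all function names are assumptions made for this example.

```python
import math
import random
import re
from collections import Counter

# Hypothetical stop-word list and toy "transcripts"; not data from the paper.
STOPWORDS = {"the", "a", "in", "and", "with", "for", "this", "after", "another", "needs"}
TRANSCRIPTS = [
    "the team scored a goal in the final match",
    "the coach praised the team after the match",
    "the striker scored another goal this season",
    "the recipe needs flour sugar and butter",
    "bake the cake with butter and sugar",
    "mix flour and eggs for the cake recipe",
]

def preprocess(text):
    """Lowercase, tokenize on letters, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def fit_lda(docs, k, iters=50, alpha=0.1, beta=0.01, top_n=5, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return the top words of each topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * k for _ in docs]          # tokens of doc d assigned to topic t
    topic_word = [Counter() for _ in range(k)]   # occurrences of word w in topic t
    topic_total = [0] * k                        # tokens assigned to topic t
    assign = [[rng.randrange(k) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assign[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assign[d][i]                 # remove the token's current assignment
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][j] + alpha) * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]  # resample its topic
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return [[w for w, c in tw.most_common(top_n) if c > 0] for tw in topic_word]

def umass_coherence(topics, docs):
    """Mean UMass-style coherence over topics, from document co-occurrence counts."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    scores = []
    for words in topics:
        pairs = [(words[m], words[l]) for m in range(1, len(words)) for l in range(m)]
        if pairs:
            total = sum(math.log((doc_freq(wm, wl) + 1) / doc_freq(wl)) for wm, wl in pairs)
            scores.append(total / len(pairs))
    return sum(scores) / len(scores) if scores else float("-inf")

def select_num_topics(docs, candidates):
    """Fit one model per candidate topic count and return the most coherent one."""
    scores = {k: umass_coherence(fit_lda(docs, k), docs) for k in candidates}
    return max(scores, key=scores.get), scores

docs = [preprocess(t) for t in TRANSCRIPTS]
best_k, scores = select_num_topics(docs, candidates=[2, 3, 4])
print(best_k, scores)
```

On a real corpus, the candidate range and the coherence measure would need tuning; UMass-style scores in particular tend to favor smaller topic counts, so the selected value should be checked against the topics' top words.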

Keywords: audio transcription, data mining, machine learning, topic modeling


RODRIGUES, João Pedro; PARAISO, Emerson. From audio to information: Learning topics from audio transcripts. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 8., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 121-128. ISSN 2763-8944. DOI: