Contributions to Social Media Analysis Based on Topic Modelling


This work proposes a computational approach to a task deeply rooted in the human sciences: the use of natural language processing for text analysis. Researchers in that field often need to extract information from large masses of textual data. One such application is topic modelling, the task of discovering the topics discussed in a collection of texts. Several techniques are available for it, such as Latent Dirichlet Allocation (LDA), the Biterm Topic Model (BTM), BERTopic (built on Bidirectional Encoder Representations from Transformers) and Non-negative Matrix Factorization (NMF). In this work, we design a methodological setup and perform a comparative analysis of these techniques on data retrieved from Twitter. Through this social medium, we seek to contribute to the study of political, economic and social issues, as well as to assess the relative merits of topic modelling techniques. The results indicate the highest topic coherence for BERTopic, followed by NMF, then BTM and, lastly, LDA.

Keywords: computational human sciences, natural language processing, social media analysis, topic modelling
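The comparison described above hinges on topic coherence as the evaluation criterion. As an illustration of what such a measure computes, the sketch below implements the UMass coherence score, one member of the family of measures surveyed by Röder et al. in the references; it is not necessarily the exact variant used in the paper, and the toy documents and topic words are hypothetical rather than the paper's Twitter data.

```python
import math

def umass_coherence(topic_words, documents):
    """UMass topic coherence: sums log((D(wi, wj) + 1) / D(wj))
    over ordered pairs of top topic words, where D(...) counts the
    documents containing all the given words. Assumes every topic
    word occurs in at least one document (so D(wj) > 0)."""
    doc_sets = [set(d) for d in documents]

    def doc_count(*words):
        # number of documents containing all of the given words
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((doc_count(wi, wj) + 1) / doc_count(wj))
    return score

# Hypothetical tokenized documents, standing in for preprocessed tweets.
docs = [
    ["economy", "inflation", "tax"],
    ["economy", "inflation", "policy"],
    ["election", "vote", "party"],
]

coherent = umass_coherence(["economy", "inflation"], docs)  # words co-occur
mixed = umass_coherence(["economy", "vote"], docs)          # words never co-occur
print(coherent > mixed)  # -> True
```

Higher scores indicate that a topic's top words tend to appear in the same documents, which is the intuition behind ranking topic models by coherence.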


Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., and Kochut, K. A brief survey of text mining: Classification, clustering and extraction techniques, 2017.

Angelov, D. Top2Vec: Distributed representations of topics. arXiv preprint arXiv:2008.09470, 2020.

Bird, S., Klein, E., and Loper, E. Natural language processing with Python. O’Reilly Media, 2009.

Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan): 993–1022, 2003.

Churchill, R. and Singh, L. The evolution of topic modeling. ACM Comput. Surv. 54 (10s), Nov., 2022.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6): 391–407, 1990.

Gallagher, R. J., Reing, K., Kale, D., and Ver Steeg, G. Anchored correlation explanation: Topic modeling with minimal domain knowledge. Transactions of the Association for Computational Linguistics vol. 5, pp. 529–542, 2017.

Grootendorst, M. BERTopic, 2022a.

Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure, 2022b.

Halbwachs, M. La mémoire collective [The Collective Memory]. Paris, France: Presses Universitaires de France, 1950.

Hofmann, T. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’99. Association for Computing Machinery, New York, NY, USA, pp. 50–57, 1999.

Honnibal, M. and Montani, I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, 2017. To appear.

Jónsson, E. An evaluation of topic modelling techniques for Twitter, 2016.

Krippendorff, K. Content analysis. SAGE Publications, Thousand Oaks, CA, 2018.

Machado, M. G. and Colevati, J. Anticomunismo e Gramscismo Cultural no Brasil. Revista Aurora 14 (Edição Especial): 23–34, July, 2021.

Nisha and Kumar R, D. A. Implementation on text classification using bag of words model. SSRN Electron. J., 2019.

Paatero, P. and Tapper, U. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5 (2): 111–126, 1994.

Panichella, A. A systematic comparison of search-based approaches for LDA hyperparameter tuning. Information and Software Technology vol. 130, pp. 106411, 2021.

Rehůřek, R. and Sojka, P. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp. 45–50, 2010.

Roberts, M. E., Stewart, B. M., Tingley, D., Airoldi, E. M., et al. The structural topic model and applied social science. In Advances in neural information processing systems workshop on topic models: computation, application, and evaluation. Vol. 4. Harrahs and Harveys, Lake Tahoe, pp. 1–20, 2013.

Robila, M. and Robila, S. A. Applications of artificial intelligence methodologies to behavioral and social sciences. Journal of Child and Family Studies 29 (10): 2954–2966, Oct., 2020.

Röder, M., Both, A., and Hinneburg, A. Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. WSDM ’15. Association for Computing Machinery, New York, NY, USA, pp. 399–408, 2015.

Shadrova, A. Topic models do not model topics: epistemological remarks and steps towards best practices. Journal of Data Mining & Digital Humanities vol. 2021, 2021.

Sridhar, V. K. R. Unsupervised topic modeling for short texts using distributed representations of words. In Proceedings of the 1st workshop on vector space modeling for natural language processing. pp. 192–200, 2015.

Sumikawa, Y., Jatowt, A., and Düring, M. Digital history meets microblogging: Analyzing collective memories in twitter. In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries. JCDL ’18. Association for Computing Machinery, New York, NY, USA, pp. 213–222, 2018.

Teh, Y., Jordan, M., Beal, M., and Blei, D. Sharing clusters among related groups: Hierarchical Dirichlet processes. Advances in Neural Information Processing Systems vol. 17, 2004.

Wang, Y.-X. and Zhang, Y.-J. Nonnegative matrix factorization: A comprehensive review. IEEE Transactions on Knowledge and Data Engineering 25 (6): 1336–1353, 2013.

Yan, X., Guo, J., Lan, Y., and Cheng, X. A biterm topic model for short texts. In Proceedings of the 22nd International Conference on World Wide Web. WWW ’13. Association for Computing Machinery, New York, NY, USA, pp. 1445–1456, 2013.
How to Cite

BERGAMINI GOMES, Giordanno Brunno; ATTUX, Romis. Contributions to Social Media Analysis Based on Topic Modelling. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 11., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 113-120. ISSN 2763-8944. DOI: