Topic Modeling in Short Texts: An Experimental Evaluation

Annie Amorim; Nils Murrugarra-Llerena; Vítor Silva; Daniel de Oliveira; Aline Paes

doi:10.5753/sbbd.2022.224314

Annie Amorim, Universidade Federal Fluminense
Nils Murrugarra-Llerena, Weber State University
Vítor Silva, Snap Inc.
Daniel de Oliveira, Universidade Federal Fluminense
Aline Paes, Universidade Federal Fluminense

Abstract

Social networks are used to express opinions and to interact with other people. Given the broad range of subjects published and the informal language of the posts, searching for information is significantly challenging. Automatically discovering the topics addressed in these noisy, low-context texts is therefore essential. In this scenario, this paper contributes a comparative analysis of topic modeling methods, including those based on probabilistic and neural approaches. In addition, the paper contributes a method for automatically labeling topics, enabling a qualitative analysis of the discovered topics.

Keywords: Topic Modeling, Social Networks
