Evaluating Topic Modeling Pre-processing Pipelines for Portuguese Texts

  • Antônio Pereira De Souza Júnior UFSJ
  • Pablo Cecilio UFSJ
  • Felipe Viegas UFMG
  • Washington Cunha UFMG
  • Elisa Tuler De Albergaria UFSJ
  • Leonardo Chaves Dutra Da Rocha UFSJ

Abstract

Topic Modeling (TM) is among the most widely used approaches for extracting and organizing information from large amounts of data. These approaches aim to find semantic topics in textual documents (e.g., product reviews, tweets). Despite their good results on English texts, we do not observe the same semantic quality when they are applied to Portuguese texts, which are more verbose and present varied and complex verb conjugations and many homonyms, among other particularities. This work intends to fill this gap by exploring and evaluating different Topic Modeling pre-processing pipelines for Portuguese texts, that is, sequences of tasks that need to be performed before the TM strategies are applied. More specifically, we evaluate different pre-processing pipeline configurations using different semantic data representations to overcome the challenges faced by TM strategies on Portuguese text. In our experimental evaluation, considering two datasets collected from Twitter and Reddit related to Brazilian political discussion, we show that our proposed extended pre-processing pipeline, especially when considering semantic representations, can achieve significant gains in effectiveness compared to the TM approaches originally proposed for English texts (up to 9x better).
Keywords: Topic Modeling, Pre-processing Pipeline, Semantic Data Representation, Portuguese Text

Published
November 7, 2022
SOUZA JÚNIOR, Antônio Pereira De; CECILIO, Pablo; VIEGAS, Felipe; CUNHA, Washington; ALBERGARIA, Elisa Tuler De; ROCHA, Leonardo Chaves Dutra Da. Evaluating Topic Modeling Pre-processing Pipelines for Portuguese Texts. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 28., 2022, Curitiba. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 203-213.