An In-Depth Study of Semantic Word Groups (CluWords) in NLP Tasks

Felipe Viegas; Leonardo Rocha; Marcos André Gonçalves

doi:10.5753/webmedia_estendido.2024.241871

Felipe Viegas (UFMG)
Leonardo Rocha (UFSJ)
Marcos André Gonçalves (UFMG)

Abstract

This Ph.D. dissertation proposes, designs, and evaluates a novel textual document representation that exploits the “best of two worlds”: efficient and effective frequentist information (TF-IDF representations) combined with semantic information derived from word embeddings. In more detail, our proposal, called CluWords, groups syntactically and semantically related words into clusters and applies domain-specific and application-oriented filtering and weighting schemes over them to build powerful document representations tuned for the task at hand. We apply the CluWords concept to four Natural Language Processing (NLP) applications related to topics of interest to WebMedia: topic modeling, hierarchical topic modeling, sentiment lexicon building, and sentiment analysis. Some of the novel contributions of this dissertation include: (i) the introduction of a new data representation; (ii) the design of CluWords components capable of improving the effectiveness of topic modeling, hierarchical topic modeling, and sentiment analysis applications; and (iii) the proposal of two new topic quality metrics to assess the topical quality of hierarchical structures. Our extensive experimentation demonstrates that CluWords achieves the current state of the art in topic modeling and hierarchical topic modeling. For sentiment analysis, our experiments show that CluWords filtering and weighting can mitigate semantic noise, surpassing powerful Transformer architectures in the task. Our results were published in some of the most important conferences and journals of the field, as detailed in this document. Our work was supported by two Google Research Awards.

Keywords: data representation, topic modeling, sentiment analysis, natural language processing
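
The abstract describes grouping related words into clusters and then filtering and weighting them. The sketch below illustrates that general idea only; it is not the dissertation's implementation. The function name `cluword_representation`, the cosine-similarity threshold `alpha`, and the smoothed IDF formula are all assumptions chosen for illustration, and the toy vectors are made up.

```python
import numpy as np

def cluword_representation(tf, embeddings, alpha=0.4):
    """Hypothetical sketch of a cluster-of-words document representation.

    tf:         (n_docs, n_terms) term-frequency matrix
    embeddings: (n_terms, dim) one embedding vector per vocabulary term
    alpha:      assumed similarity threshold used for filtering
    """
    # Cosine similarity between every pair of vocabulary terms.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Filtering: row t is the cluster centred on term t; neighbours
    # below the threshold are dropped (set to zero).
    clusters = np.where(sim >= alpha, sim, 0.0)

    # Cluster-level term frequency: a document's counts are spread over
    # every cluster its terms belong to, weighted by similarity.
    cluster_tf = tf @ clusters.T

    # Weighting: a smoothed IDF computed over clusters (an assumption).
    n_docs = tf.shape[0]
    df = np.count_nonzero(cluster_tf > 0, axis=0)
    idf = np.log((1 + n_docs) / (1 + df)) + 1.0
    return cluster_tf * idf

# Toy vocabulary: "good", "great", "car" (made-up 2-d vectors).
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
tf = np.array([[1.0, 0.0, 0.0],   # document mentioning "good"
               [0.0, 0.0, 1.0]])  # document mentioning "car"
rep = cluword_representation(tf, embeddings)
```

In this toy case the first document, which contains only "good", also receives weight in the cluster centred on "great", while the unrelated "car" cluster stays at zero. That is the kind of semantic enrichment of frequentist counts the abstract refers to.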
