On the Role of Semantic Word Clusters — CluWords — in Natural Language Processing (NLP) Tasks

  • Felipe Viegas, UFMG
  • Leonardo Rocha, UFSJ
  • Marcos André Gonçalves, UFMG

Abstract


The ability to represent data in meaningful and tractable ways is crucial for Natural Language Processing (NLP) applications. This Ph.D. dissertation focused on proposing, designing, and evaluating a novel textual document representation that exploits the "best of two worlds": efficient and effective frequentist information (TF-IDF representations) combined with semantic information derived from word embedding representations. In more detail, our proposal, called CluWords, groups syntactically and semantically related words into clusters and applies domain-specific and application-oriented filtering and weighting schemes over them to build powerful document representations especially tuned for the task at hand. We apply the CluWords concept to four NLP applications: topic modeling, hierarchical topic modeling, sentiment lexicon building, and sentiment analysis. Some of the novel contributions of this dissertation include: (i) the introduction of a new data representation composed of three general steps (clustering, filtering, and weighting), specially designed to overcome task-specific challenges related to noise and lack of information; (ii) the design of CluWords components capable of improving the effectiveness of topic modeling, hierarchical topic modeling, and sentiment analysis applications; and (iii) the proposal of two new topic quality metrics to assess the topical quality of hierarchical structures. Our extensive experimentation demonstrates that CluWords produces state-of-the-art results in topic modeling and hierarchical topic modeling. For sentiment analysis, our experiments show that CluWords' filtering and weighting can mitigate semantic noise, surpassing powerful Transformer architectures on the task. All code and datasets produced in this dissertation are available for replication. Our results were published in some of the most important conferences and journals of the field, as detailed in this document. Our work was supported by two Google Research Awards.
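To make the three steps concrete, the sketch below illustrates the general idea in plain NumPy: each vocabulary word defines a CluWord containing its embedding neighbors above a cosine-similarity threshold (clustering and filtering), and documents are then re-weighted over these clusters with a TF-IDF-style scheme (weighting). This is a minimal illustrative sketch, not the dissertation's actual implementation; the function names, the threshold value, and the toy data are assumptions made for the example.

```python
import numpy as np

def build_cluwords(embeddings, alpha=0.4):
    """Clustering + filtering: for each vocabulary word, keep only the
    neighbors whose cosine similarity exceeds the threshold `alpha`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T              # pairwise cosine similarities (V x V)
    sim[sim < alpha] = 0.0           # filtering: drop weak neighbors
    return sim                       # row i = the CluWord centered on word i

def cluword_tfidf(doc_term_counts, cluword_sim):
    """Weighting: spread raw term counts over each word's CluWord
    neighbors (weighted by similarity), then apply a standard IDF."""
    tf = doc_term_counts @ cluword_sim             # semantic term frequency
    df = (tf > 0).sum(axis=0)                      # document frequency per CluWord
    idf = np.log((1 + tf.shape[0]) / (1 + df)) + 1.0
    return tf * idf

# Toy usage: a 5-word vocabulary, random stand-in embeddings, 2 documents.
rng = np.random.default_rng(0)
vocab = ["good", "great", "bad", "movie", "film"]
emb = rng.normal(size=(len(vocab), 50))            # would be pre-trained vectors
counts = np.array([[1, 0, 0, 2, 0],
                   [0, 1, 1, 0, 1]], dtype=float)  # raw term counts per document
X = cluword_tfidf(counts, build_cluwords(emb, alpha=0.3))
print(X.shape)   # (2, 5): documents expressed in the CluWords space
```

The sketch only conveys the representation-building stage; in the dissertation, the resulting CluWords representation feeds downstream topic modeling and sentiment analysis methods.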

References

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL’14.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dufter, P., Kassner, N., and Schütze, H. (2021). Static embeddings as efficient knowledge bases?

Greene, D., O'Callaghan, D., and Cunningham, P. (2014). How many topics? Stability analysis for topic models. CoRR.

Hutto, C. J. and Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. In ICWSM’14.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., and Joulin, A. (2018). Advances in pre-training distributed word representations. In LREC’18.

Nooralahzadeh, F., Øvrelid, L., and Lønning, J. T. (2018). Evaluation of Domain-specific Word Embeddings using Knowledge Resources. In LREC 2018.

Pennington, J., Socher, R., and Manning, C. D. (2014). Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., and Qin, B. (2014). Learning sentiment-specific word embedding for twitter sentiment classification. In ACL 2014, pages 1555–1565.

Viegas, F., Alvim, M. S., Canuto, S., Rosa, T., Gonçalves, M. A., and Rocha, L. (2020a). Exploiting semantic relationships for unsupervised expansion of sentiment lexicons. Information Systems, 94:101606.

Viegas, F., Canuto, S., Cunha, W., França, C., Valiense, C., Rocha, L., and Gonçalves, M. A. (2023). Clusent – combining semantic expansion and de-noising for dataset-oriented sentiment analysis of short texts. In Webmedia 2023, page 110–118.

Viegas, F., Canuto, S., Gomes, C., Luiz, W., Rosa, T., Ribas, S., Rocha, L., and Gonçalves, M. A. (2019). Cluwords: Exploiting semantic word clustering representation for enhanced topic modeling. In WSDM ’19.

Viegas, F., Cunha, W., Gomes, C., Pereira, A., Rocha, L., and Gonçalves, M. A. (2020b). Cluhtm: Semantic hierarchical topic modeling based on cluwords. In ACL'20.

Viegas, F., Luiz, W., Gomes, C., Khatibi, A., Canuto, S., Mourão, F., Salles, T., Rocha, L., and Gonçalves, M. A. (2018). Semantically-enhanced topic modeling. In CIKM ’18.
Published
21/07/2024
VIEGAS, Felipe; ROCHA, Leonardo; GONÇALVES, Marcos André. On the Role of Semantic Word Clusters — CluWords — in Natural Language Processing (NLP) Tasks. In: CONCURSO DE TESES E DISSERTAÇÕES (CTD), 37., 2024, Brasília/DF. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 38-47. ISSN 2763-8820. DOI: https://doi.org/10.5753/ctd.2024.2036.