CluSent – Combining Semantic Expansion and De-Noising for Dataset-Oriented Sentiment Analysis of Short Texts

  • Felipe Viegas UFMG
  • Sergio Canuto IFG
  • Washington Cunha UFMG
  • Celso França UFMG
  • Claudio Valiense UFMG
  • Leonardo Rocha UFSJ
  • Marcos André Gonçalves UFMG


Short texts often carry too little information to support effective sentiment models. They can be enriched with more complex semantic relationships that better capture affective information, but this enrichment may introduce noise into the data. This work proposes CluSent, a new strategy for customized, dataset-oriented sentiment analysis that exploits CluWords, a powerful, recently proposed concept for representing semantically related words. CluSent tackles both information shortage and noise by: (i) exploiting the semantic neighborhood of words in a given pre-trained word embedding to enrich document representation and (ii) introducing dataset-oriented filtering and weighting mechanisms that cope with noise by taking advantage of the polarity and intensity information from lexicons. In our experimental evaluation, considering 19 datasets, five state-of-the-art baselines (including modern transformer architectures), and two metrics, CluSent was the best method in 30 of the 38 dataset-metric combinations, with significant gains (over 14%) over the strongest baselines.
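The two steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors, the lexicon scores, the similarity threshold, the weighting formula, and the function names `cosine` and `expand` are all assumptions made for illustration. Each word is expanded with its close neighbors in embedding space, and neighbors whose lexicon polarity opposes the word's polarity are then dropped as noise.

```python
import numpy as np

# Toy 2-d word vectors (hypothetical values, for illustration only).
EMBEDDINGS = {
    "good": np.array([1.0, 0.1]),
    "great": np.array([0.9, 0.2]),
    "bad": np.array([0.8, 0.6]),    # close to "good" in this toy space
    "movie": np.array([0.1, 1.0]),
}

# Toy polarity lexicon with scores in [-1, 1] (hypothetical values).
LEXICON = {"good": 0.6, "great": 0.8, "bad": -0.7, "movie": 0.0}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def expand(word, sim_threshold=0.8):
    """Return {neighbor: weight} for `word`.

    Step (i): keep neighbors whose cosine similarity to `word`
    reaches `sim_threshold` (semantic expansion).
    Step (ii): drop neighbors with opposite lexicon polarity and
    scale the rest by lexicon intensity (an assumed weighting).
    """
    weights = {}
    for other, vec in EMBEDDINGS.items():
        sim = cosine(EMBEDDINGS[word], vec)
        if sim < sim_threshold:
            continue  # not in the semantic neighborhood
        if LEXICON[word] * LEXICON[other] < 0:
            continue  # opposite polarity: treated as noise
        weights[other] = sim * abs(LEXICON[other])
    return weights


neighbors = expand("good")
print(sorted(neighbors))
```

In this toy setup "bad" passes the similarity threshold for "good" but is removed by the polarity filter, which is the kind of noise the abstract refers to.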

Keywords: Sentiment Analysis, Classification, Natural Language Processing


VIEGAS, Felipe; CANUTO, Sergio; CUNHA, Washington; FRANÇA, Celso; VALIENSE, Claudio; ROCHA, Leonardo; GONÇALVES, Marcos André. CluSent – Combining Semantic Expansion and De-Noising for Dataset-Oriented Sentiment Analysis of Short Texts. In: SIMPÓSIO BRASILEIRO DE SISTEMAS MULTIMÍDIA E WEB (WEBMEDIA), 29., 2023, Ribeirão Preto/SP. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 110–118.
