Cluster Fusion Training: Exploring Cluster Analysis to Enhance Cross-Domain Sentiment Classification

  • Victor Akihito Kamada Tomita Universidade de São Paulo
  • Angelo Cesar Mendes da Silva Universidade de São Paulo
  • Ricardo Marcondes Marcacini Universidade de São Paulo


Due to the scarcity of data for specific domains, many studies opt to train models across domains. The most common approach is to train models on all source domains and then validate their performance on the target domain. However, this approach does not take into account that the same word can carry different meanings depending on the domain. This paper proposes a new method that uses clustering techniques to group similar data. Specialist models are trained on these groups and then combined in a fusion process. With this method, significant improvements of up to 5% in accuracy are demonstrated for classification models.

Keywords: sentiment analysis, cross-domain, clustering
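
The idea summarized in the abstract (group similar data, train one specialist per group, fuse the specialists at prediction time) can be illustrated with a minimal, standard-library-only sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the bag-of-words features, the tiny k-means with cosine similarity, the Naive Bayes specialists, the similarity-weighted fusion rule, the toy data, and all function names.

```python
import math
from collections import Counter, defaultdict

def bow(text):
    """Bag-of-words vector as a Counter of lowercase tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def kmeans(vectors, k, iters=10):
    """Tiny cosine k-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [Counter(v) for v in vectors[:k]]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # summed counts work as a centroid: cosine ignores scale
                centroids[c] = sum(members, Counter())
    return centroids, assign

class NaiveBayes:
    """Stand-in specialist: multinomial Naive Bayes with add-one smoothing."""
    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bow(text))
        self.vocab = {t for counts in self.word_counts.values() for t in counts}
        return self

    def predict_proba(self, text):
        total = sum(self.class_counts.values())
        logp = {}
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab) + 1  # +1 slot for unseen words
            logp[label] = math.log(n / total) + sum(
                cnt * math.log((counts[t] + 1) / denom) for t, cnt in bow(text).items())
        top = max(logp.values())
        exp = {label: math.exp(v - top) for label, v in logp.items()}
        z = sum(exp.values())
        return {label: e / z for label, e in exp.items()}

def train_cluster_fusion(texts, labels, k):
    """Cluster the pooled training data, then fit one specialist per cluster."""
    centroids, assign = kmeans([bow(t) for t in texts], k)
    experts = []
    for c in range(k):
        # Fall back to all data if a cluster ends up empty.
        idx = [i for i, a in enumerate(assign) if a == c] or list(range(len(texts)))
        experts.append(NaiveBayes().fit([texts[i] for i in idx], [labels[i] for i in idx]))
    return centroids, experts

def fused_predict(text, centroids, experts):
    """Assumed fusion rule: weight each specialist by input-to-centroid similarity."""
    vec = bow(text)
    weights = [cosine(vec, c) for c in centroids]
    if sum(weights) == 0:
        weights = [1.0] * len(centroids)
    scores = Counter()
    for w, expert in zip(weights, experts):
        for label, p in expert.predict_proba(text).items():
            scores[label] += w * p
    return max(scores, key=scores.get)

# Toy data: "unpredictable" is praise for a movie but a complaint about a phone.
TEXTS = ["movie plot unpredictable thrilling", "phone battery unpredictable faulty",
         "movie plot dull boring", "phone battery reliable solid",
         "movie ending unpredictable thrilling", "phone screen unpredictable faulty",
         "movie acting dull boring", "phone screen reliable solid"]
LABELS = ["pos", "neg", "neg", "pos", "pos", "neg", "neg", "pos"]

centroids, experts = train_cluster_fusion(TEXTS, LABELS, k=2)
print(fused_predict("movie unpredictable", centroids, experts))  # pos
print(fused_predict("phone unpredictable", centroids, experts))  # neg
```

In this toy setup a single model trained on the pooled data would see "unpredictable" equally often with both labels, whereas each specialist sees it with only one. That is the domain-dependent semantics the abstract refers to. A real system would presumably replace the Naive Bayes specialists with stronger classifiers and tune the number of clusters.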


TOMITA, Victor Akihito Kamada; DA SILVA, Angelo Cesar Mendes; MARCACINI, Ricardo Marcondes. Cluster Fusion Training: Exploring Cluster Analysis to Enhance Cross-Domain Sentiment Classification. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 20., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 330-344. ISSN 2763-9061. DOI: