Transfer learning for Twitter sentiment analysis: Choosing an effective source dataset

  • Eliseu Guimarães Universidade Federal Fluminense, Marinha do Brasil
  • Jonnathan Carvalho Instituto Federal Fluminense
  • Aline Paes Universidade Federal Fluminense
  • Alexandre Plastino Universidade Federal Fluminense


Sentiment analysis on social media data is challenging, in part because labeled training data is not always available. Transfer learning approaches address this problem by leveraging a labeled source domain to obtain a model for a target domain that is different from but related to the source domain. The question that arises is how to choose suitable source data for training the target classifier. One option is to base this choice on the similarity between source and target data, as measured by distance metrics. This article investigates the relationship between these distance metrics and classifier performance. To this end, we evaluate four metrics combined with distinct dataset representations. Computational experiments on Twitter sentiment analysis showed that the cosine similarity metric combined with bag-of-words normalized with term frequency-inverse document frequency (TF-IDF) yielded the best results in terms of predictive power, in many cases outperforming even classifiers trained on the target dataset.
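As an illustration of the idea summarized above, the following minimal sketch ranks candidate source datasets by cosine similarity to a target dataset under a TF-IDF bag-of-words representation. It is not the authors' implementation. The function names (`tfidf_vectors`, `cosine_similarity`, `rank_sources`), the whitespace tokenizer, the smoothed IDF formula, and the choice to collapse each dataset into a single document are all assumptions made for this example, and the tweets are toy data.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    # Assumption: naive lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumption: smoothed IDF, so terms present in every document
        # still receive a positive weight.
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_sources(target_tweets, sources):
    """Rank candidate source datasets by similarity to the target dataset.

    Assumption: each dataset is represented as one document formed by
    concatenating its tweets.
    """
    names = list(sources)
    docs = [" ".join(target_tweets)] + [" ".join(sources[n]) for n in names]
    vectors = tfidf_vectors(docs)
    target_vec = vectors[0]
    scores = {
        name: cosine_similarity(target_vec, vec)
        for name, vec in zip(names, vectors[1:])
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data, invented for this example only.
    target = ["great movie loved the acting", "terrible movie boring plot"]
    candidates = {
        "movies": ["loved this movie great plot", "boring acting terrible film"],
        "airlines": ["flight delayed again terrible service",
                     "great crew smooth flight"],
    }
    for name, score in rank_sources(target, candidates):
        print(f"{name}: {score:.3f}")
```

Under this sketch, the highest-ranked candidate would be the one used to train the classifier that is then applied to the target tweets.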

Keywords: dataset representation, machine learning, metrics, sentiment analysis, supervised learning, transfer learning


GUIMARÃES, Eliseu; CARVALHO, Jonnathan; PAES, Aline; PLASTINO, Alexandre. Transfer learning for Twitter sentiment analysis: Choosing an effective source dataset. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 8., 2020, Online Event. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 161-168. ISSN 2763-8944. DOI: