Enriching datasets for sentiment analysis in tweets with instance selection

Eliseu Guimarães; Daniela Vianna; Aline Paes; Alexandre Plastino

doi:10.5753/kdmile.2021.17463

Eliseu Guimarães UFF / Marinha do Brasil
Daniela Vianna Independent researcher
Aline Paes UFF http://orcid.org/0000-0002-9089-7303
Alexandre Plastino UFF http://orcid.org/0000-0003-4039-0915

Abstract

Sentiment analysis in tweets is a research field of great importance, mainly due to the popularity of Twitter. However, collecting and annotating tweets is expensive and time-consuming, so some domains have only a limited set of labeled data. A promising strategy to handle this issue is to leverage labeled, data-rich domains to select instances that enrich target datasets. This paper proposes strategies for selecting instances from a set of labeled source datasets in order to improve on the performance of classifiers trained only with the target dataset. The approaches include different similarity metrics and variations in the number of selected instances. The results show that the size of the training set plays an essential role in the predictive capacity of the classifier. Furthermore, the results point out the importance of taking diversity criteria into account when selecting instances.
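
As a rough illustration of similarity-based instance selection, the sketch below ranks labeled source tweets by cosine similarity to the centroid of the target dataset and keeps the top k. The abstract does not state which text representation or similarity metrics the paper uses, so the bag-of-words vectors, the centroid comparison, the function names (`bow`, `cosine`, `select_instances`), and the toy tweets are all assumptions made for this example. It also ignores the diversity criteria that the results highlight.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector as a token -> count mapping (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_instances(source, target_texts, k):
    """Return the k labeled source instances most similar to the target centroid.

    source: list of (text, label) pairs pooled from the source datasets.
    target_texts: list of texts from the target dataset.
    """
    centroid = Counter()
    for text in target_texts:
        centroid.update(bow(text))
    ranked = sorted(source, key=lambda pair: cosine(bow(pair[0]), centroid), reverse=True)
    return ranked[:k]


# Hypothetical toy data: a small target dataset and a pool of labeled source tweets.
target = [("the battery life is great", "positive"), ("battery died too fast", "negative")]
source = [
    ("great battery and screen", "positive"),
    ("the movie was boring", "negative"),
    ("my phone battery is awful", "negative"),
]

selected = select_instances(source, [text for text, _ in target], k=2)
enriched_training_set = target + selected  # target data plus selected source instances
```

Varying `k` corresponds to varying the number of selected instances; a classifier would then be trained on `enriched_training_set` and compared against one trained on `target` alone.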

Keywords: machine learning, sentiment analysis, supervised learning, transfer learning
