Enriching datasets for sentiment analysis in tweets with instance selection
Sentiment analysis in tweets is a research field of great importance, mainly due to the popularity of Twitter. However, collecting and annotating tweets is expensive and time-consuming, so some domains have only a limited set of labeled data. A promising strategy to handle this issue is to leverage labeled domains rich in data, selecting instances from them to enrich target datasets. This paper proposes strategies for selecting instances from a set of labeled source datasets in order to improve on the performance of classifiers trained only with the target dataset. The strategies differ in the similarity metric used and in the number of selected instances. The results show that the size of the training set plays an essential role in the predictive capacity of the classifier. Furthermore, the results point out the importance of taking diversity criteria into account when selecting the instances.
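To make the idea of similarity-based instance selection concrete, the following is a minimal sketch and not the paper's implementation. It assumes a bag-of-words representation, cosine similarity to the centroid of the target tweets, and a fixed number k of selected instances; the function names (`bow`, `cosine`, `select_instances`) and the toy tweets are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words term counts for one tweet (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_instances(source, target_texts, k):
    """Return the k labeled source instances most similar to the target set.

    source: list of (text, label) pairs pooled from the source datasets.
    target_texts: list of tweet texts from the target dataset.
    """
    # Summarize the target dataset as the sum of its term-count vectors.
    centroid = Counter()
    for text in target_texts:
        centroid.update(bow(text))
    # Rank source instances by similarity to the target summary, best first.
    ranked = sorted(
        source,
        key=lambda pair: cosine(bow(pair[0]), centroid),
        reverse=True,
    )
    return ranked[:k]


# Toy data, invented for this example.
target = ["the new phone battery is great", "phone screen cracked so sad"]
source = [
    ("love the battery on this phone", "positive"),
    ("the movie plot was boring", "negative"),
    ("my phone screen is awful", "negative"),
]
selected = select_instances(source, target, k=2)
```

The selected pairs would then be appended to the target training set before fitting a classifier. This sketch ranks purely by similarity; a diversity criterion, which the results above indicate is important, would need an additional step such as penalizing candidates that are too close to instances already selected.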