Opinion Mining and Active Learning: a Comparison of Sampling Strategies
Abstract
There are two main problems when performing Opinion Mining (OM) on data streams: the lack of labeled data and the need to update the learning model. The most widely used OM techniques do not handle these challenges well, so an alternative is to use semi-supervised methods such as Active Learning, which labels only selected data instead of the whole dataset; however, it requires choosing a sampling strategy to select the data to be labeled. In this paper, we evaluate eight strategies on ten datasets in order to identify the best ones for OM with Twitter streams. According to our experiments, the Entropy strategy showed the best results, but it selects a large number of instances to be labeled, which calls for further investigation.
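To make the idea of a sampling strategy concrete, the following is a minimal sketch of entropy-based selection, not the implementation evaluated in the paper. It assumes a classifier that already outputs class probabilities for each incoming tweet; the function names, the example probabilities, and the fixed threshold of 0.5 are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def prediction_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    # Zero-probability classes contribute nothing, so they are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(
    batch_probs: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Return the indices of instances whose entropy exceeds the threshold.

    The threshold is an assumed tuning parameter, not a value from the paper.
    """
    return [
        i
        for i, probs in enumerate(batch_probs)
        if prediction_entropy(probs) > threshold
    ]


# Hypothetical (positive, negative) probabilities for four incoming tweets.
batch = [
    [0.98, 0.02],  # confident prediction, low entropy
    [0.55, 0.45],  # uncertain prediction, high entropy
    [0.50, 0.50],  # maximally uncertain for two classes
    [0.10, 0.90],  # fairly confident prediction
]

# Only the uncertain instances are sent to a human annotator.
queried = select_for_labeling(batch, threshold=0.5)
print(queried)
```

In this sketch, lowering the threshold sends more instances to the annotator, which is one way to picture the trade-off between accuracy and labeling cost noted in the abstract.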