Improving distributed vector representation of short and noisy texts in the context of online classification

  • Renato Silva Universidade Federal de São Carlos
  • Johannes Lochter Universidade Estadual de Campinas / Centro Universitário Facens
  • Tiago Almeida Universidade Federal de São Carlos

Abstract


Classifying user-generated messages from social networks and other Internet platforms is challenging because such messages are usually short and full of slang, abbreviations, and idiomatic expressions, which makes feature extraction difficult. This work proposes a data expansion technique that increases the number of samples in order to improve the quality of the text representation model and raise classification performance. The proposed technique is evaluated in an online sentiment classification scenario using Twitter messages. A statistical analysis of the experimental results indicated that data expansion is effective for the online classification of short and noisy text messages.

Keywords: Machine Learning, Text and Web Mining, Natural Language Processing

Published
15/10/2019
SILVA, Renato; LOCHTER, Johannes; ALMEIDA, Tiago. Improving distributed vector representation of short and noisy texts in the context of online classification. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 16., 2019, Salvador. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 190-201. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2019.9283.