A Topical Word Embeddings for Text Classification

  • João Marcos Carvalho Lima UECE
  • José Everardo Bessa Maia UECE


This paper presents an approach that uses LDA-based topic models to represent documents in text categorization problems. The document representation is built from the cosine similarity between document embeddings and embeddings of topic words, yielding a Bag-of-Topics (BoT) variant. The performance of this approach is compared against that of two other representations: BoW (Bag-of-Words) and Topic Model, both based on standard tf-idf. Also, to reveal the effect of the classifier, we compared the performance of the nonlinear classifier SVM against that of the linear classifier Naive Bayes, taken as baseline. To evaluate the approach we use two corpora, one multi-label (RCV1) and one single-label (20 Newsgroups). The model achieves significant results with low dimensionality when compared to the state of the art.
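The BoT construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes pretrained word embeddings and per-topic top-word lists (e.g. from an LDA model) are already available, and all names and the toy vectors are hypothetical. Each document gets one feature per topic, namely the cosine similarity between the document's embedding and the embedding of that topic's top words, both taken as averages of word vectors.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed(words, vectors):
    # represent a word list by the average of its known word embeddings
    vecs = [vectors[w] for w in words if w in vectors]
    return np.mean(vecs, axis=0)

def bot_features(doc_tokens, topics, vectors):
    """One feature per topic: cosine similarity between the document
    embedding and the embedding of the topic's top words (BoT sketch)."""
    d = embed(doc_tokens, vectors)
    return np.array([cosine(d, embed(top_words, vectors))
                     for top_words in topics])

# Toy word embeddings (illustrative only, not trained)
vectors = {
    "ball": np.array([1.0, 0.1]), "goal": np.array([0.9, 0.2]),
    "vote": np.array([0.1, 1.0]), "law":  np.array([0.2, 0.9]),
}
# Top words of two hypothetical LDA topics
topics = [["ball", "goal"], ["vote", "law"]]

x = bot_features(["ball", "goal", "goal"], topics, vectors)
# x has one similarity score per topic; a sports-heavy document
# should score higher on the sports-like topic than the other one
```

The resulting vector `x` (dimensionality = number of topics, far smaller than a BoW vocabulary) would then be fed to a classifier such as SVM or Naive Bayes.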


Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.

Kim, H. K., Kim, H., and Cho, S. (2017). Bag-of-concepts: Comprehending document representation through clustering words in distributed representation. Neurocomputing, 266:336–352.

Lau, J. H., Grieser, K., Newman, D., and Baldwin, T. (2011). Automatic labelling of topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 1536–1545. Association for Computational Linguistics.

Li, S., Chua, T.-S., Zhu, J., and Miao, C. (2016). Generative Topic Embedding: a Continuous Representation of Documents (Extended Version with Proofs).

Liu, Y., Liu, Z., Chua, T.-S., and Sun, M. (2015). Topical Word Embeddings. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI'15), pages 2418–2424.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. In Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Mouriño-García, M., Pérez-Rodríguez, R., and Anido-Rifón, L. (2015). Bag-of-Concepts Document Representation for Textual News Classification. 6(1):173–188.

Ramage, D., Hall, D., Nallapati, R., and Manning, C. D. (2009). Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pages 248–256. Association for Computational Linguistics.

Rubin, T. N., Chambers, A., Smyth, P., and Steyvers, M. (2012). Statistical topic models for multi-label document classification. Machine learning, 88(1-2):157–208.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Sriurai, W. (2011). Improving Text Categorization by using a Topic Model. Advanced Computing, 2(6):21–27.

How to Cite

LIMA, João Marcos Carvalho; MAIA, José Everardo Bessa. A Topical Word Embeddings for Text Classification. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 15. , 2018, São Paulo. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018 . p. 25-35. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2018.4401.