Vector Semantic Representation for Similarity Analysis of Textual Documents
Abstract
This paper applies Doc2Vec, a Natural Language Processing tool, to the semantic representation of textual documents. The dataset consists of 44 undergraduate final course papers. Text mining techniques were used to process the digital files of the monographs and extract their text. Each document is represented by word vectors, and the model performs term inferences for semantic analysis. As a result, document similarity is expressed as a weighted graph that captures the closeness between the elements of the data sample.
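As an illustration of the final step, the following is a minimal sketch of how a weighted similarity graph could be built once each document has a vector. It rests on assumptions not stated in the abstract: cosine similarity is used as the closeness measure, the three vectors are made-up toy values standing in for Doc2Vec output, and the names `cosine_similarity`, `similarity_graph`, and `doc_vectors` are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_graph(doc_vectors):
    """Return a weighted graph as {(doc_a, doc_b): similarity}.

    Every unordered pair of documents becomes one edge whose weight
    is the cosine similarity of the two document vectors.
    """
    edges = {}
    for a, b in combinations(sorted(doc_vectors), 2):
        edges[(a, b)] = cosine_similarity(doc_vectors[a], doc_vectors[b])
    return edges


# Toy vectors (assumed values, not taken from the paper's data).
doc_vectors = {
    "paper_a": [1.0, 0.0, 1.0],
    "paper_b": [1.0, 0.0, 0.9],
    "paper_c": [0.0, 1.0, 0.0],
}

graph = similarity_graph(doc_vectors)

# List edges from the most similar pair to the least similar pair.
for (a, b), weight in sorted(graph.items(), key=lambda kv: -kv[1]):
    print(f"{a} -- {b}: {weight:.3f}")
```

In a real pipeline the toy vectors would be replaced by the vectors inferred by the trained model, and a threshold on the edge weight could be used to keep only the strongest connections.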
