Centroid-based document selection for patent classification using Word2Vec and KNN

Henrique Camacho Farias; Andreia Gentil Bonfante; Claudia Aparecida Martins

doi:10.5753/semish.2020.11335

Henrique Camacho Farias UFMT
Andreia Gentil Bonfante UFMT
Claudia Aparecida Martins UFMT

Abstract

This paper presents a patent categorization method that combines three components: a vector representation of documents built from word embedding vectors (Word2Vec), document selection based on the centroids computed for each class, and the K-Nearest Neighbour (KNN) algorithm. The goal is to classify patent documents at the section level of the IPC hierarchy in the WIPO dataset. Experimental results indicate that the proposed classification method achieved an accuracy of 75%.
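
The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the toy 2-D vectors stand in for trained Word2Vec embeddings, a document is represented by the average of its word vectors, selection keeps the documents most similar to their own class centroid (the hypothetical `keep_ratio` parameter controls how many), and similarity is measured with the cosine. The labels "F" and "C" refer to IPC sections; all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy 2-D "embeddings" standing in for trained Word2Vec vectors (made-up values).
TOY_EMBEDDINGS = {
    "engine": [1.0, 0.1], "piston": [0.9, 0.2], "valve": [0.8, 0.0],
    "protein": [0.1, 1.0], "enzyme": [0.0, 0.9], "acid": [0.2, 0.8],
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def embed(tokens):
    """Assumed document representation: average of the known word vectors."""
    known = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not known:
        raise ValueError("document contains no known words")
    return mean_vector(known)


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_by_centroid(vectors, labels, keep_ratio=0.75):
    """Assumed selection rule: per class, keep the documents closest to the centroid."""
    by_class = defaultdict(list)
    for vec, label in zip(vectors, labels):
        by_class[label].append(vec)
    selected = []
    for label, vecs in by_class.items():
        centroid = mean_vector(vecs)
        ranked = sorted(vecs, key=lambda v: cosine(v, centroid), reverse=True)
        keep = max(1, math.ceil(keep_ratio * len(ranked)))
        selected.extend((v, label) for v in ranked[:keep])
    return selected


def knn_predict(query, training, k=3):
    """Majority vote among the k training documents most similar to the query."""
    nearest = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Tiny labelled corpus; the last document of each class is a deliberate outlier.
train_docs = [
    (["engine", "piston"], "F"), (["engine", "valve"], "F"),
    (["piston", "valve"], "F"), (["engine", "acid"], "F"),
    (["protein", "enzyme"], "C"), (["enzyme", "acid"], "C"),
    (["protein", "acid"], "C"), (["acid", "valve"], "C"),
]
vectors = [embed(tokens) for tokens, _ in train_docs]
labels = [label for _, label in train_docs]

selected = select_by_centroid(vectors, labels, keep_ratio=0.75)
prediction = knn_predict(embed(["piston", "engine", "valve"]), selected, k=3)
print(prediction)
```

In this toy setting the selection step discards the one outlier in each class before KNN votes, which is the intuition behind filtering the training set by centroid proximity. With real data, the toy dictionary would be replaced by embeddings trained on the patent corpus.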

Keywords: patent categorization, centroid computation, document classification
