An Approach to Attribute Selection Using Zipf's Law and the TF-IDF Measure in the Patent Classification Process
Abstract
Natural language processing aids in understanding data through linguistic methods combined with machine learning and statistical techniques. This work investigates word-frequency algorithms in order to analyze how relevant each word is to a dataset. Zipf's law, combined with Luhn cuts, and the TF-IDF measure are used to select the most relevant attributes for the classification of patent data.
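As a rough illustration of the selection pipeline described above, the following minimal sketch ranks terms by corpus frequency (the ordering Zipf's law describes), applies upper and lower Luhn cuts, and scores the surviving terms with TF-IDF. The function names, the cut thresholds (`skip_top`, `min_freq`), the choice to sum TF-IDF over documents, and the toy documents are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter


def zipf_ranking(docs):
    """Return (term, frequency) pairs sorted by descending corpus frequency."""
    counts = Counter(tok for doc in docs for tok in doc)
    # Break frequency ties alphabetically so the ranking is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def luhn_cut(ranking, skip_top, min_freq):
    """Keep mid-frequency terms only.

    Upper cut: drop the `skip_top` highest-ranked terms.
    Lower cut: drop terms occurring fewer than `min_freq` times.
    Both thresholds are hypothetical parameters.
    """
    return {term for term, freq in ranking[skip_top:] if freq >= min_freq}


def tfidf_scores(docs, vocab):
    """Sum TF-IDF over all documents for each term in `vocab`."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc) if term in vocab)
    scores = Counter()
    for doc in docs:
        tf = Counter(tok for tok in doc if tok in vocab)
        for term, count in tf.items():
            # Term frequency is normalized by the full document length.
            scores[term] += (count / len(doc)) * math.log(n_docs / df[term])
    return scores


def select_attributes(docs, skip_top=1, min_freq=2, k=3):
    """Return the k highest-scoring terms that survive the Luhn cuts."""
    vocab = luhn_cut(zipf_ranking(docs), skip_top, min_freq)
    scores = tfidf_scores(docs, vocab)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy, already-tokenized documents (invented for this example).
docs = [
    "the engine has a piston and the piston moves".split(),
    "the circuit has a resistor and the resistor heats".split(),
    "the engine and the circuit".split(),
]
print(select_attributes(docs, skip_top=2, min_freq=2, k=2))
```

In this toy corpus the upper cut removes the two most frequent words ("the", "and"), the lower cut removes words seen only once, and TF-IDF then favors terms concentrated in a single document. Real patent data would need tokenization, stop-word handling, and thresholds tuned to the corpus.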
References
Correa, L. M. S. (1999). “Aquisição da linguagem: uma retrospectiva dos últimos trinta anos”. Revista DELTA: Documentação de estudos em linguística teórica e aplicada. DOI: 10.1590/S0102-44501999000300014.
Fall, C. J., Tórcsvári, A., Benzineb, K., & Karetka, G. (2003). “Automated categorization in the international patent classification”. ACM SIGIR Forum, 37(1), pp. 10–25. URL [link].
Jing, L., Huang, H., & Shi, H. (2002). “Improved feature selection approach TFIDF in text mining”. Proceedings. International Conference on Machine Learning and Cybernetics. IEEE, pp. 944–946.
Luhn, H. P. (1957). “A Statistical Approach to Mechanized Encoding and Searching of Literary Information”. IBM Journal of Research and Development, 1(4), pp. 309–317. ISSN 0018-8646. DOI: 10.1147/rd.14.0309.
WIPO (2019). Guide to the International Patent Classification. Tech. rep. URL [link].
Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley.
