An Approach to Attribute Selection Using Zipf's Law and the TF-IDF Measure in the Patent Classification Process

  • Carlos Gabriel S. Rodrigues UFMT
  • Claudia A. Martins UFMT

Abstract


Natural language processing combines linguistic methods with machine learning and statistical techniques to help interpret textual data. This work investigates word-frequency algorithms as a way to measure how relevant each word is to a dataset. Zipf's law with Luhn cuts and the TF-IDF measure are used to select the most relevant attributes for classification in the patent domain.

Keywords: Zipf’s law, Luhn cuts, TF-IDF measure, natural language processing
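
The selection idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`luhn_cut`, `tf_idf`, `select_attributes`), the cut thresholds (`lower`, `upper_fraction`), the TF-IDF variant (relative term frequency times ln(N/df)), and ranking terms by their best per-document score are all assumptions made for illustration, as is the toy corpus.

```python
import math
from collections import Counter


def luhn_cut(docs, lower=2, upper_fraction=0.1):
    """Keep mid-frequency terms (assumed thresholds, not from the paper).

    Terms are ranked by corpus frequency (the Zipf rank order); the top
    `upper_fraction` of ranks and terms seen fewer than `lower` times
    are discarded.
    """
    freq = Counter(tok for doc in docs for tok in doc)
    ranked = [term for term, _ in freq.most_common()]
    n_top = int(len(ranked) * upper_fraction)
    return {term for term in ranked[n_top:] if freq[term] >= lower}


def tf_idf(docs, vocab):
    """Per-document TF-IDF scores restricted to `vocab`."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in docs for t in set(doc) if t in vocab)
    scores = []
    for doc in docs:
        tf = Counter(t for t in doc if t in vocab)
        scores.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return scores


def select_attributes(docs, k=3, **cut_params):
    """Return the k terms with the highest TF-IDF score in any document."""
    vocab = luhn_cut(docs, **cut_params)
    best = {}
    for doc_scores in tf_idf(docs, vocab):
        for term, value in doc_scores.items():
            best[term] = max(best.get(term, 0.0), value)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


# Toy corpus standing in for tokenized patent texts (hypothetical data).
docs = [
    "the engine uses a fuel valve and the fuel pump".split(),
    "the valve controls the fuel flow".split(),
    "the antenna receives the radio signal".split(),
    "the radio antenna and the signal filter".split(),
]
selected = select_attributes(docs, k=3)
```

In this sketch the Luhn cuts act as a coarse filter that removes very common and very rare words before TF-IDF ranks what remains; in practice both cut points would be tuned to the corpus.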

Published
2023-11-28
RODRIGUES, Carlos Gabriel S.; MARTINS, Claudia A. An Approach to Attribute Selection Using Zipf's Law and the TF-IDF Measure in the Patent Classification Process. In: REGIONAL SCHOOL ON INFORMATICS OF MATO GROSSO (ERI-MT), 12., 2023, Cuiabá/MT. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 241-245. ISSN 2447-5386. DOI: https://doi.org/10.5753/eri-mt.2023.236614.