An Extensive Empirical Evaluation of Preprocessing Techniques and Supervised One Class Learning Algorithms for Text Classification

  • Marcos Gôlo, Federal University of Mato Grosso do Sul
  • Ricardo Marcacini, Federal University of Mato Grosso do Sul
  • Rafael Rossi, Federal University of Mato Grosso do Sul

Abstract


Text automatic classification (TAC) has attracted academic and business interest because of the massive volume of text being produced. TAC is usually performed through multi-class learning, in which a user must provide labeled texts for every class of an application domain. However, when the goal is only to verify whether a text belongs to a class of interest, one-class learning (OCL) is more suitable, since it requires labeled texts only from the class of interest to generate a classification model. Despite this applicability, studies on the topic do not consider different algorithms, different text pre-processing techniques, or text collections from different domains and with different characteristics. As a result, there is no guide to which algorithms and text pre-processing techniques to use in practical applications. This article aims to address that gap. The results show that the k-Means-based OCL algorithm obtained the best classification performance in most experiments. In addition, dimensionality reduction, which is commonly applied in the literature, did not improve classification performance.
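
To make the idea of a k-Means-based OCL classifier concrete, the following is a minimal sketch and not the authors' implementation. It assumes raw term-frequency vectors, cosine similarity, a deterministic initialization from the first k training texts, and a decision threshold equal to the lowest similarity observed on the training texts. The function names (`bow`, `fit_kmeans_ocl`, `predict`) and the toy texts are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Raw term-frequency vector of a text, as a dict term -> count."""
    return dict(Counter(text.lower().split()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def fit_kmeans_ocl(docs, k=2, iters=10):
    """Cluster texts of the class of interest and derive a similarity threshold.

    Assumption: the threshold is the lowest best-centroid similarity among
    the training texts; other threshold choices are equally possible.
    """
    vectors = [bow(d) for d in docs]
    centroids = vectors[:k]  # assumed deterministic initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            best = max(range(k), key=lambda j: cosine(v, centroids[j]))
            clusters[best].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean_vector(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    threshold = min(max(cosine(v, c) for c in centroids) for v in vectors)
    return centroids, threshold


def predict(model, text):
    """True if the text is at least as close to a centroid as the threshold."""
    centroids, threshold = model
    return max(cosine(bow(text), c) for c in centroids) >= threshold


# Toy texts from a single class of interest (hypothetical data).
train = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the football team",
    "a goal in the final match",
]
model = fit_kmeans_ocl(train, k=2)
print(predict(model, "the football team scored a goal"))
print(predict(model, "central bank raises interest rates"))
```

Only texts from the class of interest are used for training, which is the defining property of OCL described above; a new text is accepted when it is similar enough to one of the learned centroids.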

Keywords: One class learning, text classification, text pre-processing

Published
2019-10-15
GÔLO, Marcos; MARCACINI, Ricardo; ROSSI, Rafael. An Extensive Empirical Evaluation of Preprocessing Techniques and Supervised One Class Learning Algorithms for Text Classification. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 16., 2019, Salvador. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 262-273. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2019.9289.