Positive Unlabeled Learning: Adapting NMF for text classification

Lucas S. S. Nunes; Thiago de P. Faleiros; Rafael G. Rossi

doi:10.5753/eniac.2023.234337

Lucas S. S. Nunes, Universidade de Brasília
Thiago de P. Faleiros, Universidade de Brasília
Rafael G. Rossi, iFood

Abstract

Because data is now generated at a volume that exceeds human capacity to evaluate it, manually labeling data for training machine learning models is becoming increasingly impractical. This article analyzes techniques that address the challenges of Positive Unlabeled Learning (PUL). To this end, we propose structural adaptations to the Non-Negative Matrix Factorization (NMF) algorithm, specifically tailored for PU data (NMFPUL). We compare NMFPUL with state-of-the-art techniques to identify improvements in textual data classification performance. Our study reveals that NMFPUL consistently outperforms most baseline algorithms across diverse document collections, most notably when only a limited number of labeled documents is available.

Keywords: positive unlabeled learning, non-negative matrix factorization, text classification
