SSL-VFC4.5: An approach to adapt Very Fast C4.5 classification algorithm to deal with semi-supervised learning

Carlos Eduardo Nass; Agustín Alejandro Ortíz Díaz; Fabiano Baldo

doi:10.5753/sbbd.2021.17862

Carlos Eduardo Nass, Universidade do Estado de Santa Catarina (UDESC)
Agustín Alejandro Ortíz Díaz, Universidade do Estado de Santa Catarina (UDESC), http://orcid.org/0000-0003-1133-9096
Fabiano Baldo, Universidade do Estado de Santa Catarina (UDESC), http://orcid.org/0000-0002-6452-1900

Abstract

The growing popularity of audio and video streaming, Industry 4.0, and IoT (Internet of Things) technologies contributes to the rapid growth in the generation of various types of data. To analyze these data for decision-making, supervised machine learning techniques need to be fast while maintaining suitable predictive performance, even in the many real-life scenarios where labeled data are expensive and hard to obtain. To address this problem, this work proposes an adaptation of the Very Fast C4.5 (VFC4.5) algorithm that incorporates a semi-supervised impurity metric presented in the literature. The results indicate that this adaptation can slightly increase the accuracy of VFC4.5 when the datasets contain very few labeled instances, but it increases the training time, especially as the number of labeled instances in the datasets increases.
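
To make the idea of a semi-supervised impurity metric concrete, the following is a minimal sketch and not the implementation evaluated in this work. The abstract does not state which metric is used, so the form below is an assumption: a weighted sum of the class entropy of the labeled instances and the mean per-feature variance of all instances, in the spirit of semi-supervised classification trees. The names `class_entropy`, `ssl_impurity`, and the weight `w` are hypothetical, and features are assumed to be numeric.

```python
import math
from collections import Counter
from statistics import pvariance


def class_entropy(labels):
    """Shannon entropy (in bits) of a list of known class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def ssl_impurity(rows, labels, w=0.5):
    """Hypothetical semi-supervised impurity of a tree node.

    rows   : non-empty list of numeric feature vectors (labeled and unlabeled)
    labels : one class label per row, or None when the row is unlabeled
    w      : assumed weight in [0, 1]; w = 1 ignores the unlabeled part
    """
    # Supervised term: entropy computed only over rows whose label is known.
    known = [y for y in labels if y is not None]
    supervised = class_entropy(known)

    # Unsupervised term: mean per-feature variance over ALL rows.
    n_features = len(rows[0])
    unsupervised = sum(
        pvariance([r[j] for r in rows]) for j in range(n_features)
    ) / n_features

    return w * supervised + (1 - w) * unsupervised


# Four instances, only two of them labeled.
rows = [[0.0], [0.0], [1.0], [1.0]]
labels = ["a", None, "b", None]
print(ssl_impurity(rows, labels, w=0.5))
```

With `w = 1` the sketch reduces to the ordinary entropy of the labeled instances, so the unlabeled data only influence the split choice when `w < 1`. Evaluating the extra variance term over every instance at each candidate split is also one plausible reason such a metric adds training time.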

Keywords: Machine learning, intrinsically semi-supervised classification, fast classification, impurity-based metric, top-down induction of decision trees
