Ensemble Algorithm for Data Stream Classification with Imbalanced Classes and Concept Drift
Abstract
With the exponential growth in data generation observed in recent decades, performing classification tasks on such data poses several challenges. These datasets are sometimes imbalanced in terms of their classes, and the way classes are formed may change over time, a phenomenon called concept drift. Among the algorithms designed to address these problems, the Kappa Updated Ensemble (KUE) has shown good performance on data streams with concept drift. As its original formulation was not designed for imbalanced classes, this paper proposes modifications to KUE to make it more robust and better suited to imbalanced datasets. In numerical experiments on eight datasets with different imbalance rates, the modified KUE outperformed the original version on five datasets and yielded statistically equivalent performance on the remaining three. These results are promising and motivate further development of this approach.
References
Brzezinski, D. and Stefanowski, J. (2013a). Classifiers for concept-drifting data streams: evaluating things that really matter. In ECML PKDD 2013 Workshop on Real-World Challenges for Data Stream Mining, September 27th, Prague, Czech Republic, pages 10-14. Citeseer.
Brzezinski, D. and Stefanowski, J. (2013b). Reacting to different types of concept drift: The accuracy updated ensemble algorithm. IEEE Transactions on Neural Networks and Learning Systems, 25(1):81-94.
Cano, A. and Krawczyk, B. (2020). Kappa updated ensemble for drifting data stream mining. Machine Learning, 109(1):175-218.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37-46.
Gaber, M. M. (2012). Advances in data stream mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1):79-85.
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4):1-37.
Gomes, H. M., Bifet, A., Read, J., Barddal, J. P., Enembreck, F., Pfahringer, B., Holmes, G., and Abdessalem, T. (2017). Adaptive random forests for evolving data stream classification. Machine Learning, 106(9):1469-1495.
Han, H., Wang, W.-Y., and Mao, B.-H. (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pages 878-887. Springer.
Hansen, L. K. and Salamon, P. (1990). Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(10):993-1001.
Hulten, G., Spencer, L., and Domingos, P. (2001). Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97-106.
Kolter, J. Z. and Maloof, M. A. (2007). Dynamic weighted majority: An ensemble method for drifting concepts. The Journal of Machine Learning Research, 8:2755-2790.
Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4):221-232.
Krawczyk, B., Minku, L. L., Gama, J., Stefanowski, J., and Woźniak, M. (2017). Ensemble learning for data stream analysis: A survey. Information Fusion, 37:132-156.
Kubat, M., Matwin, S., et al. (1997). Addressing the curse of imbalanced training sets: one-sided selection. In ICML, volume 97, pages 179-186. Citeseer.
Manapragada, C., Webb, G. I., and Salehi, M. (2018). Extremely fast decision tree. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1953-1962.
Pesaranghader, A., Viktor, H., and Paquet, E. (2018). Reservoir of diverse adaptive learners and stacking fast Hoeffding drift detection methods for evolving data streams. Machine Learning, 107(11):1711-1743.
Pietruczuk, L., Rutkowski, L., Jaworski, M., and Duda, P. (2017). How to adjust an ensemble size in stream data mining? Information Sciences, 381:46-54.
Ren, S., Liao, B., Zhu, W., and Li, K. (2018). Knowledge-maximized ensemble algorithm for different types of concept drift. Information Sciences, 430:261-281.
Tan, P.-N., Steinbach, M., and Kumar, V. (2009). Introdução ao datamining: mineração de dados. Ciência Moderna.
Webb, G. I., Hyde, R., Cao, H., Nguyen, H. L., and Petitjean, F. (2016). Characterizing concept drift. Data Mining and Knowledge Discovery, 30(4):964-994.
Weiss, G. M. (2004). Mining with rarity: a unifying framework. ACM SIGKDD Explorations Newsletter, 6(1):7-19.
Zhai, T., Gao, Y., Wang, H., and Cao, L. (2017). Classification of high-dimensional evolving data streams via a resource-efficient online ensemble. Data Mining and Knowledge Discovery, 31(5):1242-1265.
Zhang, L., Lin, J., and Karim, R. (2016). Sliding window-based fault detection from high-dimensional data streams. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 47(2):289-303.
