Stacking Bagged and Boosted Forests for Classification of Noisy and High-Dimensional Data
Abstract
Random Forests (RF) are one of the most successful strategies for automated classification tasks. Motivated by this success, recently proposed RF-based classification approaches leverage the central RF idea of aggregating a large number of low-correlated trees, which are inherently parallelizable and provide exceptional generalization capabilities. In this context, this work brings several new contributions to this line of research. First, we propose a new RF-based strategy (BERT) that applies the boosting technique to bags of extremely randomized trees. Second, we empirically demonstrate that this new strategy and the recently proposed BROOF and LazyNN RF classifiers complement each other, motivating us to stack them to produce an even more effective classifier. To the best of our knowledge, this is the first strategy to effectively combine the three main ensemble techniques: stacking, bagging (the cornerstone of RFs), and boosting. Finally, we exploit an efficient and unbiased stacking strategy based on out-of-bag (OOB) samples to considerably speed up the very costly training process of the stacking procedure. Our experiments on several datasets covering two high-dimensional and noisy domains, topic and sentiment classification, provide strong evidence of the benefits of our RF-based solutions. We show that BERT is among the top performers in the vast majority of analyzed cases, while retaining the unique benefits of RF classifiers: explainability, parallelization, and ease of parameterization.
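To make the BERT idea concrete, the following minimal Python sketch illustrates boosting over bags of extremely randomized trees. This is our illustration under stated assumptions, not the authors' published algorithm: scikit-learn's ExtraTreesClassifier stands in for each bag, the helper names fit_bert_like and predict_bert_like are hypothetical, and the boosting error is estimated on each bag's out-of-bag samples, following the BROOF idea (Salles et al., 2015).

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def fit_bert_like(X, y, n_rounds=5, trees_per_bag=100, seed=42):
    y = np.asarray(y)
    n = X.shape[0]
    w = np.full(n, 1.0 / n)              # uniform instance weights to start
    rng = np.random.RandomState(seed)
    bags, alphas = [], []
    for _ in range(n_rounds):
        # One "weak learner" is an entire bag of extremely randomized trees;
        # bootstrap=True enables out-of-bag (OOB) estimates.
        bag = ExtraTreesClassifier(
            n_estimators=trees_per_bag, bootstrap=True, oob_score=True,
            random_state=rng.randint(2**31 - 1), n_jobs=-1,
        ).fit(X, y, sample_weight=w)
        # OOB prediction: each instance is judged only by trees whose
        # bootstrap sample excluded it (assumes enough trees so that every
        # instance is out-of-bag at least once).
        oob_pred = bag.classes_[np.argmax(bag.oob_decision_function_, axis=1)]
        miss = oob_pred != y
        err = np.clip(w[miss].sum() / w.sum(), 1e-10, 1 - 1e-10)
        alpha = np.log((1.0 - err) / err)  # two-class AdaBoost weight;
                                           # SAMME adds log(K - 1) for K classes
        w *= np.exp(alpha * miss)          # up-weight OOB-misclassified points
        w /= w.sum()
        bags.append(bag)
        alphas.append(alpha)
    return bags, alphas

def predict_bert_like(bags, alphas, X):
    # Final decision: alpha-weighted vote over the boosted bags.
    votes = sum(a * b.predict_proba(X) for a, b in zip(bags, alphas))
    return bags[0].classes_[np.argmax(votes, axis=1)]
```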
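The OOB-based stacking shortcut can be sketched in the same spirit. In this simplified illustration, stock scikit-learn forests stand in for BERT, BROOF, and LazyNN RF, and the helper names are hypothetical; the point is that the meta-level training set is built from each forest's out-of-bag class probabilities, which are unbiased and come for free from a single training pass, avoiding the k-fold refitting of classical stacked generalization (Wolpert, 1992).

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def fit_oob_stacking(X, y):
    # Stand-ins for the stacked base learners; each must expose OOB estimates.
    base = [
        RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1),
        ExtraTreesClassifier(n_estimators=200, bootstrap=True,
                             oob_score=True, n_jobs=-1),
    ]
    for model in base:
        model.fit(X, y)
    # Meta-features: OOB class probabilities. Each row is predicted only by
    # trees that never saw it during training, so the meta-level training set
    # is unbiased without any cross-validation refitting.
    meta_X = np.hstack([m.oob_decision_function_ for m in base])
    meta = LogisticRegression(max_iter=1000).fit(meta_X, y)
    return base, meta

def predict_oob_stacking(base, meta, X):
    # At test time the full forests are available, so plain probabilities
    # replace the OOB estimates.
    meta_X = np.hstack([m.predict_proba(X) for m in base])
    return meta.predict(meta_X)
```

Under these assumptions, building the meta-learner's training set requires a single fit of each base classifier rather than k refits per classifier, which is the source of the speedup claimed above.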
References
Campos, R. R. and Gonçalves, M. A. (2016). BERT: Melhorando classificação de texto com árvores extremamente aleatórias, bagging e boosting [BERT: Improving text classification with extremely randomized trees, bagging, and boosting]. In Proceedings of the 31st Brazilian Symposium on Databases (SBBD 2016).
Campos, R. R., Gonçalves, M. A., and Salles, T. (2016). Quando a Amazônia encontra a Mata Atlântica: Empilhamento de florestas para classificação efetiva de texto [When the Amazon meets the Atlantic Forest: Stacking forests for effective text classification]. In Proceedings of the 4th Symposium on Knowledge Discovery, Mining and Learning (KDMiLe).
Dong, Y.-S. and Han, K.-S. (2004). A comparison of several ensemble methods for text categorization. In Proceedings of the IEEE International Conference on Services Computing (SCC 2004), pages 419–422.
Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res., 15(1):3133–3181.
Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Mach. Learn., 63(1):3–42.
Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning. Springer.
Kuncheva, L. I. and Whitaker, C. J. (2003). Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach. Learn., 51(2):181–207.
Salles, T., Gonçalves, M., and Rocha, L. (2017). Random forest based classifiers for classification tasks with noisy data. PhD dissertation, Federal University of Minas Gerais.
Salles, T., Gonçalves, M., Rodrigues, V., and Rocha, L. (2015). BROOF: Exploiting out-of-bag errors, boosting and random forests for effective automated classification. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15).
Segal, M. R. (2004). Machine learning benchmarks and random forest regression. Technical report, University of California.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2):241–259.
