B-Boost: An Extension of the Boosting Method for Imbalanced Training Sets

  • Joseane P. Rodrigues (UFPE)
  • Ricardo B. C. Prudêncio (UFPE)
  • Flávia A. Barros (UFPE)

Abstract


Boosting methods have been successful on a broad range of classification problems and are among the most investigated approaches in the literature on classifier ensembles. Despite its potential performance gains, Boosting has limitations when dealing with imbalanced training sets, i.e. sets in which some classes are much larger than the others. In this context, we propose the B-Boost method, an extension of Boosting for imbalanced training sets. Unlike standard Boosting, B-Boost performs, at each iteration, a sampling of training examples separately per class. This sampling is performed so as to generate a balanced training set containing, for each class, the instances that are currently hard to classify correctly. Experiments were performed comparing the proposed method to standard Boosting. The results revealed that B-Boost can improve classification performance on the minority classes, which is an important aspect in several application contexts.
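The per-class sampling idea described above can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: it assumes AdaBoost-style per-instance weights (higher weight = harder to classify at the current iteration), and all function and variable names are invented for the example.

```python
import random

def per_class_balanced_sample(X, y, weights, rng=None):
    """Illustrative sketch: draw, for every class, the same number of
    instances (the minority-class size), with probability proportional
    to each instance's current boosting weight, so that harder examples
    are favored and the resulting training set is class-balanced."""
    rng = rng or random.Random(0)
    classes = sorted(set(y))
    # Group instance indices by class label.
    by_class = {c: [i for i, label in enumerate(y) if label == c]
                for c in classes}
    # Every class contributes as many instances as the smallest class.
    n_per_class = min(len(idx) for idx in by_class.values())
    chosen = []
    for c in classes:
        idx = by_class[c]
        w = [weights[i] for i in idx]
        # Weighted sampling with replacement within this class only.
        chosen.extend(rng.choices(idx, weights=w, k=n_per_class))
    return [X[i] for i in chosen], [y[i] for i in chosen]
```

A base classifier would then be trained on the returned balanced sample at each boosting iteration, instead of on a weighted sample drawn from the full (imbalanced) training set as in standard Boosting.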

Published
2009-07-20
RODRIGUES, Joseane P.; PRUDÊNCIO, Ricardo B. C.; BARROS, Flávia A. B-Boost: An Extension of the Boosting Method for Imbalanced Training Sets. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 7., 2009, Bento Gonçalves/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2009. p. 412-421. ISSN 2763-9061.
