Sample Bias Effect on Meta-Learning
Abstract
Sample bias is a common issue in traditional machine learning studies but is rarely considered when discussing meta-learning. It occurs when the training sample under-represents or over-represents one or more characteristics relative to others. Models trained on such data may therefore be inaccurate for some instances. In most of the meta-learning literature, meta-models are built from a random sample of datasets, yet there is no discussion of the bias that such uncontrolled random sampling may introduce as a side effect. This work analyzes these effects in the meta-learning context, both to discuss their consequences and to start a debate on the need to account for them in meta-learning research.
References
Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
Comi, M. (2018). Is artificial intelligence racist? (and other concerns), https://towardsdatascience.com/is-artificial-intelligence-racist-and-other-concerns817fa60d75e9.
Dua, D. and Graff, C. (2017). UCI machine learning repository, http://archive.ics.uci.edu/ml.
Fernández, A., García, S., Galar, M., Prati, R. C., Krawczyk, B., and Herrera, F. (2018). Learning from imbalanced data sets. Springer.
Feurer, M., van Rijn, J. N., Kadra, A., Gijsbers, P., Mallik, N., Ravi, S., Müller, A., Vanschoren, J., and Hutter, F. (2019). OpenML-Python: an extensible Python API for OpenML. arXiv preprint arXiv:1911.02490.
Garcia, L. P., Lorena, A. C., de Souto, M. C., and Ho, T. K. (2018). Classifier recommendation using data complexity measures. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 874–879. IEEE.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to statistical learning: with applications in R. Springer.
Ho, T. K. (1995). Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition, volume 1, pages 278–282. IEEE.
Figure 7. Sparsity proportion over random samples.
Figure 8. IR proportion over random samples.
Ho, T. K. and Basu, M. (2002). Complexity measures of supervised classification problems. IEEE Trans. on pattern analysis and machine intelligence, 24(3):289–300.
Huang, J., Gretton, A., Borgwardt, K., Schölkopf, B., and Smola, A. J. (2007). Correcting sample selection bias by unlabeled data. In Advances in NIPS, pages 601–608.
Hutter, F., Kotthoff, L., and Vanschoren, J. (2019). Automated machine learning: methods, systems, challenges. Springer Nature.
Liu, A. and Ziebart, B. (2014). Robust classification under sample selection bias. In Advances in neural information processing systems, pages 37–45.
Lorena, A. C., De Carvalho, A. C., and Gama, J. M. (2008). A review on the combination of binary classifiers in multiclass problems. Artificial Intelligence Review, 30(1-4):19.
Macià, N. and Bernadó-Mansilla, E. (2014). Towards UCI+: A mindful repository design. Information Sciences, 261:237–262.
Figure 9. Number of classes proportion over random samples.
Maloney, C. (2017). Weapons of math destruction: How big data increases inequality and threatens democracy. Journal of Markets & Morality, 20(1):194–197.
Muñoz, M. A., Villanova, L., Baatar, D., and Smith-Miles, K. (2018). Instance spaces for machine learning classification. Machine Learning, 107(1):109–147.
Vanschoren, J. (2019). Meta-learning. In Automated Machine Learning, pages 35–61. Springer, Cham.
Vanschoren, J., Van Rijn, J. N., Bischl, B., and Torgo, L. (2014). OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60.
Wallis, J. (2018). Is artificial intelligence sexist?, https://www.theglobeandmail.com/business/careers/leadership/article-is-artificialintelligence-sexist/.
Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In Proceedings of the 21st international conference on Machine learning, page 114.