Confident Learning Techniques for cleaning data and improving the performance of Machine Learning Classifiers: a study case
Abstract
Model-Centric techniques, such as hyperparameter selection and regularization, are commonly used in the literature to improve the performance of Machine Learning classifiers. However, when a dataset contains uncertain data, Data-Centric approaches, which aim to systematically engineer the data to improve model performance, show good potential. In this work, Confident Learning (CL) techniques were applied to a case study of Species Distribution Modeling in the Amazon Basin, in which Machine Learning classifiers predict the probability of occurrence of a species given environmental conditions. Compared with Model-Centric methods, CL techniques yielded a 23% improvement in ROC-AUC for Logistic Regression.
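The abstract does not detail how CL cleans the data, so the following is only a minimal sketch of the per-class thresholding idea commonly associated with Confident Learning, not the procedure used in this study. It assumes out-of-sample predicted probabilities are already available for every example and that every class appears at least once among the labels. The function name `find_label_issues` and the toy data are hypothetical.

```python
def find_label_issues(labels, pred_probs):
    """Return indices whose given label disagrees with a confidently predicted class.

    labels     : list of int class labels (0 .. n_classes - 1), possibly noisy
    pred_probs : list of per-example probability lists, ideally out-of-sample
    Assumption: every class has at least one labeled example.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class k over the
    # examples that carry label k (the average self-confidence of the class).
    thresholds = []
    for k in range(n_classes):
        probs_k = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(probs_k) / len(probs_k))

    issues = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes whose probability reaches that class's own threshold.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # no class is confident, so the label is left alone
        likely = max(confident, key=lambda k: p[k])
        if likely != y:
            issues.append(i)
    return issues


# Toy data (hypothetical): 1 = presence, 0 = absence.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.20, 0.80],
    [0.10, 0.90],
    [0.30, 0.70],
    [0.85, 0.15],
]

issues = find_label_issues(labels, pred_probs)

# Drop the flagged examples before refitting a classifier on the cleaned data.
flagged = set(issues)
clean_labels = [y for i, y in enumerate(labels) if i not in flagged]
print(issues, clean_labels)
```

In practice the probabilities would come from cross-validated predictions of a classifier such as Logistic Regression, and the cleaned set would then be used for retraining and for comparison against a Model-Centric baseline.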
