Application of Confident Learning Techniques for Data Cleaning and Performance Improvement of Machine Learning Classifiers: A Case Study

Renato O. Miyaji; Felipe V. de Almeida; Pedro L. P. Corrêa

doi:10.5753/sbbd.2023.232175

Renato O. Miyaji Universidade de São Paulo http://orcid.org/0000-0002-7279-4546
Felipe V. de Almeida Universidade de São Paulo
Pedro L. P. Corrêa Universidade de São Paulo


Abstract

Model-centric techniques, such as hyperparameter optimization and regularization, are commonly used in the literature to improve the performance of Machine Learning classifiers. However, when a dataset contains uncertainty, data-centric approaches show good potential. Accordingly, Confident Learning (CL) techniques were applied to a case study of Species Distribution Modeling in the Amazon, in which classifiers estimate the probability of occurrence of a species from environmental conditions. Compared with model-centric methods, the CL techniques yielded a 23% improvement in ROC-AUC for Logistic Regression.

Keywords: Data-centric Applications, Machine Learning, Artificial Intelligence
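
To make the data-cleaning idea in the abstract concrete, the following is a minimal sketch of the core Confident Learning step: flagging examples whose given label disagrees with a confidently predicted class. It is an illustration under stated assumptions, not the pipeline used in the study. The function names (`class_thresholds`, `find_label_issues`) and the toy presence/absence data are hypothetical, and the predicted probabilities are assumed to come from out-of-sample predictions of some classifier (for example, Logistic Regression with cross-validation).

```python
from typing import List, Sequence


def class_thresholds(labels: Sequence[int],
                     pred_probs: Sequence[Sequence[float]],
                     n_classes: int) -> List[float]:
    """Per-class threshold t_j: mean predicted probability of class j
    over the examples whose given label is j."""
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    # A class with no labeled examples can never be predicted confidently.
    return [s / c if c else float("inf") for s, c in zip(sums, counts)]


def find_label_issues(labels: Sequence[int],
                      pred_probs: Sequence[Sequence[float]]) -> List[int]:
    """Return indices of examples whose given label differs from the
    class they are confidently predicted to belong to."""
    n_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs, n_classes)
    issues = []
    for idx, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes whose predicted probability reaches the class threshold.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # no confident prediction: leave the example alone
        likely = max(confident, key=lambda j: p[j])
        if likely != y:
            issues.append(idx)
    return issues


# Hypothetical toy data: 0 = absence, 1 = presence of a species.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled absence, but confidently predicted as presence
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]

suspects = find_label_issues(labels, pred_probs)
cleaned = [(y, p) for i, (y, p) in enumerate(zip(labels, pred_probs))
           if i not in suspects]
print(suspects)
```

In a data-centric workflow, the flagged examples would be removed (or relabeled) and the classifier retrained on the cleaned set, which can then be compared against model-centric alternatives such as hyperparameter tuning.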
