A Study of Data Cleaning in an Imbalanced Database with Class Overlap
Abstract
This paper presents an ongoing study of classifier degradation on imbalanced databases. The central hypothesis is that applying conventional data sampling techniques in this domain may not improve performance, probably because their use increases class overlap or because the overlap is inherent to the data. To this end, a new approach to the problem is presented, based on unsupervised learning. The results obtained so far with C-clear, the method developed under this approach, are not yet conclusive, but they indicate that the approach is promising.
Published
30/06/2007
How to Cite
MACHADO, Emerson Lopes; LADEIRA, Marcelo. Um Estudo de Limpeza em Base de Dados Desbalanceada com Sobreposição de Classes. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 6., 2007, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2007. p. 1499-1508. ISSN 2763-9061.
