An Iterative Decision Tree Threshold Filter

  • Oscar Picchi Netto (USP)
  • José Augusto Baranauskas (USP)

Abstract

This work proposes and analyzes a new feature selection filter based on an iterative method with decision trees. Using several biomedical datasets, the filter was evaluated in terms of compression capacity and AUC (Area Under the Curve) in three scenarios. On average, the filter compressed the data by 50%. Comparing the AUC values obtained with all attributes against those obtained with only the selected attributes showed no significant loss of performance in any of the five machine learning algorithms tested.
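
The abstract does not spell out the iteration, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that at each step a tree is induced on the remaining attributes, the attributes it uses are kept if the tree's score reaches a threshold, those attributes are removed, and the loop stops once the score falls below the threshold. To stay self-contained it swaps in a depth-1 tree (a decision stump) and training accuracy in place of a full decision tree and AUC. The names `fit_stump`, `iterative_tree_filter`, `compression`, the threshold value, and the toy data are all hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]


def fit_stump(X: List[Row], y: List[int], candidates: List[int]) -> Tuple[int, float]:
    """Fit a depth-1 tree over the candidate attributes.

    Returns (attribute index used by the stump, training accuracy).
    Assumption: accuracy stands in for the AUC used in the paper.
    """
    best_attr, best_acc = candidates[0], 0.0
    n = len(y)
    for j in candidates:
        for t in sorted({row[j] for row in X}):
            # Rule "predict 1 when value > t"; n - hits is the flipped rule.
            hits = sum((row[j] > t) == bool(label) for row, label in zip(X, y))
            acc = max(hits, n - hits) / n
            if acc > best_acc:
                best_attr, best_acc = j, acc
    return best_attr, best_acc


def iterative_tree_filter(X: List[Row], y: List[int], threshold: float) -> List[int]:
    """Collect attributes from successive stumps until the score drops below threshold."""
    remaining = list(range(len(X[0])))
    selected: List[int] = []
    while remaining:
        attr, score = fit_stump(X, y, remaining)
        if score < threshold:
            break  # the remaining attributes no longer yield a good enough tree
        selected.append(attr)
        remaining.remove(attr)  # force the next tree to use other attributes
    return selected


def compression(n_total: int, n_selected: int) -> float:
    """Fraction of attributes discarded by the filter."""
    return 1.0 - n_selected / n_total


# Toy data (hypothetical): attribute 0 separates the classes perfectly,
# attribute 1 separates 5 of 6 examples, attributes 2 and 3 are uninformative.
X = [
    [1, 1, 5, 0],
    [2, 2, 1, 0],
    [3, 6, 4, 0],
    [7, 5, 2, 0],
    [8, 7, 6, 0],
    [9, 8, 3, 0],
]
y = [0, 0, 0, 1, 1, 1]

selected = iterative_tree_filter(X, y, threshold=0.8)
print(selected, compression(len(X[0]), len(selected)))
```

On this toy input the loop keeps attributes 0 and 1 and stops at the third iteration, when the best remaining stump scores below the threshold. The compression figure it prints describes only this toy data, not the results reported in the paper.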

Published
16 July 2012
NETTO, Oscar Picchi; BARANAUSKAS, José Augusto. An Iterative Decision Tree Threshold Filter. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 12., 2012, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2012. p. 41-51. ISSN 2763-8952.