An Iterative Decision Tree Threshold Filter
Abstract
In this paper we propose and analyze a new filter for feature subset selection based on an iterative decision tree threshold method. The filter was evaluated on several biomedical or bioinformatics datasets for its data compression ability and its AUC (Area Under Curve) performance in three scenarios. On average, the filter compressed almost 50% of the data. In addition, for five different machine learning algorithms, AUC values obtained with the selected features showed no performance degradation compared with those obtained with all features.
References
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57:289–300.
Blum, A. L. and Langley, P. (1997). Selection of relevant features and examples in machine learning. Artificial Intelligence, 97(1–2):245–271.
Estévez, P., Tesmer, M., Perez, C., and Zurada, J. (2009). Normalized mutual information feature selection. IEEE Transactions on Neural Networks, 20(2):189–201.
Fayyad, U. M., Piatetsky-Shapiro, G., and Smyth, P. (1996). From Data Mining to Knowledge Discovery: An Overview, pages 1–30.
Foithong, S., Pinngern, O., and Attachoo, B. (2011). Feature subset selection wrapper based on mutual information and rough sets. Expert Systems with Applications.
Frank, A. and Asuncion, A. (2010). UCI machine learning repository.
Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11(1):86–92.
Gao, K., Khoshgoftaar, T., and Van Hulse, J. (2010). An evaluation of sampling on filter-based feature selection methods. In Proceedings of the 23rd International Florida Artificial Intelligence Research Society Conference, pages 416–421.
Han, J., Kamber, M., and Pei, J. (2011). Data mining: concepts and techniques. Morgan Kaufmann.
Institute, B. (2010). Cancer program data sets.
Kantardzic, M. (2011). Data mining: concepts, models, methods, and algorithms. Wiley-IEEE Press.
Lan, Y., Ren, H., Zhang, Y., Yu, H., and Zhao, X. (2011). A hybrid feature selection method using both filter and wrapper in mammography CAD. In Image Analysis and Signal Processing (IASP), 2011 International Conference on, pages 378–382. IEEE.
Larrañaga, P., Calvo, B., Santana, R., Bielza, C., Galdiano, J., Inza, I., Lozano, J., Armañanzas, R., Santafé, G., Pérez, A., et al. (2006). Machine learning in bioinformatics. Briefings in Bioinformatics, 7(1):86–112.
Min, H. and Fangfang, W. (2010). Filter-wrapper hybrid method on feature selection. In Intelligent Systems (GCIS), 2010 Second WRI Global Congress on, volume 3, pages 98–101. IEEE.
Netto, O., Nozawa, S., Mitrowsky, R., Macedo, A., Baranauskas, J., and Lins, C. (2010). Applying decision trees to gene expression data from DNA microarrays: A leukemia case study. In XXX Congress of the Brazilian Computer Society, X Workshop on Medical Informatics, page 10.
Oshiro, T. M., Perez, P. S., and Baranauskas, J. A. (2012). How many trees in a random forest? In Proceedings of the 8th International Conference on Machine Learning and Data Mining. Submitted.
Saeys, Y., Inza, I., and Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19):2507.
Witten, I. H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, Second Edition. Morgan Kaufmann.
Published
2012-07-16
How to Cite
NETTO, Oscar Picchi; BARANAUSKAS, José Augusto. An Iterative Decision Tree Threshold Filter. In: BRAZILIAN SYMPOSIUM ON COMPUTING APPLIED TO HEALTH (SBCAS), 12., 2012, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2012. p. 41-51. ISSN 2763-8952.
