New Relevance Measures for Lazy Attribute Selection

  • Douglas B. Pereira UFF
  • Alexandre Plastino UFF
  • Rafael B. Pereira UFF
  • Bianca Zadrozny IBM Research Brasil
  • Luiz Henrique de C. Merschmann UFOP
  • Alex A. Freitas University of Kent

Abstract


Attribute selection is a data preprocessing step that identifies the attributes relevant to a classification task. A lazy technique was recently proposed that postpones the choice of attributes until an instance is submitted for classification. The original proposal evaluated attribute quality with a measure based on entropy. In this work, we propose four new measures, based on the chi-square statistical test, Cramér's coefficient, the Gini index, and the gain ratio. Experimental results show the relevance of this proposal: for a large number of datasets, the lazy selection strategy achieved its best performance when the new measures were used.
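As a rough illustration of the four statistics named above, the following sketch computes their standard textbook forms for one nominal attribute against the class label. It is not the paper's method: the abstract does not specify how each measure is adapted to the lazy, per-instance setting, so the functions here score a whole attribute, and all function names are hypothetical.

```python
import math
from collections import Counter


def contingency(values, labels):
    """Build table[attribute_value][class_label] -> count."""
    table = {}
    for v, c in zip(values, labels):
        table.setdefault(v, Counter())[c] += 1
    return table


def _totals(table):
    """Return (n, per-value row totals, per-class column totals)."""
    row_totals = [sum(row.values()) for row in table.values()]
    class_totals = Counter()
    for row in table.values():
        class_totals.update(row)  # Counter.update adds counts
    return sum(row_totals), row_totals, class_totals


def entropy(counts):
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def gini(counts):
    total = sum(counts)
    return 1.0 - sum((n / total) ** 2 for n in counts)


def chi_square(table):
    """Pearson chi-square statistic of the value-by-class table."""
    n, _, class_totals = _totals(table)
    stat = 0.0
    for row in table.values():
        row_total = sum(row.values())
        for c, col_total in class_totals.items():
            expected = row_total * col_total / n
            stat += (row[c] - expected) ** 2 / expected  # row[c] is 0 if absent
    return stat


def cramers_v(table):
    """Cramer's coefficient: chi-square normalised to the range [0, 1]."""
    n, _, class_totals = _totals(table)
    k = min(len(table), len(class_totals)) - 1
    return math.sqrt(chi_square(table) / (n * k)) if k > 0 else 0.0


def gini_gain(table):
    """Reduction in Gini impurity obtained by splitting on the attribute."""
    n, row_totals, class_totals = _totals(table)
    weighted = sum(rt / n * gini(row.values())
                   for rt, row in zip(row_totals, table.values()))
    return gini(class_totals.values()) - weighted


def gain_ratio(table):
    """Information gain divided by the split information (Quinlan-style)."""
    n, row_totals, class_totals = _totals(table)
    weighted = sum(rt / n * entropy(row.values())
                   for rt, row in zip(row_totals, table.values()))
    info_gain = entropy(class_totals.values()) - weighted
    split_info = entropy(row_totals)
    return info_gain / split_info if split_info > 0 else 0.0
```

Under these definitions, an attribute that determines the class exactly scores the maximum on Cramér's coefficient and gain ratio, while an attribute independent of the class scores zero on all four.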

Published
2011-07-19
PEREIRA, Douglas B.; PLASTINO, Alexandre; PEREIRA, Rafael B.; ZADROZNY, Bianca; MERSCHMANN, Luiz Henrique de C.; FREITAS, Alex A. New Relevance Measures for Lazy Attribute Selection. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 8., 2011, Natal/RN. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2011. p. 536-547. ISSN 2763-9061.
