A clustering-based decision tree induction algorithm

  • Rodrigo C. Barros (USP)
  • Márcio P. Basgalupp (UNIFESP)
  • André C. P. L. F. de Carvalho (USP)
  • Marcos G. Quiles (UNIFESP)

Abstract

Decision tree induction algorithms are well-known techniques for assigning objects to predefined categories in a transparent fashion. Most decision tree induction algorithms rely on a greedy top-down recursive strategy for growing the tree, and on pruning techniques to avoid overfitting. Even though such a strategy has been quite successful in many problems, it falls short in several others. For instance, there are cases in which the hyper-rectangular surfaces generated by these algorithms can only map the problem description after several subsequent partitions, which results in a large and incomprehensible tree. Hence, we propose a new decision tree induction algorithm based on clustering that seeks to provide more accurate models and/or shorter descriptions that are comprehensible to the end user. We do not base our performance analysis solely on a straightforward comparison of our proposed algorithm to baseline methods. Instead, we propose a data-dependent analysis to look for evidence that may explain in which situations our algorithm outperforms a well-known decision tree induction algorithm.

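The abstract contrasts greedy axis-parallel splits with a clustering-driven alternative. The sketch below is a minimal illustration of that general idea, not a reconstruction of the authors' algorithm (the paper's actual procedure is not reproduced here): each internal node partitions its instances with plain 2-means (Lloyd's algorithm) and splits on nearest-centroid membership, recursing only while the split reduces class entropy. The names (`kmeans2`, `build`, `predict`) and parameter choices are hypothetical; NumPy is assumed.

```python
# Illustrative sketch only: a toy clustering-based tree, not the paper's method.
import numpy as np

def kmeans2(X, n_iter=20, seed=0):
    """Plain 2-means (Lloyd-style batch updates); returns centroids and labels."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid, then recompute centroids.
        labels = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for k in range(2):
            if np.any(labels == k):           # keep old centroid if cluster empties
                centroids[k] = X[labels == k].mean(axis=0)
    return centroids, labels

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def build(X, y, depth=0, max_depth=3, min_leaf=5):
    """Recursive induction; y holds non-negative integer class labels.
    Each internal node stores two centroids; an instance is routed to the
    child of its nearest centroid, so splits need not be axis-parallel."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    if depth == max_depth or len(y) < 2 * min_leaf or entropy(y) == 0.0:
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    centroids, labels = kmeans2(X)
    if np.unique(labels).size < 2:            # degenerate clustering: stop
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    gain = entropy(y) - sum(
        (labels == k).mean() * entropy(y[labels == k]) for k in range(2)
    )
    if gain <= 0.0:                           # split does not improve class purity
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    return {"leaf": False, "centroids": centroids,
            "children": [build(X[labels == k], y[labels == k],
                               depth + 1, max_depth, min_leaf) for k in range(2)]}

def predict(node, x):
    while not node["leaf"]:
        k = int(np.argmin(((node["centroids"] - np.asarray(x)) ** 2).sum(-1)))
        node = node["children"][k]
    return node["label"]

# Toy usage: two Gaussian blobs, one per class.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([4, 4], 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
tree = build(X, y)
print(predict(tree, [4.2, 3.8]))              # expected: 1
```

Because children are defined by nearest-centroid membership, the resulting decision boundaries are piecewise linear but not axis-parallel, which is precisely the kind of region a standard hyper-rectangular tree can only approximate with several consecutive splits.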
Published
19/07/2011
BARROS, Rodrigo C.; BASGALUPP, Márcio P.; CARVALHO, André C. P. L. F. de; QUILES, Marcos G. Um algoritmo de indução de árvore de decisão baseado em agrupamento. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 8., 2011, Natal/RN. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2011. p. 382-393. ISSN 2763-9061.