Applying Decision Trees to Gene Expression Data from DNA Microarrays: A Leukemia Case Study

Oscar Picchi Netto, USP
Sérgio Ricardo Nozawa, Centro Universitário Nilton Lins
Rafael Andrés Rosales Mitrowsky, USP
Alessandra Alaniz Macedo, USP
José Augusto Baranauskas, USP

Abstract

Analyzing gene expression data is challenging because the number of features (genes) far exceeds the number of available examples, which makes classifiers prone to overfitting. To avoid this pitfall and achieve high performance, some approaches construct complex classifiers using new or well-established strategies. The main objective of this communication is to use decision trees to construct classifiers for microarray data that are human-readable as well as robust in performance. Using a well-known, publicly available leukemia dataset as a gene expression classification problem, we show that decision trees are feasible on microarray data. In summary, we obtained simple decision trees with performance comparable to that reported in related work.
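To make the setting concrete, the following is a minimal sketch of the idea, not the authors' actual method or data: it induces a one-level decision tree (a decision stump) by information gain on synthetic data with many more "genes" than samples, and estimates accuracy with leave-one-out cross-validation so that the estimate does not reward overfitting. The function names, the class labels "ALL" and "AML", the data sizes, and the single planted informative gene are all assumptions made for illustration.

```python
import math
import random


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def majority(labels):
    """Most frequent label; ties are broken alphabetically."""
    return max(sorted(set(labels)), key=labels.count)


def fit_stump(X, y):
    """Find the single (gene, threshold) test with the highest information gain.

    Returns (gene_index, threshold, label_if_low, label_if_high).
    """
    base = entropy(y)
    # Fallback: a constant predictor, used only if no gene can split the data.
    best = (0, float("inf"), majority(y), majority(y))
    best_gain = -1.0
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            gain = base - remainder
            if gain > best_gain:
                best_gain = gain
                best = (j, t, majority(left), majority(right))
    return best


def predict(stump, row):
    """Classify one sample with a fitted stump."""
    j, t, low_label, high_label = stump
    return low_label if row[j] <= t else high_label


def loocv_accuracy(X, y):
    """Leave-one-out cross-validation: refit on n-1 samples, test on the held-out one."""
    hits = 0
    for i in range(len(X)):
        stump = fit_stump(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(stump, X[i]) == y[i]
    return hits / len(X)


def make_data(n_samples=20, n_genes=100, informative=7, seed=0):
    """Synthetic stand-in for microarray data (an assumption, not real expression values).

    Every gene is Gaussian noise except one, which is shifted upward for class "AML".
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = "ALL" if i % 2 == 0 else "AML"
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        if label == "AML":
            row[informative] += 8.0
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_data()
    gene, threshold, low_label, high_label = fit_stump(X, y)
    # The fitted model reads as a single human-readable rule.
    print(f"if gene[{gene}] <= {threshold:.2f} then {low_label} else {high_label}")
    print(f"leave-one-out accuracy: {loocv_accuracy(X, y):.2f}")
```

The fitted model is a single rule on one gene, which is the kind of readability the abstract argues for. A full decision tree inducer would apply the same split search recursively and usually prune the result; that is omitted here to keep the sketch short.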

