Impact of Balancing Strategies on the Splice Site Classification Problem

Cláudia G. Varassin; Alexandre Plastino; Bianca Zadrozny; Helena G. Leitão

Cláudia G. Varassin UFF
Alexandre Plastino UFF
Bianca Zadrozny IBM Research
Helena G. Leitão UFF

Abstract

Splice sites are the junction points between certain segments of eukaryotic genes. Detecting these sites in DNA is a highly imbalanced classification problem. To improve learning on this problem, two data resampling techniques designed for imbalanced classes are employed. The experimental results show that performance can be improved by training on sets whose imbalance factor differs from the one found in the original data.
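As an illustration of what training with a different imbalance factor can look like, the sketch below randomly undersamples the negative (majority) class until a chosen negatives-per-positive ratio is reached. The abstract does not name the two resampling techniques used in the paper, so random undersampling, the function name `undersample_negatives`, and the `ratio` parameter are assumptions made only for this example.

```python
import random


def undersample_negatives(examples, labels, ratio, seed=0):
    """Keep all positives and at most `ratio` negatives per positive.

    Hypothetical helper for illustration only; labels are 1 (splice site)
    or 0 (non-site).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Number of negatives to keep, capped by how many are available.
    n_keep = min(len(neg), int(ratio * len(pos)))
    kept = sorted(pos + rng.sample(neg, n_keep))  # preserve original order
    return [examples[i] for i in kept], [labels[i] for i in kept]


# Toy data: 5 positives and 95 negatives (imbalance factor 19:1).
X = list(range(100))
y = [1] * 5 + [0] * 95
X_bal, y_bal = undersample_negatives(X, y, ratio=4)
print(sum(y_bal), len(y_bal) - sum(y_bal))
```

Varying `ratio` over several values and evaluating each resulting classifier on an untouched test set is one way to explore how the training imbalance factor affects performance.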
