Impact of Balancing Strategies on the Splice Site Classification Problem

  • Cláudia G. Varassin UFF
  • Alexandre Plastino UFF
  • Bianca Zadrozny IBM Research
  • Helena G. Leitão UFF

Abstract


Splice sites are the boundaries between certain stretches in eukaryotic genes. Detecting such sites in DNA is a highly imbalanced classification task. To improve learning on this problem, two existing resampling techniques designed to deal with this kind of imbalance are applied. The experimental results show that it is possible to increase classification performance by using training sets whose imbalance factor differs from the naturally occurring one.
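As an illustration of what changing the imbalance factor of a training set can look like, the following is a minimal sketch of random undersampling. It is not the authors' procedure: the abstract does not name the two resampling techniques, and the function `undersample`, its `ratio` parameter (negatives kept per positive), and the toy data are hypothetical names chosen for this example.

```python
import random

def undersample(examples, labels, ratio, seed=0):
    """Randomly drop negatives so that at most `ratio` negatives remain per positive.

    Hypothetical helper for illustration; labels are 1 (splice site) or 0 (non-site).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Keep every positive and a random subset of the negatives.
    keep_neg = rng.sample(neg, min(len(neg), int(ratio * len(pos))))
    keep = sorted(pos + keep_neg)
    return [examples[i] for i in keep], [labels[i] for i in keep]

# Toy data (made up): 5 positives and 95 negatives, i.e. 19 negatives per positive.
labels = [1] * 5 + [0] * 95
examples = [f"seq{i}" for i in range(100)]

# Resample the training set to 4 negatives per positive.
X, y = undersample(examples, labels, ratio=4)
```

Only the training set would be resampled in this way; an evaluation set would normally keep the naturally occurring class distribution so that the measured performance reflects the real task.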

Published
2012-07-16
VARASSIN, Cláudia G.; PLASTINO, Alexandre; ZADROZNY, Bianca; LEITÃO, Helena G. Impact of Balancing Strategies on the Splice Site Classification Problem. In: BRAZILIAN E-SCIENCE WORKSHOP (BRESCI), 6., 2012, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2012. p. 24-31. ISSN 2763-8774.