Analysis of training set composition for evaluating machine learning applied to gene prediction

  • Raíssa da Silva UFPA
  • Kleber Padovani UFPA
  • Wendel Santos UFPA
  • Roberto Xavier UFPA
  • Ronnie Alves UFPA / ITV

Abstract


Metagenomics allows the study of microbial communities, known as metagenomes, by describing their composition and the relationships and activities of the microorganisms that coexist in them. This provides deeper knowledge of the fundamentals of life and of the broad microbiological diversity, which is still poorly known. Such a description can be obtained by analyzing the genes contained in (meta)genomes, which are extracted by identifying genes in DNA sequences, a process called gene prediction. This work presents a study of how the composition of the training set affects machine learning applied to the prediction of protein-coding genes.

Keywords: Gene Identification, Regulation and Expression Analysis

Published
2018-10-30
DA SILVA, Raíssa; PADOVANI, Kleber; SANTOS, Wendel; XAVIER, Roberto; ALVES, Ronnie. Análise de composição de conjunto de treinamento para avaliação de aprendizagem de máquina aplicada à predição de genes. In: SHORT PAPERS - BRAZILIAN SYMPOSIUM ON BIOINFORMATICS (BSB), 2018, Niterói. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 13-18. DOI: https://doi.org/10.5753/bsb_estendido.2018.8798.