Performance analysis of algorithms for hybrid correction of genomic sequences in shared-memory and distributed-memory environments
Abstract
Genome analysis is an extensively researched area because it enables the study of diseases and the development of new treatments. For this purpose, researchers analyze genomes that have been assembled with computational tools. This work presents a performance analysis of a hybrid correction algorithm for genome sequences, a necessary stage of genome assembly. Seven versions of the algorithm were implemented and their performance was compared. The test results show performance gains of up to about 17 times over the sequential version, and the best version of the algorithm scales superlinearly.
