CUDA-Parttree: A Multiple Sequence Alignment Parallel Strategy in GPU
Resumo
In this paper, we propose and evaluate CUDA-Parttree, a parallel strategy that executes the first phase of the MAFFT Parttree Multiple Sequence Alignment tool (distance matrix calculation with 6mers) on GPU. When compared to Parttree, CUDA-Parttree obtained a speedup of 6.10x on the distance matrix calculation for the Cyclodex gly tran (50, 280 sequences) set, reducing the execution time from 33.94s to 5.57s. Including data conversion and movement to/from the GPU, the speedup was 2.59x. With the sequence set Syn 100000 (100, 000 sequences), a speedup of 4.46x was attained, reducing execution time from 209.54s to 47.00s.
Referências
Dayhoff, M., Schwartz, R., and Orcutt, B. (1978). 22 A Model of Evolutionary Change in Proteins. In Atlas of protein sequence and structure, volume 5, pages 345–352. National Biomedical Research Foundation Silver Spring, MD.
Deorowicz, S., Debudaj-Grabysz, A., and Gudyś, A. (2016). FAMSA: Fast and accurate multiple sequence alignment of huge protein families. Scientific reports, 6.
Finn, R. D. et al. (2006). Pfam: clans, web tools and services. Nucleic acids research, 34(suppl 1):D247–D251.
Ghosh, P., Krishnamoorthy, S., and Kalyanaraman, A. (2019). Pakman: Scalable assembly of large genomes on distributed memory machines. bioRxiv.
González-Domı́nguez, J., Liu, Y., Touriño, J., and Schmidt, B. (2016). MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems. Bioinformatics, 32(24):3826–3828.
Gudys, A. (2016). extHomFam benchmark.
Hirschberg, D. S. (1977). Algorithms for the longest common subsequence problem. Journal of the ACM (JACM), 24(4):664–675.
Hung, C.-L. et al. (2015). CUDA ClustalW: An efficient parallel algorithm for progressive multiple sequence alignment on Multi-GPUs. Computational biology and chemistry, 58:62–68.
Katoh, K. et al. (2002). MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic acids research, 30(14):3059–3066.
Katoh, K. and Toh, H. (2006). PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics, 23(3):372–374.
Lamnidis, T. C. et al. (2018). Ancient fennoscandian genomes reveal origin and spread of siberian ancestry in europe. Nature communications, 9(1):5018.
Mi, H., Muruganujan, A., and Thomas, P. D. (2012). Panther in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic acids research, 41(D1):D377–D386.
Mount, D. (2013). Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, 2nd edition.
Nam-phuong, D. N. et al. (2015). Ultra-large alignments using phylogeny-aware profiles. Genome biology, 16(1):124.
Smith, T. F., Waterman, M. S., et al. (1981). Identification of common molecular subsequences. Journal of molecular biology, 147(1):195–197.
Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research, 22(22):4673–4680.
Wang, L. and Jiang, T. (1994). On the complexity of multiple sequence alignment. Journal of computational biology, 1(4):337–348.
Zhu, X. et al. (2015). Parallel implementation of MAFFT on CUDA-enabled graphics hardware. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 12(1):205–218.