Fragmenting the DNA of Progressive Alignment Tools: An Efficient Metatool

Mario João Jr.; Alexandre C. Sena; Vinod E. F. Rebello

doi:10.5753/wscad.2023.235781

Mario João Jr. UERJ / UFF
Alexandre C. Sena UERJ
Vinod E. F. Rebello UFF

Abstract

Multiple Sequence Alignment of genetic sequences is essential to bioinformatics. Because the problem has exponential complexity, heuristics are used in practice. The most popular is Progressive Alignment, for which numerous tools have been developed over the years. However, no single tool always produces the best alignment, and none stands out from the rest. Scientists are therefore forced to choose and run more than one tool. Instead of developing a new heuristic, this work presents a metatool that evaluates new combinations of techniques extracted from other tools and coordinates their execution efficiently. The approach achieves superlinear speedups while maintaining, and sometimes improving, alignment quality.
