Refactoring a Bioinformatics Pipeline: A Case Study for Amplicon Analysis
Abstract
Bioinformatics pipelines are essential for processing the enormous volume of biological data now available. One approach to analyzing biological data is to implement scripts in a programming language such as Perl, Python, R, or Bash. However, such scripts can be hard for other developers to understand and maintain. In this work, we reimplement an amplicon-analysis pipeline, originally written in Perl, to make it scalable and portable, to simplify its code, and to parallelize its processes. To do so, we use the Nextflow workflow manager.
Keywords:
Bioinformatics, Nextflow, Pipeline
Published
July 18, 2021
How to Cite
BINI, Aline Mara Rudsit; OLIVEIRA, Liliane Santana; POLISELO, Heloisa; GUIMARÃES, Francismar Correa Marcelino; KASHIWABARA, André Yoashiaki. Refatorando um Pipeline de Bioinformática: Um Estudo de Caso para Análise de Amplicons. In: BRAZILIAN E-SCIENCE WORKSHOP (BRESCI), 15., 2021, Online Event. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 137-140. ISSN 2763-8774. DOI: https://doi.org/10.5753/bresci.2021.15799.