CellHeap: A Workflow for Optimizing COVID-19 Single-Cell RNA-Seq Data Processing in the Santos Dumont Supercomputer

  • Vanessa S. Silva (Fiocruz)
  • Maiana O. C. Costa (LNCC)
  • Maria Clicia S. Castro (UERJ)
  • Helena S. Silva (UnB)
  • Maria Emilia M. T. Walter (UnB)
  • Alba C. M. A. Melo (UnB)
  • Kary A. C. Ocaña (LNCC)
  • Marcelo T. dos Santos (LNCC)
  • Marisa F. Nicolas (LNCC)
  • Anna Cristina C. Carvalho (Fiocruz)
  • Andrea Henriques-Pons (Fiocruz)
  • Fabrício A. B. Silva (Fiocruz)

Abstract

Several hundred terabytes of COVID-19 single-cell RNA-seq (scRNA-seq) data are currently available in public repositories. These data span multiple tissues, comorbidities, and conditions. We expect this trend to continue, and it is realistic to predict that the volume of COVID-19 scRNA-seq data will grow to several petabytes in the coming years. However, careful analysis of these data requires large-scale computing infrastructure and software systems optimized for such platforms in order to generate biological knowledge. This paper presents CellHeap, a portable and robust workflow for customizable scRNA-seq analyses, with quality control throughout the execution steps, that can be deployed on supercomputers. We also present the deployment of CellHeap on the Santos Dumont supercomputer for analyzing COVID-19 scRNA-seq datasets and discuss a case study that processed dozens of terabytes of raw COVID-19 scRNA-seq data.
Keywords: Single-cell RNA-seq, Bioinformatics workflow, COVID-19, High-performance computing

Published
22 November 2021
SILVA, Vanessa S. et al. CellHeap: A Workflow for Optimizing COVID-19 Single-Cell RNA-Seq Data Processing in the Santos Dumont Supercomputer. In: SIMPÓSIO BRASILEIRO DE BIOINFORMÁTICA (BSB), 14., 2021, Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 41-52. ISSN 2316-1248.