A computational pipeline for species- and strain-level classification of metagenomic sequences

Arthur Henrique Barrios Solano; João Carlos Setubal

doi:10.5753/bsb.2024.245597

Arthur Henrique Barrios Solano USP http://orcid.org/0000-0001-5252-6491
João Carlos Setubal USP https://orcid.org/0000-0001-9174-2816


Abstract

We present a pipeline for exploring genomic diversity in metagenomic datasets at the species and strain levels. To achieve accurate classifications independent of taxonomy labels, we introduce the concept of a Genome Reference Set (GRS), modeled using the Maximal Independent Set problem for undirected graphs. For a user-defined target genus, we build its GRS from GenBank genomes and use it to classify metagenomic contigs with BLASTn. Additional phylogenetic processing allows the identification of putative novel species. We show that our pipeline can achieve better results than general-purpose tools, and we apply it to the MetaSUB dataset, identifying two putative novel strains and one putative new species of Acinetobacter.
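
To make the Maximal Independent Set idea concrete, here is a minimal Python sketch of how a non-redundant reference set could be selected from a genome similarity graph. It is not the authors' implementation. It assumes, purely for illustration, that two genomes are joined by an edge when a pairwise similarity score (for example ANI) meets a threshold, and it uses a simple greedy heuristic. The function names, the genome labels, the scores, and the 0.95 cutoff are all hypothetical.

```python
from itertools import combinations


def build_similarity_graph(genomes, similarity, threshold):
    """Connect two genomes when their similarity meets the threshold.

    `similarity` maps frozenset({a, b}) to a score; missing pairs count as 0.
    The edge rule is an assumption made for this sketch.
    """
    graph = {g: set() for g in genomes}
    for a, b in combinations(genomes, 2):
        if similarity.get(frozenset((a, b)), 0.0) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_maximal_independent_set(graph):
    """Return a maximal (not necessarily maximum) independent set.

    Nodes are visited by increasing degree, with ties broken by name so
    the result is deterministic. Selecting a node excludes its neighbors.
    """
    selected = set()
    excluded = set()
    for node in sorted(graph, key=lambda n: (len(graph[n]), n)):
        if node in excluded:
            continue
        selected.add(node)
        excluded.add(node)
        excluded.update(graph[node])
    return selected


# Hypothetical genomes and similarity scores, for illustration only.
genomes = ["A", "B", "C", "D"]
similarity = {
    frozenset(("A", "B")): 0.99,  # near-identical pair
    frozenset(("C", "D")): 0.97,  # near-identical pair
    frozenset(("A", "C")): 0.80,
    frozenset(("B", "D")): 0.80,
}

graph = build_similarity_graph(genomes, similarity, threshold=0.95)
reference_set = greedy_maximal_independent_set(graph)
print(sorted(reference_set))
```

In this sketch the selected genomes are pairwise dissimilar (no two share an edge), and every genome left out is similar to at least one selected genome, which is the property a maximal independent set guarantees. A greedy pass yields a maximal set, not necessarily the largest one.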
