A Classification of de Bruijn Graph Approaches for De Novo Fragment Assembly

de Armas, Elvismary Molina; Holanda, Maristela; de Oliveira, Daniel; Almeida, Nalvo F.; Lifschitz, Sérgio

doi:10.1007/978-3-030-65775-8_1

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 12558))

Included in the following conference series:

Brazilian Symposium on Bioinformatics

513 Accesses

Abstract

Research in bioinformatics has changed rapidly since the advent of next-generation sequencing (NGS). Despite the positive impact on cost reduction, assembling the generated reads remains a challenge. This paper presents in detail the main ideas related to de novo assembly, the technologies involved, and theoretical concepts about the de Bruijn graph structure. We also explain the existing approaches to minimize the memory requirements for de Bruijn graph construction. Finally, we propose a comparative view of several solutions, including the k-mers codification and the data structures used to represent and persist them.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Bankevich, A., et al.: Spades: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol.: J. Comput. Mol. Cell Biol. 19(5), 455–477 (2012). https://doi.org/10.1089/cmb.2012.0021. https://pubmed.ncbi.nlm.nih.gov/22506599
Article CAS Google Scholar
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970). https://doi.org/10.1145/362686.362692
Article Google Scholar
Boucher, C., Bowe, A., Gagie, T., Puglisi, S., Sadakane, K.: Variable-order de Bruijn graphs. In: Data Compression Conference Proceedings 2015 (2014). https://doi.org/10.1109/DCC.2015.70
Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn graphs. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 225–235. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_18
Chapter Google Scholar
Butler, J.: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18(5), 810–820 (2008). https://doi.org/10.1101/gr.7337908
Article CAS PubMed PubMed Central Google Scholar
Chapman, J.A., Ho, I., Sunkara, S., Luo, S., Schroth, G.P., Rokhsar, D.S.: Meraculous: de novo genome assembly with short paired-end reads. PLoS ONE 6(8), e23501 (2011). https://doi.org/10.1371/journal.pone.0023501
Article CAS PubMed PubMed Central Google Scholar
Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de Bruijn graphs. In: Sharan, R. (ed.) RECOMB 2014. LNCS, vol. 8394, pp. 35–55. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-05269-4_4
Chapter Google Scholar
Chikhi, R., Limasset, A., Medvedev, P.: Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32(12), i201 (2016). https://doi.org/10.1093/bioinformatics/btw279
Article CAS PubMed PubMed Central Google Scholar
Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol. 8(1), 22 (2013). https://doi.org/10.1186/1748-7188-8-22
Article CAS PubMed PubMed Central Google Scholar
Claros, M.G., Bautista, R., Guerrero-Fernández, D., Benzerki, H., Seoane, P., Fernández-Pozo, N.: Why assembling plant genome sequences is so challenging. Biology 1(2), 439 (2012). https://doi.org/10.3390/biology1020439
Article PubMed PubMed Central Google Scholar
Conway, T.C., Bromage, A.J.: Succinct data structures for assembling large genomes. Bioinformatics 27(4), 479–486 (2011). https://doi.org/10.1093/bioinformatics/btq697
Article CAS PubMed Google Scholar
de Armas, E.M., Castro, L.C., Holanda, M., Lifschitz, S.: A new approach for de Bruijn graph construction in de novo genome assembling. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine, pp. 1842–1849 (2019)
Google Scholar
de Armas, E.M., Ferreira, P.C.G., Haeusler, E.H., de Holanda, M.T., Lifschitz, S.: K-mer mapping and RDBMS indexes. In: Kowada, L., de Oliveira, D. (eds.) BSB 2019. LNCS, vol. 11347, pp. 70–82. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-46417-2_7
Chapter Google Scholar
de Armas, E.M., Haeusler, E.H., Lifschitz, S., de Holanda, M.T., da Silva, W.M.C., Ferreira, P.C.G.: K-mer Mapping and de Bruijn graphs: the case for velvet fragment assembly. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 882–889 (2016). https://doi.org/10.1109/BIBM.2016.7822642
de Armas, E.M., Silva, M.V.M., Lifschitz, S.: A study of index structures for K-mer mapping. In: Proceedings Satellite Events of the 32nd Brazilian Symposium on Databases. Databases Meet Bioinformatics Workshop, pp. 326–333 (2017)
Google Scholar
Deorowicz, S., Debudaj-Grabysz, A., Grabowski, S.: Disk-based k-mer counting on a PC. BMC Bioinform. 14(1) (2013). https://doi.org/10.1186/1471-2105-14-160
Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569 (2015). https://doi.org/10.1093/bioinformatics/btv022
Article CAS PubMed Google Scholar
El-Metwally, S., Hamza, T., Zakaria, M., Helmy, M.: Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput. Biol. 9(12), 1–19 (2013). https://doi.org/10.1371/journal.pcbi.1003345
Article CAS Google Scholar
Erbert, M., Rechner, S., Müller-Hannemann, M.: Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms Mol. Biol. 12(1), 9:1–9:12 (2017). https://doi.org/10.1186/s13015-017-0097-9
Article Google Scholar
Ghosh, P., Kalyanaraman, A.: A fast sketch-based assembler for genomes. In: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. BCB ’16, pp. 241–250. ACM, New York, NY, USA (2016). https://doi.org/10.1145/2975167.2975192
Ghosh, P., Kalyanaraman, A.: FastEtch: a fast sketch-based assembler for genomes. IEEE/ACM Trans. Comput. Biol. Bioinform. 16(4), 1091–1106 (2019). https://doi.org/10.1109/TCBB.2017.2737999
Article CAS PubMed Google Scholar
Gnerre, S., et al.: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. U.S.A. 108(4), 1513–1518 (2011). https://doi.org/10.1073/pnas.1017351108. 21187386[pmid]
Article CAS PubMed Google Scholar
Jackman, S.D., Birol, I.: Assembling genomes using short-read sequencing technology. Genome Biol. 11(1), 202 (2010). https://doi.org/10.1186/gb-2010-11-1-202. https://www.ncbi.nlm.nih.gov/pubmed/20128932, 20128932[pmid]
Article CAS PubMed PubMed Central Google Scholar
Kelley, D.R., Schatz, M.C., Salzberg, S.L.: Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11(11), R116 (2010). https://doi.org/10.1186/gb-2010-11-11-r116
Article CAS PubMed PubMed Central Google Scholar
Kleftogiannis, D., Kalnis, P., Bajic, V.B.: Comparing memory-efficient genome assemblers on stand-alone and cloud infrastructures. PLoS ONE 8(9) (2013). https://doi.org/10.1371/journal.pone.0075505
Kokot, M., Dlugosz, M., Deorowicz, S.: KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33(17), 2759–2761 (2017). https://doi.org/10.1093/bioinformatics/btx304
Article CAS PubMed Google Scholar
Li, Y., Yan, X.: MSPKmerCounter: a fast and memory efficient approach for K-mer Counting. arXiv e-prints (2015)
Google Scholar
Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum substring partitioning. PVLDB 6(3), 169–180 (2013). https://doi.org/10.14778/2535569.2448951
Article Google Scholar
Luo, R., et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1(1), 1–6 (2012). https://doi.org/10.1186/2047-217X-1-18
Article Google Scholar
Mamun, A.A., Pal, S., Rajasekaran, S.: KCMBT: a k-mer counter based on multiple burst trees. Bioinformatics 32(18), 2783 (2016). https://doi.org/10.1093/bioinformatics/btw345
Article CAS PubMed PubMed Central Google Scholar
Marcais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011). https://doi.org/10.1093/bioinformatics/btr011
Article CAS PubMed PubMed Central Google Scholar
McVicar, N., Lin, C., Hauck, S.: K-Mer counting using bloom filters with an FPGA-attached HMC. In: 25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA, 30 April–2 May 2017, pp. 203–210 (2017). https://doi.org/10.1109/FCCM.2017.23
Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinform. 12(1), 333 (2011). https://doi.org/10.1186/1471-2105-12-333
Article CAS Google Scholar
Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation sequencing data. Genomics 95(6), 315–327 (2010). https://doi.org/10.1016/j.ygeno.2010.03.001. 20211242[pmid]
Article CAS PubMed PubMed Central Google Scholar
Myers, E.W.: Toward simplifying and accurately formulating fragment assembly. J. Comput. Biol. 2(2), 275–290 (1995). https://doi.org/10.1089/cmb.1995.2.275. pMID: 7497129
Article CAS PubMed Google Scholar
Niedringhaus, T.P., Milanova, D., Kerby, M.B., Snyder, M.P., Barron, A.E.: Landscape of next-generation sequencing technologies. Anal. Chem. 83(12), 4327–4341 (2011). https://doi.org/10.1021/ac2010857
Article CAS PubMed PubMed Central Google Scholar
Pandey, P., Bender, M.A., Johnson, R., Patro, R.: deBGR: an efficient and near-exact representation of the weighted de Bruijn graph. Bioinformatics 33(14), i133–i141 (2017). https://doi.org/10.1093/bioinformatics/btx261
Article CAS PubMed PubMed Central Google Scholar
Rahman, M.M., Sharker, R., Biswas, S., Rahman, M.: HaVec: an efficient de Bruijn graph construction algorithm for genome assembly. Int. J. Genom. 2017, 1–12 (2017). https://doi.org/10.1155/2017/6120980
Article CAS Google Scholar
Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics 29(5), 652–653 (2013). https://doi.org/10.1093/bioinformatics/btt020
Article CAS PubMed Google Scholar
Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve the memory usage for de Brujin graphs. Algorithms Mol. Biol.: AMB 9, 2 (2014). https://doi.org/10.1186/1748-7188-9-2
Sanger, F., Coulson, A., Barrell, B., Smith, A., Roe, B.: Cloning in single-stranded bacteriophage as an aid to rapid DNA sequencing. J. Mol. Biol. 143(2), 161–178 (1980). https://doi.org/10.1016/0022-2836(80)90196-5
Article CAS PubMed Google Scholar
Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Res. 20(9), 1165–1173 (2010). https://doi.org/10.1101/gr.101360.109
Article CAS PubMed PubMed Central Google Scholar
Silva, M.V.M., de Holanda, M.T., Haeusler, E.H., de Armas, E.M., Lifschitz, S.: VelvetH-DB: Persistência de Dados no Processo de Montagem de Fragmentos de Sequências Biológicas. In: Proceedings Satellite Events of the 32nd Brazilian Symposium on Databases. Databases Meet Bioinformatics Workshop, pp. 334–341 (2017)
Google Scholar
Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics (Oxford, England) 26(12), i367–i373 (2010). https://doi.org/10.1093/bioinformatics/btq217
Article CAS Google Scholar
Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009). https://doi.org/10.1101/gr.089532.108
Article CAS PubMed PubMed Central Google Scholar
Titus Brown, C., Howe, A., Zhang, Q., Pyrkosz, A.B., Brom, T.H.: A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv e-prints arXiv:1203.4802 (2012)
Ye, C., Sam Ma, Z., Cannon, C., Pop, M., Yu, D.: Exploiting sparseness in de novo genome assembly. BMC Bioinform. 13(Suppl. 6), S1 (2012). https://doi.org/10.1186/1471-2105-13-S6-S1
Zerbino, D.: Velvet software. EMBL-EBI. https://www.ebi.ac.uk/zerbino/velvet/ (2016). Accessed 15 June 2019
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008). https://doi.org/10.1101/gr.074492.107
Article CAS PubMed PubMed Central Google Scholar
Zhang, Q., Pell, J., Canino-Koning, R., Howe, A.C., Brown, C.T.: These are not the K-mers you are looking for: efficient online K-mer Counting Using a Probabilistic Data Structure. PLoS ONE 9(7), 1–13 (2014). https://doi.org/10.1371/journal.pone.0101271
Article CAS Google Scholar

Download references

Acknowledgments

The authors are partially supported by grants from FUNDECT, FAPERJ, CNPq, CAPES and INCA, Brazilian Public Funding Agencies and Research Institutes.

Author information

Authors and Affiliations

Depto. Informática, PUC-Rio, Rio de Janeiro, Brazil
Elvismary Molina de Armas & Sérgio Lifschitz
Depto. Ciência da Computação, UNB, Brasília, Brazil
Maristela Holanda
Instituto de Computação, UFF, Niterói, Brazil
Daniel de Oliveira
Faculdade de Computação, UFMS, Campo Grande, MS, Brazil
Nalvo F. Almeida

Authors

Elvismary Molina de Armas
View author publications
You can also search for this author in PubMed Google Scholar
Maristela Holanda
View author publications
You can also search for this author in PubMed Google Scholar
Daniel de Oliveira
View author publications
You can also search for this author in PubMed Google Scholar
Nalvo F. Almeida
View author publications
You can also search for this author in PubMed Google Scholar
Sérgio Lifschitz
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Sérgio Lifschitz .

Editor information

Editors and Affiliations

University of São Paulo, São Paulo, Brazil
João C. Setubal
Instituto Federal de Goiás, Formosa, Goiás, Brazil
Waldeyr Mendes Silva

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

de Armas, E.M., Holanda, M., de Oliveira, D., Almeida, N.F., Lifschitz, S. (2020). A Classification of de Bruijn Graph Approaches for De Novo Fragment Assembly. In: Setubal, J.C., Silva, W.M. (eds) Advances in Bioinformatics and Computational Biology. BSB 2020. Lecture Notes in Computer Science(), vol 12558. Springer, Cham. https://doi.org/10.1007/978-3-030-65775-8_1

Download citation

DOI: https://doi.org/10.1007/978-3-030-65775-8_1
Published: 20 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65774-1
Online ISBN: 978-3-030-65775-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics