Skip to main content

A Classification of de Bruijn Graph Approaches for De Novo Fragment Assembly

  • Conference paper
  • First Online:
Book cover Advances in Bioinformatics and Computational Biology (BSB 2020)

Abstract

Research in bioinformatics has changed rapidly since the advent of next-generation sequencing (NGS). Despite the positive impact on cost reduction, assembling the generated reads remains a challenge. This paper presents in detail the main ideas related to de novo assembly, the technologies involved, and theoretical concepts about the de Bruijn graph structure. We also explain the existing approaches to minimize the memory requirements for de Bruijn graph construction. Finally, we propose a comparative view of several solutions, including the k-mers codification and the data structures used to represent and persist them.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Bankevich, A., et al.: Spades: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol.: J. Comput. Mol. Cell Biol. 19(5), 455–477 (2012). https://doi.org/10.1089/cmb.2012.0021. https://pubmed.ncbi.nlm.nih.gov/22506599

    Article  CAS  Google Scholar 

  2. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970). https://doi.org/10.1145/362686.362692

    Article  Google Scholar 

  3. Boucher, C., Bowe, A., Gagie, T., Puglisi, S., Sadakane, K.: Variable-order de Bruijn graphs. In: Data Compression Conference Proceedings 2015 (2014). https://doi.org/10.1109/DCC.2015.70

  4. Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn graphs. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 225–235. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_18

    Chapter  Google Scholar 

  5. Butler, J.: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18(5), 810–820 (2008). https://doi.org/10.1101/gr.7337908

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Chapman, J.A., Ho, I., Sunkara, S., Luo, S., Schroth, G.P., Rokhsar, D.S.: Meraculous: de novo genome assembly with short paired-end reads. PLoS ONE 6(8), e23501 (2011). https://doi.org/10.1371/journal.pone.0023501

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de Bruijn graphs. In: Sharan, R. (ed.) RECOMB 2014. LNCS, vol. 8394, pp. 35–55. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-05269-4_4

    Chapter  Google Scholar 

  8. Chikhi, R., Limasset, A., Medvedev, P.: Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32(12), i201 (2016). https://doi.org/10.1093/bioinformatics/btw279

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol. 8(1), 22 (2013). https://doi.org/10.1186/1748-7188-8-22

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Claros, M.G., Bautista, R., Guerrero-Fernández, D., Benzerki, H., Seoane, P., Fernández-Pozo, N.: Why assembling plant genome sequences is so challenging. Biology 1(2), 439 (2012). https://doi.org/10.3390/biology1020439

    Article  PubMed  PubMed Central  Google Scholar 

  11. Conway, T.C., Bromage, A.J.: Succinct data structures for assembling large genomes. Bioinformatics 27(4), 479–486 (2011). https://doi.org/10.1093/bioinformatics/btq697

    Article  CAS  PubMed  Google Scholar 

  12. de Armas, E.M., Castro, L.C., Holanda, M., Lifschitz, S.: A new approach for de Bruijn graph construction in de novo genome assembling. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine, pp. 1842–1849 (2019)

    Google Scholar 

  13. de Armas, E.M., Ferreira, P.C.G., Haeusler, E.H., de Holanda, M.T., Lifschitz, S.: K-mer mapping and RDBMS indexes. In: Kowada, L., de Oliveira, D. (eds.) BSB 2019. LNCS, vol. 11347, pp. 70–82. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-46417-2_7

    Chapter  Google Scholar 

  14. de Armas, E.M., Haeusler, E.H., Lifschitz, S., de Holanda, M.T., da Silva, W.M.C., Ferreira, P.C.G.: K-mer Mapping and de Bruijn graphs: the case for velvet fragment assembly. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 882–889 (2016). https://doi.org/10.1109/BIBM.2016.7822642

  15. de Armas, E.M., Silva, M.V.M., Lifschitz, S.: A study of index structures for K-mer mapping. In: Proceedings Satellite Events of the 32nd Brazilian Symposium on Databases. Databases Meet Bioinformatics Workshop, pp. 326–333 (2017)

    Google Scholar 

  16. Deorowicz, S., Debudaj-Grabysz, A., Grabowski, S.: Disk-based k-mer counting on a PC. BMC Bioinform. 14(1) (2013). https://doi.org/10.1186/1471-2105-14-160

  17. Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569 (2015). https://doi.org/10.1093/bioinformatics/btv022

    Article  CAS  PubMed  Google Scholar 

  18. El-Metwally, S., Hamza, T., Zakaria, M., Helmy, M.: Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput. Biol. 9(12), 1–19 (2013). https://doi.org/10.1371/journal.pcbi.1003345

    Article  CAS  Google Scholar 

  19. Erbert, M., Rechner, S., Müller-Hannemann, M.: Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms Mol. Biol. 12(1), 9:1–9:12 (2017). https://doi.org/10.1186/s13015-017-0097-9

    Article  Google Scholar 

  20. Ghosh, P., Kalyanaraman, A.: A fast sketch-based assembler for genomes. In: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. BCB ’16, pp. 241–250. ACM, New York, NY, USA (2016). https://doi.org/10.1145/2975167.2975192

  21. Ghosh, P., Kalyanaraman, A.: FastEtch: a fast sketch-based assembler for genomes. IEEE/ACM Trans. Comput. Biol. Bioinform. 16(4), 1091–1106 (2019). https://doi.org/10.1109/TCBB.2017.2737999

    Article  CAS  PubMed  Google Scholar 

  22. Gnerre, S., et al.: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. U.S.A. 108(4), 1513–1518 (2011). https://doi.org/10.1073/pnas.1017351108. 21187386[pmid]

    Article  CAS  PubMed  Google Scholar 

  23. Jackman, S.D., Birol, I.: Assembling genomes using short-read sequencing technology. Genome Biol. 11(1), 202 (2010). https://doi.org/10.1186/gb-2010-11-1-202. https://www.ncbi.nlm.nih.gov/pubmed/20128932, 20128932[pmid]

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Kelley, D.R., Schatz, M.C., Salzberg, S.L.: Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11(11), R116 (2010). https://doi.org/10.1186/gb-2010-11-11-r116

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Kleftogiannis, D., Kalnis, P., Bajic, V.B.: Comparing memory-efficient genome assemblers on stand-alone and cloud infrastructures. PLoS ONE 8(9) (2013). https://doi.org/10.1371/journal.pone.0075505

  26. Kokot, M., Dlugosz, M., Deorowicz, S.: KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33(17), 2759–2761 (2017). https://doi.org/10.1093/bioinformatics/btx304

    Article  CAS  PubMed  Google Scholar 

  27. Li, Y., Yan, X.: MSPKmerCounter: a fast and memory efficient approach for K-mer Counting. arXiv e-prints (2015)

    Google Scholar 

  28. Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum substring partitioning. PVLDB 6(3), 169–180 (2013). https://doi.org/10.14778/2535569.2448951

    Article  Google Scholar 

  29. Luo, R., et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1(1), 1–6 (2012). https://doi.org/10.1186/2047-217X-1-18

    Article  Google Scholar 

  30. Mamun, A.A., Pal, S., Rajasekaran, S.: KCMBT: a k-mer counter based on multiple burst trees. Bioinformatics 32(18), 2783 (2016). https://doi.org/10.1093/bioinformatics/btw345

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Marcais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011). https://doi.org/10.1093/bioinformatics/btr011

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. McVicar, N., Lin, C., Hauck, S.: K-Mer counting using bloom filters with an FPGA-attached HMC. In: 25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA, 30 April–2 May 2017, pp. 203–210 (2017). https://doi.org/10.1109/FCCM.2017.23

  33. Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinform. 12(1), 333 (2011). https://doi.org/10.1186/1471-2105-12-333

    Article  CAS  Google Scholar 

  34. Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation sequencing data. Genomics 95(6), 315–327 (2010). https://doi.org/10.1016/j.ygeno.2010.03.001. 20211242[pmid]

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Myers, E.W.: Toward simplifying and accurately formulating fragment assembly. J. Comput. Biol. 2(2), 275–290 (1995). https://doi.org/10.1089/cmb.1995.2.275. pMID: 7497129

    Article  CAS  PubMed  Google Scholar 

  36. Niedringhaus, T.P., Milanova, D., Kerby, M.B., Snyder, M.P., Barron, A.E.: Landscape of next-generation sequencing technologies. Anal. Chem. 83(12), 4327–4341 (2011). https://doi.org/10.1021/ac2010857

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  37. Pandey, P., Bender, M.A., Johnson, R., Patro, R.: deBGR: an efficient and near-exact representation of the weighted de Bruijn graph. Bioinformatics 33(14), i133–i141 (2017). https://doi.org/10.1093/bioinformatics/btx261

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Rahman, M.M., Sharker, R., Biswas, S., Rahman, M.: HaVec: an efficient de Bruijn graph construction algorithm for genome assembly. Int. J. Genom. 2017, 1–12 (2017). https://doi.org/10.1155/2017/6120980

    Article  CAS  Google Scholar 

  39. Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics 29(5), 652–653 (2013). https://doi.org/10.1093/bioinformatics/btt020

    Article  CAS  PubMed  Google Scholar 

  40. Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve the memory usage for de Brujin graphs. Algorithms Mol. Biol.: AMB 9, 2 (2014). https://doi.org/10.1186/1748-7188-9-2

  41. Sanger, F., Coulson, A., Barrell, B., Smith, A., Roe, B.: Cloning in single-stranded bacteriophage as an aid to rapid DNA sequencing. J. Mol. Biol. 143(2), 161–178 (1980). https://doi.org/10.1016/0022-2836(80)90196-5

    Article  CAS  PubMed  Google Scholar 

  42. Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Res. 20(9), 1165–1173 (2010). https://doi.org/10.1101/gr.101360.109

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  43. Silva, M.V.M., de Holanda, M.T., Haeusler, E.H., de Armas, E.M., Lifschitz, S.: VelvetH-DB: Persistência de Dados no Processo de Montagem de Fragmentos de Sequências Biológicas. In: Proceedings Satellite Events of the 32nd Brazilian Symposium on Databases. Databases Meet Bioinformatics Workshop, pp. 334–341 (2017)

    Google Scholar 

  44. Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics (Oxford, England) 26(12), i367–i373 (2010). https://doi.org/10.1093/bioinformatics/btq217

    Article  CAS  Google Scholar 

  45. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009). https://doi.org/10.1101/gr.089532.108

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  46. Titus Brown, C., Howe, A., Zhang, Q., Pyrkosz, A.B., Brom, T.H.: A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv e-prints arXiv:1203.4802 (2012)

  47. Ye, C., Sam Ma, Z., Cannon, C., Pop, M., Yu, D.: Exploiting sparseness in de novo genome assembly. BMC Bioinform. 13(Suppl. 6), S1 (2012). https://doi.org/10.1186/1471-2105-13-S6-S1

  48. Zerbino, D.: Velvet software. EMBL-EBI. https://www.ebi.ac.uk/zerbino/velvet/ (2016). Accessed 15 June 2019

  49. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008). https://doi.org/10.1101/gr.074492.107

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  50. Zhang, Q., Pell, J., Canino-Koning, R., Howe, A.C., Brown, C.T.: These are not the K-mers you are looking for: efficient online K-mer Counting Using a Probabilistic Data Structure. PLoS ONE 9(7), 1–13 (2014). https://doi.org/10.1371/journal.pone.0101271

    Article  CAS  Google Scholar 

Download references

Acknowledgments

The authors are partially supported by grants from FUNDECT, FAPERJ, CNPq, CAPES and INCA, Brazilian Public Funding Agencies and Research Institutes.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sérgio Lifschitz .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

de Armas, E.M., Holanda, M., de Oliveira, D., Almeida, N.F., Lifschitz, S. (2020). A Classification of de Bruijn Graph Approaches for De Novo Fragment Assembly. In: Setubal, J.C., Silva, W.M. (eds) Advances in Bioinformatics and Computational Biology. BSB 2020. Lecture Notes in Computer Science(), vol 12558. Springer, Cham. https://doi.org/10.1007/978-3-030-65775-8_1

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-65775-8_1

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-65774-1

  • Online ISBN: 978-3-030-65775-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics