Abstract
Research in bioinformatics has changed rapidly since the advent of next-generation sequencing (NGS). Despite the positive impact on cost reduction, assembling the generated reads remains a challenge. This paper presents in detail the main ideas related to de novo assembly, the technologies involved, and theoretical concepts about the de Bruijn graph structure. We also explain the existing approaches to minimize the memory requirements for de Bruijn graph construction. Finally, we propose a comparative view of several solutions, including the k-mers codification and the data structures used to represent and persist them.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Bankevich, A., et al.: Spades: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol.: J. Comput. Mol. Cell Biol. 19(5), 455–477 (2012). https://doi.org/10.1089/cmb.2012.0021. https://pubmed.ncbi.nlm.nih.gov/22506599
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970). https://doi.org/10.1145/362686.362692
Boucher, C., Bowe, A., Gagie, T., Puglisi, S., Sadakane, K.: Variable-order de Bruijn graphs. In: Data Compression Conference Proceedings 2015 (2014). https://doi.org/10.1109/DCC.2015.70
Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn graphs. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 225–235. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_18
Butler, J.: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res. 18(5), 810–820 (2008). https://doi.org/10.1101/gr.7337908
Chapman, J.A., Ho, I., Sunkara, S., Luo, S., Schroth, G.P., Rokhsar, D.S.: Meraculous: de novo genome assembly with short paired-end reads. PLoS ONE 6(8), e23501 (2011). https://doi.org/10.1371/journal.pone.0023501
Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de Bruijn graphs. In: Sharan, R. (ed.) RECOMB 2014. LNCS, vol. 8394, pp. 35–55. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-05269-4_4
Chikhi, R., Limasset, A., Medvedev, P.: Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32(12), i201 (2016). https://doi.org/10.1093/bioinformatics/btw279
Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol. 8(1), 22 (2013). https://doi.org/10.1186/1748-7188-8-22
Claros, M.G., Bautista, R., Guerrero-Fernández, D., Benzerki, H., Seoane, P., Fernández-Pozo, N.: Why assembling plant genome sequences is so challenging. Biology 1(2), 439 (2012). https://doi.org/10.3390/biology1020439
Conway, T.C., Bromage, A.J.: Succinct data structures for assembling large genomes. Bioinformatics 27(4), 479–486 (2011). https://doi.org/10.1093/bioinformatics/btq697
de Armas, E.M., Castro, L.C., Holanda, M., Lifschitz, S.: A new approach for de Bruijn graph construction in de novo genome assembling. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine, pp. 1842–1849 (2019)
de Armas, E.M., Ferreira, P.C.G., Haeusler, E.H., de Holanda, M.T., Lifschitz, S.: K-mer mapping and RDBMS indexes. In: Kowada, L., de Oliveira, D. (eds.) BSB 2019. LNCS, vol. 11347, pp. 70–82. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-46417-2_7
de Armas, E.M., Haeusler, E.H., Lifschitz, S., de Holanda, M.T., da Silva, W.M.C., Ferreira, P.C.G.: K-mer Mapping and de Bruijn graphs: the case for velvet fragment assembly. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 882–889 (2016). https://doi.org/10.1109/BIBM.2016.7822642
de Armas, E.M., Silva, M.V.M., Lifschitz, S.: A study of index structures for K-mer mapping. In: Proceedings Satellite Events of the 32nd Brazilian Symposium on Databases. Databases Meet Bioinformatics Workshop, pp. 326–333 (2017)
Deorowicz, S., Debudaj-Grabysz, A., Grabowski, S.: Disk-based k-mer counting on a PC. BMC Bioinform. 14(1) (2013). https://doi.org/10.1186/1471-2105-14-160
Deorowicz, S., Kokot, M., Grabowski, S., Debudaj-Grabysz, A.: KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31(10), 1569 (2015). https://doi.org/10.1093/bioinformatics/btv022
El-Metwally, S., Hamza, T., Zakaria, M., Helmy, M.: Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput. Biol. 9(12), 1–19 (2013). https://doi.org/10.1371/journal.pcbi.1003345
Erbert, M., Rechner, S., Müller-Hannemann, M.: Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms Mol. Biol. 12(1), 9:1–9:12 (2017). https://doi.org/10.1186/s13015-017-0097-9
Ghosh, P., Kalyanaraman, A.: A fast sketch-based assembler for genomes. In: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. BCB ’16, pp. 241–250. ACM, New York, NY, USA (2016). https://doi.org/10.1145/2975167.2975192
Ghosh, P., Kalyanaraman, A.: FastEtch: a fast sketch-based assembler for genomes. IEEE/ACM Trans. Comput. Biol. Bioinform. 16(4), 1091–1106 (2019). https://doi.org/10.1109/TCBB.2017.2737999
Gnerre, S., et al.: High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl. Acad. Sci. U.S.A. 108(4), 1513–1518 (2011). https://doi.org/10.1073/pnas.1017351108. 21187386[pmid]
Jackman, S.D., Birol, I.: Assembling genomes using short-read sequencing technology. Genome Biol. 11(1), 202 (2010). https://doi.org/10.1186/gb-2010-11-1-202. https://www.ncbi.nlm.nih.gov/pubmed/20128932, 20128932[pmid]
Kelley, D.R., Schatz, M.C., Salzberg, S.L.: Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 11(11), R116 (2010). https://doi.org/10.1186/gb-2010-11-11-r116
Kleftogiannis, D., Kalnis, P., Bajic, V.B.: Comparing memory-efficient genome assemblers on stand-alone and cloud infrastructures. PLoS ONE 8(9) (2013). https://doi.org/10.1371/journal.pone.0075505
Kokot, M., Dlugosz, M., Deorowicz, S.: KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33(17), 2759–2761 (2017). https://doi.org/10.1093/bioinformatics/btx304
Li, Y., Yan, X.: MSPKmerCounter: a fast and memory efficient approach for K-mer Counting. arXiv e-prints (2015)
Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum substring partitioning. PVLDB 6(3), 169–180 (2013). https://doi.org/10.14778/2535569.2448951
Luo, R., et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1(1), 1–6 (2012). https://doi.org/10.1186/2047-217X-1-18
Mamun, A.A., Pal, S., Rajasekaran, S.: KCMBT: a k-mer counter based on multiple burst trees. Bioinformatics 32(18), 2783 (2016). https://doi.org/10.1093/bioinformatics/btw345
Marcais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6), 764–770 (2011). https://doi.org/10.1093/bioinformatics/btr011
McVicar, N., Lin, C., Hauck, S.: K-Mer counting using bloom filters with an FPGA-attached HMC. In: 25th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2017, Napa, CA, USA, 30 April–2 May 2017, pp. 203–210 (2017). https://doi.org/10.1109/FCCM.2017.23
Melsted, P., Pritchard, J.K.: Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinform. 12(1), 333 (2011). https://doi.org/10.1186/1471-2105-12-333
Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation sequencing data. Genomics 95(6), 315–327 (2010). https://doi.org/10.1016/j.ygeno.2010.03.001. 20211242[pmid]
Myers, E.W.: Toward simplifying and accurately formulating fragment assembly. J. Comput. Biol. 2(2), 275–290 (1995). https://doi.org/10.1089/cmb.1995.2.275. pMID: 7497129
Niedringhaus, T.P., Milanova, D., Kerby, M.B., Snyder, M.P., Barron, A.E.: Landscape of next-generation sequencing technologies. Anal. Chem. 83(12), 4327–4341 (2011). https://doi.org/10.1021/ac2010857
Pandey, P., Bender, M.A., Johnson, R., Patro, R.: deBGR: an efficient and near-exact representation of the weighted de Bruijn graph. Bioinformatics 33(14), i133–i141 (2017). https://doi.org/10.1093/bioinformatics/btx261
Rahman, M.M., Sharker, R., Biswas, S., Rahman, M.: HaVec: an efficient de Bruijn graph construction algorithm for genome assembly. Int. J. Genom. 2017, 1–12 (2017). https://doi.org/10.1155/2017/6120980
Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics 29(5), 652–653 (2013). https://doi.org/10.1093/bioinformatics/btt020
Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve the memory usage for de Brujin graphs. Algorithms Mol. Biol.: AMB 9, 2 (2014). https://doi.org/10.1186/1748-7188-9-2
Sanger, F., Coulson, A., Barrell, B., Smith, A., Roe, B.: Cloning in single-stranded bacteriophage as an aid to rapid DNA sequencing. J. Mol. Biol. 143(2), 161–178 (1980). https://doi.org/10.1016/0022-2836(80)90196-5
Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Res. 20(9), 1165–1173 (2010). https://doi.org/10.1101/gr.101360.109
Silva, M.V.M., de Holanda, M.T., Haeusler, E.H., de Armas, E.M., Lifschitz, S.: VelvetH-DB: Persistência de Dados no Processo de Montagem de Fragmentos de Sequências Biológicas. In: Proceedings Satellite Events of the 32nd Brazilian Symposium on Databases. Databases Meet Bioinformatics Workshop, pp. 334–341 (2017)
Simpson, J.T., Durbin, R.: Efficient construction of an assembly string graph using the FM-index. Bioinformatics (Oxford, England) 26(12), i367–i373 (2010). https://doi.org/10.1093/bioinformatics/btq217
Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009). https://doi.org/10.1101/gr.089532.108
Titus Brown, C., Howe, A., Zhang, Q., Pyrkosz, A.B., Brom, T.H.: A reference-free algorithm for computational normalization of shotgun sequencing data. arXiv e-prints arXiv:1203.4802 (2012)
Ye, C., Sam Ma, Z., Cannon, C., Pop, M., Yu, D.: Exploiting sparseness in de novo genome assembly. BMC Bioinform. 13(Suppl. 6), S1 (2012). https://doi.org/10.1186/1471-2105-13-S6-S1
Zerbino, D.: Velvet software. EMBL-EBI. https://www.ebi.ac.uk/zerbino/velvet/ (2016). Accessed 15 June 2019
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18(5), 821–829 (2008). https://doi.org/10.1101/gr.074492.107
Zhang, Q., Pell, J., Canino-Koning, R., Howe, A.C., Brown, C.T.: These are not the K-mers you are looking for: efficient online K-mer Counting Using a Probabilistic Data Structure. PLoS ONE 9(7), 1–13 (2014). https://doi.org/10.1371/journal.pone.0101271
Acknowledgments
The authors are partially supported by grants from FUNDECT, FAPERJ, CNPq, CAPES and INCA, Brazilian Public Funding Agencies and Research Institutes.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
de Armas, E.M., Holanda, M., de Oliveira, D., Almeida, N.F., Lifschitz, S. (2020). A Classification of de Bruijn Graph Approaches for De Novo Fragment Assembly. In: Setubal, J.C., Silva, W.M. (eds) Advances in Bioinformatics and Computational Biology. BSB 2020. Lecture Notes in Computer Science(), vol 12558. Springer, Cham. https://doi.org/10.1007/978-3-030-65775-8_1
Download citation
DOI: https://doi.org/10.1007/978-3-030-65775-8_1
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65774-1
Online ISBN: 978-3-030-65775-8
eBook Packages: Computer ScienceComputer Science (R0)