Skip to main content

Efficient Out-of-Core Contig Generation

  • Conference paper
  • First Online:
Book cover Advances in Bioinformatics and Computational Biology (BSB 2020)

Abstract

Genome sequencing involves splitting a genome into a set reads that are assembled into contigs that are eventually ordered and organized as scaffolds. There are many programs that consider the use of the de Bruijn Graph (dBG) but they must deal with a high computational cost, mainly due to internal RAM consumption. We propose to use an external memory approach to deal with the de Bruijn graph construction focusing on contig generation. Our proposed algorithms are based on well-known I/O efficient methods that identify unitigs and remove errors such as tips and bubbles. Our analytical evaluation shows that it becomes feasible to generate de Bruijn graphs to obtain the needed contigs, independently of the available memory.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Anderson, R.J., Miller, G.L.: A simple randomized parallel algorithm for list-ranking. Inf. Process. Lett. 33(5), 269–273 (1990)

    Article  Google Scholar 

  2. Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn Graphs. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 225–235. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_18

    Chapter  Google Scholar 

  3. Chapman, J.A., et al.: Meraculous: de novo genome assembly with short paired-end reads. PLoS One 6(8), e23501 (2011)

    Article  CAS  Google Scholar 

  4. Chiang, Y.J., et al.: External-memory graph algorithms. Procs. ACM/SIAM Symp. Discr. Algorithm. (SODA) 95, 139–149 (1995)

    Google Scholar 

  5. Chikhi, R., Limasset, A., Medvedev, P.: Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32(12), i201–i208 (2016)

    Article  CAS  Google Scholar 

  6. Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a bloom filter. Algorithm. Mol. Biol. 8(1), 22 (2013)

    Article  Google Scholar 

  7. Jackman, S.D., et al.: Abyss 2.0: resource-efficient assembly of large genomes using a bloom filter. Genome Res. 27(5), 768–777 (2017)

    Article  CAS  Google Scholar 

  8. Kundeti, V.K., et al.: Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs. BMC Bioinf. 11(1), 560 (2010)

    Article  Google Scholar 

  9. Kyrola, A., Blelloch, G., Guestrin, C.: Graphchi: large-scale graph computation on just a PC. In: USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 31–46 (2012)

    Google Scholar 

  10. Kyrola, A., Shun, J., Blelloch, G.: Beyond synchronous: new techniques for external-memory graph connectivity and minimum spanning forest. In: Gudmundsson, J., Katajainen, J. (eds.) Experimental Algorithms — SEA 2014. Lecture Notes in Computer Science, vol. 8504, pp. 123–137. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_11

    Chapter  Google Scholar 

  11. Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum substring partitioning. Proc. VLDB Endow. 6(3), 169–180 (2013)

    Article  Google Scholar 

  12. Malewicz, G., et al.: Pregel: a system for large-scale graph processing. In: Process ACM SIGMOD Intl. Conf. on Manage. Data, pp. 135–146 (2010)

    Google Scholar 

  13. McCune, R.R., Weninger, T., Madey, G.: Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Comput. Surv. (CSUR) 48(2), 25:1–25:39 (2015)

    Article  Google Scholar 

  14. Medvedev, P., Georgiou, K., Myers, G., Brudno, M.: Computability of models for sequence assembly. In: Giancarlo, R., Hannenhalli, S. (eds.) Algorithms in Bioinformatics — WABI 2007. Lecture Notes in Computer Science, pp. 289–301. Springer, Cham (2007). https://doi.org/10.1007/978-3-540-74126-8_27

    Chapter  Google Scholar 

  15. Meng, J., Seo, S., Balaji, P., Wei, Y., Wang, B., Feng, S.: Swap-assembler 2: optimization of de novo genome assembler at extreme scale. In: Proceedings of the 45th ICPP, pp. 195–204. IEEE (2016)

    Google Scholar 

  16. Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation sequencing data. Genomics 95(6), 315–327 (2010)

    Article  CAS  Google Scholar 

  17. Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve the memory usage for de Brujin graphs. Algorithm. Mol. Biol. 9(1), 2 (2014)

    Article  Google Scholar 

  18. Simpson, J.T., Pop, M.: The theory and practice of genome sequence assembly. Ann. Rev. Genomics Hum. Genet. 16, 153–172 (2015)

    Article  CAS  Google Scholar 

  19. Sohn, J., Nam, J.W.: The present and future of de novo whole-genome assembly. Briefings Bioinf. 19(1), 23–40 (2016)

    Google Scholar 

  20. Stephens, Z.D., et al.: Big data: astronomical or genomical? PLoS Bio. 13(7), e1002195 (2015)

    Article  Google Scholar 

  21. Ye, C., Ma, Z.S., Cannon, C.H., Pop, M., Douglas, W.Y.: Exploiting sparseness in de novo genome assembly. BMC (BioMed Central) Bioinf. 13, S1 (2012)

    Article  Google Scholar 

  22. Zeh, N.: I/o-efficient graph algorithms. In: EEF Summer School on Massive Data Sets (2002)

    Google Scholar 

  23. Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Gen. Res. 821–829 (2008)

    Google Scholar 

Download references

Acknowledgments

This work is partially supported by grants from CNPq, CAPES and FAPERJ, Brazilian Public Funding Agencies.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sérgio Lifschitz .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Entenza, J.O.P., Haeusler, E.H., Lifschitz, S. (2020). Efficient Out-of-Core Contig Generation. In: Setubal, J.C., Silva, W.M. (eds) Advances in Bioinformatics and Computational Biology. BSB 2020. Lecture Notes in Computer Science(), vol 12558. Springer, Cham. https://doi.org/10.1007/978-3-030-65775-8_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-65775-8_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-65774-1

  • Online ISBN: 978-3-030-65775-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics