Abstract
Genome sequencing involves splitting a genome into a set reads that are assembled into contigs that are eventually ordered and organized as scaffolds. There are many programs that consider the use of the de Bruijn Graph (dBG) but they must deal with a high computational cost, mainly due to internal RAM consumption. We propose to use an external memory approach to deal with the de Bruijn graph construction focusing on contig generation. Our proposed algorithms are based on well-known I/O efficient methods that identify unitigs and remove errors such as tips and bubbles. Our analytical evaluation shows that it becomes feasible to generate de Bruijn graphs to obtain the needed contigs, independently of the available memory.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Anderson, R.J., Miller, G.L.: A simple randomized parallel algorithm for list-ranking. Inf. Process. Lett. 33(5), 269–273 (1990)
Bowe, A., Onodera, T., Sadakane, K., Shibuya, T.: Succinct de Bruijn Graphs. In: Raphael, B., Tang, J. (eds.) WABI 2012. LNCS, vol. 7534, pp. 225–235. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33122-0_18
Chapman, J.A., et al.: Meraculous: de novo genome assembly with short paired-end reads. PLoS One 6(8), e23501 (2011)
Chiang, Y.J., et al.: External-memory graph algorithms. Procs. ACM/SIAM Symp. Discr. Algorithm. (SODA) 95, 139–149 (1995)
Chikhi, R., Limasset, A., Medvedev, P.: Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32(12), i201–i208 (2016)
Chikhi, R., Rizk, G.: Space-efficient and exact de Bruijn graph representation based on a bloom filter. Algorithm. Mol. Biol. 8(1), 22 (2013)
Jackman, S.D., et al.: Abyss 2.0: resource-efficient assembly of large genomes using a bloom filter. Genome Res. 27(5), 768–777 (2017)
Kundeti, V.K., et al.: Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs. BMC Bioinf. 11(1), 560 (2010)
Kyrola, A., Blelloch, G., Guestrin, C.: Graphchi: large-scale graph computation on just a PC. In: USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 31–46 (2012)
Kyrola, A., Shun, J., Blelloch, G.: Beyond synchronous: new techniques for external-memory graph connectivity and minimum spanning forest. In: Gudmundsson, J., Katajainen, J. (eds.) Experimental Algorithms — SEA 2014. Lecture Notes in Computer Science, vol. 8504, pp. 123–137. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07959-2_11
Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum substring partitioning. Proc. VLDB Endow. 6(3), 169–180 (2013)
Malewicz, G., et al.: Pregel: a system for large-scale graph processing. In: Process ACM SIGMOD Intl. Conf. on Manage. Data, pp. 135–146 (2010)
McCune, R.R., Weninger, T., Madey, G.: Thinking like a vertex: a survey of vertex-centric frameworks for large-scale distributed graph processing. ACM Comput. Surv. (CSUR) 48(2), 25:1–25:39 (2015)
Medvedev, P., Georgiou, K., Myers, G., Brudno, M.: Computability of models for sequence assembly. In: Giancarlo, R., Hannenhalli, S. (eds.) Algorithms in Bioinformatics — WABI 2007. Lecture Notes in Computer Science, pp. 289–301. Springer, Cham (2007). https://doi.org/10.1007/978-3-540-74126-8_27
Meng, J., Seo, S., Balaji, P., Wei, Y., Wang, B., Feng, S.: Swap-assembler 2: optimization of de novo genome assembler at extreme scale. In: Proceedings of the 45th ICPP, pp. 195–204. IEEE (2016)
Miller, J.R., Koren, S., Sutton, G.: Assembly algorithms for next-generation sequencing data. Genomics 95(6), 315–327 (2010)
Salikhov, K., Sacomoto, G., Kucherov, G.: Using cascading bloom filters to improve the memory usage for de Brujin graphs. Algorithm. Mol. Biol. 9(1), 2 (2014)
Simpson, J.T., Pop, M.: The theory and practice of genome sequence assembly. Ann. Rev. Genomics Hum. Genet. 16, 153–172 (2015)
Sohn, J., Nam, J.W.: The present and future of de novo whole-genome assembly. Briefings Bioinf. 19(1), 23–40 (2016)
Stephens, Z.D., et al.: Big data: astronomical or genomical? PLoS Bio. 13(7), e1002195 (2015)
Ye, C., Ma, Z.S., Cannon, C.H., Pop, M., Douglas, W.Y.: Exploiting sparseness in de novo genome assembly. BMC (BioMed Central) Bioinf. 13, S1 (2012)
Zeh, N.: I/o-efficient graph algorithms. In: EEF Summer School on Massive Data Sets (2002)
Zerbino, D.R., Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Gen. Res. 821–829 (2008)
Acknowledgments
This work is partially supported by grants from CNPq, CAPES and FAPERJ, Brazilian Public Funding Agencies.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Entenza, J.O.P., Haeusler, E.H., Lifschitz, S. (2020). Efficient Out-of-Core Contig Generation. In: Setubal, J.C., Silva, W.M. (eds) Advances in Bioinformatics and Computational Biology. BSB 2020. Lecture Notes in Computer Science(), vol 12558. Springer, Cham. https://doi.org/10.1007/978-3-030-65775-8_3
Download citation
DOI: https://doi.org/10.1007/978-3-030-65775-8_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65774-1
Online ISBN: 978-3-030-65775-8
eBook Packages: Computer ScienceComputer Science (R0)