Abstract
De novo genome assembly of sequenced reads is a fundamental problem in bioinformatics. When there is no reference genome sequence to guide the process, many assemblers programs consider using the de Bruijn Graph data structure to improve performances. However, the construction of such a graph has a high computational cost, mainly due to internal RAM consumption in the presence of very large and repeated read datasets. Building a de Bruijn Graph relies on a broad set of k-mers. Some existing approaches use external memory processing to make it feasible. This work proposes an approach for constructing the de Bruijn graph that does not generate all k-mers during the execution. An external memory processing allows reducing the high number of duplicate k-mers and, consequently, reduces the total number of k-mers that incur on the number of I/O operations. Some practical experiments are presented, showing the solution’s viability and its improvements over other common assemblers in the literature. Our solution reduces the computational requirements and enables execution feasibility.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Aggarwal, A., Vitter, J.S.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988). https://doi.org/10.1145/48529.48535. https://doi.acm.org/10.1145/48529.48535
de Armas, E.M., Ferreira, P.C.G., Haeusler, E.H., de Holanda, M.T., Lifschitz, S.: K-mer mapping and RDBMS indexes. In: Kowada, L., de Oliveira, D. (eds.) BSB 2019. LNCS, vol. 11347, pp. 70–82. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-46417-2_7
Bradnam, K.R., et al.: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2(1), 1–31 (2013)
Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de Bruijn graphs. In: Sharan, R. (ed.) RECOMB 2014. LNCS, vol. 8394, pp. 35–55. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-05269-4_4
Chikhi, R., Limasset, A., Medvedev, P.: Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32(12), i201 (2016)
Cook, J.J., Zilles, C.: Characterizing and optimizing the memory footprint of de novo short read DNA sequence assembly. In: International Symposium on Performance Analysis of Systems and Software, ISPASS 2009, pp. 143–152 (2009). https://doi.org/10.1109/ISPASS.2009.4919646
de Armas, E.M., Castro, L.C., Holanda, M., Lifschitz, S.: A new approach for de bruijn graph construction in de novo genome assembling. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1842–1849 (2019)
de Armas, E.M., Haeusler, E.H., Lifschitz, S., de Holanda, M.T., da Silva, W.M.C., Ferreira, P.C.G.: K-mer Mapping and de Bruijn graphs: the case for velvet fragment assembly. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 882–889 (2016). https://doi.org/10.1109/BIBM.2016.7822642
de Armas, E.M., Silva, M.V.M., Lifschitz, S.: A study of index structures for K-mer mapping. In: Proceedings Satellite Events of the 32nd Brazilian Symposium on Databases. Databases Meet Bioinformatics Workshop, pp. 326–333 (2017)
Demaine, E.: Lecture notes in Advanced Data Structures, MIT course number 6.851 (Spring 2012). https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-851-advanced-data-structures-spring-2012/calendar-and-notes/MIT6_851S12_L7.pdf
Kundeti, V., Rajasekaran, S., Dinh, H.: Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs. arXiv e-prints (2010)
Li, R., et al.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Research (2009)
Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum substring partitioning. Proc. VLDB Endow. 6(3), 169–180 (2013). https://doi.org/10.14778/2535569.2448951. https://dx.doi.org/10.14778/2535569.2448951
Salzberg, S.L., et al.: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22(3), 557–567 (2012)
Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Res. 20(9), 1165–1173 (2010)
Silva, M.V.M., de Holanda, M.T., Haeusler, E.H., de Armas, E.M., Lifschitz, S.: VelvetH-DB: data persistency during fragment assembly. In: Proceedings Satellite Events of the 32nd Brazilian Symposium on Databases. Databases Meet Bioinformatics Workshop, pp. 334–341 (2017). (in Portuguese)
Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)
Zerbino, D.: Velvet software. EMBL-EBI (2016). https://www.ebi.ac.uk/zerbino/velvet/. Accessed 15 June 2019
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
de Armas, E.M., Lifschitz, S. (2022). An External Memory Approach for Large Genome De Novo Assembly. In: Scherer, N.M., de Melo-Minardi, R.C. (eds) Advances in Bioinformatics and Computational Biology. BSB 2022. Lecture Notes in Computer Science(), vol 13523. Springer, Cham. https://doi.org/10.1007/978-3-031-21175-1_9
Download citation
DOI: https://doi.org/10.1007/978-3-031-21175-1_9
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21174-4
Online ISBN: 978-3-031-21175-1
eBook Packages: Computer ScienceComputer Science (R0)