Skip to main content

An External Memory Approach for Large Genome De Novo Assembly

  • Conference paper
  • First Online:
Advances in Bioinformatics and Computational Biology (BSB 2022)

Abstract

De novo genome assembly of sequenced reads is a fundamental problem in bioinformatics. When there is no reference genome sequence to guide the process, many assemblers programs consider using the de Bruijn Graph data structure to improve performances. However, the construction of such a graph has a high computational cost, mainly due to internal RAM consumption in the presence of very large and repeated read datasets. Building a de Bruijn Graph relies on a broad set of k-mers. Some existing approaches use external memory processing to make it feasible. This work proposes an approach for constructing the de Bruijn graph that does not generate all k-mers during the execution. An external memory processing allows reducing the high number of duplicate k-mers and, consequently, reduces the total number of k-mers that incur on the number of I/O operations. Some practical experiments are presented, showing the solution’s viability and its improvements over other common assemblers in the literature. Our solution reduces the computational requirements and enables execution feasibility.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 44.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 59.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Similar content being viewed by others

References

  1. Aggarwal, A., Vitter, J.S.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988). https://doi.org/10.1145/48529.48535. https://doi.acm.org/10.1145/48529.48535

  2. de Armas, E.M., Ferreira, P.C.G., Haeusler, E.H., de Holanda, M.T., Lifschitz, S.: K-mer mapping and RDBMS indexes. In: Kowada, L., de Oliveira, D. (eds.) BSB 2019. LNCS, vol. 11347, pp. 70–82. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-46417-2_7

    Chapter  Google Scholar 

  3. Bradnam, K.R., et al.: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2(1), 1–31 (2013)

    Google Scholar 

  4. Chikhi, R., Limasset, A., Jackman, S., Simpson, J.T., Medvedev, P.: On the representation of de Bruijn graphs. In: Sharan, R. (ed.) RECOMB 2014. LNCS, vol. 8394, pp. 35–55. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-05269-4_4

    Chapter  Google Scholar 

  5. Chikhi, R., Limasset, A., Medvedev, P.: Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinformatics 32(12), i201 (2016)

    Google Scholar 

  6. Cook, J.J., Zilles, C.: Characterizing and optimizing the memory footprint of de novo short read DNA sequence assembly. In: International Symposium on Performance Analysis of Systems and Software, ISPASS 2009, pp. 143–152 (2009). https://doi.org/10.1109/ISPASS.2009.4919646

  7. de Armas, E.M., Castro, L.C., Holanda, M., Lifschitz, S.: A new approach for de bruijn graph construction in de novo genome assembling. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1842–1849 (2019)

    Google Scholar 

  8. de Armas, E.M., Haeusler, E.H., Lifschitz, S., de Holanda, M.T., da Silva, W.M.C., Ferreira, P.C.G.: K-mer Mapping and de Bruijn graphs: the case for velvet fragment assembly. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 882–889 (2016). https://doi.org/10.1109/BIBM.2016.7822642

  9. de Armas, E.M., Silva, M.V.M., Lifschitz, S.: A study of index structures for K-mer mapping. In: Proceedings Satellite Events of the 32nd Brazilian Symposium on Databases. Databases Meet Bioinformatics Workshop, pp. 326–333 (2017)

    Google Scholar 

  10. Demaine, E.: Lecture notes in Advanced Data Structures, MIT course number 6.851 (Spring 2012). https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-851-advanced-data-structures-spring-2012/calendar-and-notes/MIT6_851S12_L7.pdf

  11. Kundeti, V., Rajasekaran, S., Dinh, H.: Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs. arXiv e-prints (2010)

    Google Scholar 

  12. Li, R., et al.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Research (2009)

    Google Scholar 

  13. Li, Y., Kamousi, P., Han, F., Yang, S., Yan, X., Suri, S.: Memory efficient minimum substring partitioning. Proc. VLDB Endow. 6(3), 169–180 (2013). https://doi.org/10.14778/2535569.2448951. https://dx.doi.org/10.14778/2535569.2448951

  14. Salzberg, S.L., et al.: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22(3), 557–567 (2012)

    Article  CAS  Google Scholar 

  15. Schatz, M.C., Delcher, A.L., Salzberg, S.L.: Assembly of large genomes using second-generation sequencing. Genome Res. 20(9), 1165–1173 (2010)

    Article  CAS  Google Scholar 

  16. Silva, M.V.M., de Holanda, M.T., Haeusler, E.H., de Armas, E.M., Lifschitz, S.: VelvetH-DB: data persistency during fragment assembly. In: Proceedings Satellite Events of the 32nd Brazilian Symposium on Databases. Databases Meet Bioinformatics Workshop, pp. 334–341 (2017). (in Portuguese)

    Google Scholar 

  17. Simpson, J.T., Wong, K., Jackman, S.D., Schein, J.E., Jones, S.J., Birol, I.: ABySS: a parallel assembler for short read sequence data. Genome Res. 19(6), 1117–1123 (2009)

    Article  CAS  Google Scholar 

  18. Zerbino, D.: Velvet software. EMBL-EBI (2016). https://www.ebi.ac.uk/zerbino/velvet/. Accessed 15 June 2019

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Elvismary Molina de Armas .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

de Armas, E.M., Lifschitz, S. (2022). An External Memory Approach for Large Genome De Novo Assembly. In: Scherer, N.M., de Melo-Minardi, R.C. (eds) Advances in Bioinformatics and Computational Biology. BSB 2022. Lecture Notes in Computer Science(), vol 13523. Springer, Cham. https://doi.org/10.1007/978-3-031-21175-1_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-21175-1_9

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21174-4

  • Online ISBN: 978-3-031-21175-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics