An External Memory Approach for Large Genome De Novo Assembly

  • Elvismary Molina de Armas Pontifical Catholic University of Rio de Janeiro (PUC-Rio) https://orcid.org/0000-0002-0606-5994
  • Sérgio Lifschitz Pontifical Catholic University of Rio de Janeiro (PUC-Rio)

Abstract

De novo genome assembly of sequenced reads is a fundamental problem in bioinformatics. When there is no reference genome sequence to guide the process, many assembler programs use the de Bruijn graph data structure to improve performance. However, constructing such a graph has a high computational cost, mainly due to main memory (RAM) consumption when read datasets are very large and repetitive. Building a de Bruijn graph relies on a large set of k-mers, and some existing approaches use external memory processing to make construction feasible. This work proposes an approach for constructing the de Bruijn graph that does not generate all k-mers during execution. External memory processing reduces the high number of duplicate k-mers and, consequently, the total number of k-mers that drive the number of I/O operations. Practical experiments are presented, showing the solution’s viability and its improvements over other common assemblers in the literature. Our solution reduces the computational requirements and makes execution feasible.
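
To illustrate why duplicate k-mers matter, the following is a minimal in-memory sketch of naive de Bruijn graph construction. It is not the external memory method proposed in the paper; the function names (`kmers`, `build_de_bruijn`), the toy reads, and the choice of k = 3 are assumptions made only for illustration.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """Yield every k-mer (substring of length k) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Naive construction: enumerate all k-mers, then add one edge per distinct k-mer.

    Nodes are (k-1)-mers; each distinct k-mer links its prefix to its suffix.
    Hypothetical helper for illustration, not the paper's algorithm.
    """
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))

    graph = defaultdict(set)
    for kmer in counts:
        graph[kmer[:-1]].add(kmer[1:])
    return graph, counts


# Toy overlapping reads (assumed data, not from the paper).
reads = ["ACGTAC", "CGTACG", "GTACGT"]
graph, counts = build_de_bruijn(reads, k=3)

total = sum(counts.values())   # k-mers generated, duplicates included
distinct = len(counts)         # k-mers that actually become edges
print(f"generated {total} k-mers, {distinct} distinct")
```

In this toy input most generated k-mers are duplicates, so a naive approach stores or writes far more k-mers than the graph needs. Avoiding that redundant work is the kind of saving the abstract describes.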

Published
2022-09-21
How to Cite
DE ARMAS, Elvismary Molina; LIFSCHITZ, Sérgio. An External Memory Approach for Large Genome De Novo Assembly. Proceedings of the Brazilian Symposium on Bioinformatics (BSB), [S.l.], p. 79-90, sep. 2022. ISSN 2316-1248. Available at: <https://sol.sbc.org.br/index.php/bsb/article/view/22860>. Date accessed: 17 may 2024.