An External Memory Approach for Large Genome De Novo Assembly

  • Elvismary Molina de Armas Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio) https://orcid.org/0000-0002-0606-5994
  • Sérgio Lifschitz Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio)

Abstract


De novo genome assembly of sequenced reads is a fundamental problem in bioinformatics. When no reference genome sequence is available to guide the process, many assemblers use the de Bruijn graph data structure to improve performance. However, constructing such a graph has a high computational cost, mainly due to main-memory (RAM) consumption when read datasets are very large and contain many repeats. Building a de Bruijn graph relies on a large set of k-mers, and some existing approaches use external memory processing to make construction feasible. This work proposes an approach for constructing the de Bruijn graph that does not generate all k-mers during execution. External memory processing reduces the large number of duplicate k-mers and, consequently, the total number of k-mers that drive the number of I/O operations. Practical experiments are presented that show the solution's viability and its improvements over other common assemblers in the literature. Our solution reduces the computational requirements and makes execution feasible.
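
To make the terms in the abstract concrete, the following is a minimal in-memory sketch of how k-mers relate to a de Bruijn graph and why duplicate k-mers matter. It is not the external memory method proposed in the paper: the function names (`kmers`, `build_de_bruijn`), the toy reads, and the choice of k = 3 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kmers(read, k):
    """Yield every k-mer (substring of length k) of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Naive in-memory construction (illustrative only).

    Returns (graph, counts), where counts maps each distinct k-mer to its
    number of occurrences and graph maps each (k-1)-mer prefix to the set
    of (k-1)-mer suffixes that follow it in some k-mer.
    """
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))

    graph = defaultdict(set)
    for kmer in counts:  # one edge per distinct k-mer
        graph[kmer[:-1]].add(kmer[1:])
    return dict(graph), counts


# Hypothetical overlapping reads, chosen so that many k-mers repeat.
reads = ["ACGTAC", "CGTACG", "GTACGT"]
graph, counts = build_de_bruijn(reads, k=3)

total = sum(counts.values())  # k-mers generated across all reads
distinct = len(counts)        # k-mers that actually define graph edges
print(f"{total} k-mers generated, {distinct} distinct")
```

In this toy input most generated k-mers are duplicates, which is the redundancy the abstract refers to; a naive approach like the one above materializes all of them before deduplicating.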

Keywords: de Bruijn graph, k-mer, External memory processing, De novo assembly
Published
21 September 2022
DE ARMAS, Elvismary Molina; LIFSCHITZ, Sérgio. An External Memory Approach for Large Genome De Novo Assembly. In: SIMPÓSIO BRASILEIRO DE BIOINFORMÁTICA (BSB), 15., 2022, Búzios/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 79-90. ISSN 2316-1248.