An External Memory Approach for Large Genome De Novo Assembly
Abstract
De novo genome assembly of sequenced reads is a fundamental problem in bioinformatics. When no reference genome sequence is available to guide the process, many assemblers use the de Bruijn graph data structure to improve performance. However, constructing such a graph has a high computational cost, mainly due to main-memory (RAM) consumption when read datasets are very large and highly repetitive. Building a de Bruijn graph requires a large set of k-mers, and some existing approaches use external memory processing to make the construction feasible. This work proposes an approach for constructing the de Bruijn graph that does not generate all k-mers during execution. External memory processing reduces the large number of duplicate k-mers and, consequently, the total number of k-mers that determine the number of I/O operations. Practical experiments show the solution's viability and its improvements over other common assemblers in the literature. Our solution reduces the computational requirements and makes execution feasible.
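To illustrate why duplicate k-mers matter, the following minimal sketch builds a small de Bruijn graph entirely in memory. It is not the external memory method proposed in this work; the function names `kmers` and `build_de_bruijn`, the example reads, and the choice k = 3 are assumptions made only for illustration.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every k-mer (substring of length k) of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Return {(k-1)-mer prefix: {(k-1)-mer suffix: multiplicity}}.

    Each distinct k-mer becomes one edge prefix -> suffix. A repeated
    k-mer only increments the edge counter instead of being stored again.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph


# Hypothetical toy input: two overlapping reads.
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=3)

total_kmers = sum(len(r) - 3 + 1 for r in reads)
distinct_edges = sum(len(successors) for successors in graph.values())
print(f"k-mers generated: {total_kmers}, distinct edges: {distinct_edges}")
```

In this toy input, half of the generated k-mers are duplicates of one another. On real read sets, such duplicates are what inflate memory use and I/O volume when every k-mer is materialized before the graph is built.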