Redundancy Treatment of NGS Contigs in Microbial Genome Finishing with Hashing-Based Approach
Abstract
Repetitive DNA sequences longer than the read length produce assembly gaps. In addition, repeats can cause complex, misassembled rearrangements that create branches in assembly graphs, and the assembler must decide which path to follow. Incorrect decisions create false associations, called chimeric contigs. Reads coming from different copies of a repetitive region of the genome may be wrongly assembled into a single contig, a repetitive contig. Furthermore, the growing use of hybrid assembly approaches, which combine data from different sequencing platforms, different fragment sizes, or even distinct assemblers, significantly increases the number of contigs generated and, consequently, the redundancy in the data. This work therefore presents a hybrid computational method to detect and eliminate redundant contigs from microbial genome assemblies. It consists of two hashing-based techniques: a Bloom filter to detect duplicated contigs and Locality-Sensitive Hashing (LSH) to remove similar contigs. Reducing redundancy facilitates downstream analysis and shortens the time required to finish and curate genomic assemblies. The hybrid assembly of the GAGE-B dataset was performed with the SPAdes assembler (de Bruijn graph) and the Fermi assembler (OLC). The proposed pipeline was applied to the resulting contigs, and its performance was compared with that of similar tools such as HSBLASTN, Simplifier, and CD-HIT. Results are presented.
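The two hashing stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the k-mer size, the number of hash functions, the banding scheme, and the use of MinHash signatures as the LSH family are all assumptions chosen for illustration. A contig and its reverse complement are treated as the same sequence, and input is assumed to be uppercase A/C/G/T.

```python
import hashlib

COMP = str.maketrans("ACGT", "TGCA")  # assumes uppercase A/C/G/T only

def reverse_complement(seq):
    return seq.translate(COMP)[::-1]

def canonical(seq):
    # Strand-independent key: a sequence and its reverse complement map to one string.
    return min(seq, reverse_complement(seq))

def hash64(item, seed):
    # Seeded 64-bit hash built on BLAKE2b from the standard library.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=5):  # illustrative sizes
        self.size, self.num_hashes = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        return [hash64(item, s) % self.size for s in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

def drop_exact_duplicates(contigs):
    # Stage 1: keep the first occurrence of each contig. A Bloom filter hit means
    # "probably seen", so a rare false positive could drop a unique contig.
    seen, kept = BloomFilter(), []
    for seq in contigs:
        key = canonical(seq)
        if key not in seen:
            seen.add(key)
            kept.append(seq)
    return kept

def minhash_signature(seq, k=8, num_perm=32):
    # Stage 2a: MinHash over canonical k-mers (hypothetical parameters).
    kmers = {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    if not kmers:
        raise ValueError("contig shorter than k")
    return [min(hash64(km, seed) for km in kmers) for seed in range(num_perm)]

def estimate_similarity(sig_a, sig_b):
    # Fraction of matching positions estimates the Jaccard index of the k-mer sets.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def lsh_candidate_pairs(signatures, bands=8):
    # Stage 2b: signatures sharing any whole band land in one bucket.
    rows = len(signatures[0]) // bands
    buckets = {}
    for idx, sig in enumerate(signatures):
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(idx)
    return {(i, j) for ids in buckets.values() for i in ids for j in ids if i < j}
```

In a sketch like this, each candidate pair returned by `lsh_candidate_pairs` would still be checked with `estimate_similarity` (or a real alignment) against a chosen threshold before the shorter contig is discarded; that threshold is a tuning decision not specified in the abstract.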
Keywords:
NGS contigs, Redundancy detection, Genome finishing, Bloom filter, LSH
Published
23 November 2020
How to Cite
BRAGA, Marcus; PINHEIRO, Kenny; ARAÚJO, Fabrício; MIRANDA, Fábio; SILVA, Artur; RAMOS, Rommel. Redundancy Treatment of NGS Contigs in Microbial Genome Finishing with Hashing-Based Approach. In: SIMPÓSIO BRASILEIRO DE BIOINFORMÁTICA (BSB), 13., 2020, Online Event. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 13-24. ISSN 2316-1248.