Abstract
Repetitive DNA sequences longer than reads’ length produce assembly gaps. In addition, repetition can cause complex and misassembled rearrangements that creates branches in assembler graphs. Algorithms must decide which way is the best. Incorrect decisions create false associations, called chimeric contigs. Reads coming from different copies of a repetitive region on genome may be wrongly assembled as a unique contig, a repetitive contig. Furthermore, the growth of hybrid assembling approaches using different sequencing platforms data, different fragment sizes or even data from distinct assemblers are responsible for significantly increasing in the amount of generated contigs and therefore subsequent redundancy on data. Thus, this work presents a hybrid computational method to detect and eliminate redundant contigs from microbial genome assemblies. It consists of two Hashing-Based techniques: a Bloom Filter to detect duplicated contigs and a Locality-Sensitive Hashing (LSH) to remove similar contigs. The redundancy reduction facilitates downstream analysis and diminishes the required time to finishing and curate genomic assemblies. The hybrid assembly of GAGE-B dataset was performed with SPAdes (De Bruijn Graph) assembler and Fermi (OLC) assembler. The proposed pipeline was applied to the resulting contigs and the performance compared to other similar tools such as HSBLASTN, Simplifier and CD-HIT. Results are presented.
Keywords
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Ambardar, S., Gupta, R., Trakroo, D., Lal, R., Vakhlu, J.: High throughput sequencing: an overview of sequencing chemistry. Indian J. Microbiol. 56(4), 394–404 (2016)
Nagarajan, N., Pop, M.: Sequence assembly demystified. Nat. Rev. Genet. 14(3), 157–167 (2013)
El-Metwally, S., Hamza, T., Zakaria, M., Helmy, M.: Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput. Biol. 9(12), e1003345 (2013)
Martin, J.A., Wang, Z.: Next-generation transcriptome assembly. Nat. Rev. Genet. 12(10), 671–682 (2011)
Goswami, M., et al.: Distance sensitive bloom filters without false negatives. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’17, USA, 2017, pp. 257–269. Society for Industrial and Applied Mathematics (2017)
Tang, L., Li, M., Fang-Xiang, W., Pan, Y., Wang, J.: MAC: Merging assemblies by using adjacency algebraic model and classification. Front. Genet. 10, 1396 (2020)
de Sousa Paz, H.E.: reSHAPE : montagem hibrida de genomas com foco em organismos bacterianos combinando ferramentas de novo. Dissertacao (2018)
Treangen, T.J., Salzberg, S.L.: Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13(1), 36–46 (2011)
Batzer, M.A., Deininger, P.L.: Alu repeats and human genomic diversity. Nat. Rev. Genet. 3(5), 370–379 (2002)
Zavodna, M., Bagshaw, A., Brauning, R., Gemmell, N.J.: The accuracy, feasibility and challenges of sequencing short tandem repeats using next-generation sequencing platforms. PLoS ONE 9(12), e113862 (2014)
Phillippy, A.M., Schatz, M.C., Pop, M.: Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 9(3), R55 (2008)
Wetzel, J., Kingsford, C., Pop, M.: Assessing the benefits of using mate-pairs to resolve repeats in de novo short-read prokaryotic assemblies. BMC Bioinform. 12(1), 95 (2011)
Bradnam, K.R., et al.: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2(1), 2047–217X (2013)
Nagarajan, N., Pop, M.: Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J. Comput. Biol. 16(7), 897–908 (2009)
Ramos, R.T.J., Carneiro, A.R., Azevedo, V., Schneider, M.P., Barh, D., Silva, A.: Simplifier: a web tool to eliminate redundant NGS contigs. Bioinformation 8(20), 996–999 (2012)
Galil, Z., Giancarlo, R.: Data structures and algorithms for approximate string matching. J. Complex. 4(1), 33–72 (1988)
Pandiselvam, P., Marimuthu, T., Lawrance, R.: A comparative study on string matching algorithm of biological sequences (2014)
Al-Khamaiseh, K., ALShagarin, S.: A survey of string matching algorithms. Int. J. Eng. Res. Appl. 4, 144–156 (2014)
Wang, J., Shen, H.T., Song, J., Ji, J.: Hashing for similarity search: a survey (2014)
Chauhan, S.S., Batra, S.: Finding similar items using lsh and bloom filter. In: 2014 IEEE International Conference on Advanced Communications, Control and Computing Technologies, pp. 1662–1666 (2014)
Bender, M.A., et al.: Don’t thrash: how to cache your hash on flash. Proc. VLDB Endow. 5, 1627–1637 (2012)
Slaney, M., Casey, M.: Locality-sensitive hashing for finding nearest neighbors [lecture notes]. Signal Process. Mag. IEEE 25, 128–131 (2008)
Baluja, S., Covell, M.: Learning to hash: forgiving hash functions and applications. Data Min. Knowl. Disc. 17(3), 402–430 (2008)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Conference Proceedings of the Annual ACM Symposium on Theory of Computing, pp. 604–613 (2000)
Broder, A.Z.: On the resemblance and containment of documents. In: Proceedings of the Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171), pp. 21–29 (1997)
Jain, R., Rawat, M., Jain, S.: Data optimization techniques using bloom filter in big data. Int. J. Comput. Appl. 142, 23–27 (2016)
Stephens, Z.D., et al.: Big data: astronomical or genomical? PLoS Biol. 13(7), e1002195 (2015)
Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51(1), 117–122 (2008)
Ding, K., Huo, C., Fan, B., Xiang, S., Pan, C.: In defense of locality-sensitive hashing. IEEE Trans. Neural Netw. Learn. Syst. 29(1), 87–103 (2018)
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Broder, A., Mitzenmacher, M.: Survey: Network applications of bloom filters: a survey. Internet Math. 1, 11 (2003)
Naor, M., Yogev, E.: Tight bounds for sliding bloom filters. Algorithmica 73(4), 652–672 (2015)
Magoc, T., et al.: GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics 29(14), 1718–1725 (2013)
Aronesty, E.: Comparison of sequencing utility programs. Open Bioinform. J. 7(1), 1–8 (2013)
Bankevich, A., et al.: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19(5), 455–477 (2012)
Li, H.: Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics (Oxford, England) 28(14), 1838–1844 (2012)
Chikhi, R., Medvedev, P.: Informed and automated k-mer size selection for genome assembly. Bioinformatics 30(1), 31–37 (2013)
Gurevich, A., Saveliev, V., Vyahhi, N., Tesler, G.: QUAST: quality assessment tool for genome assemblies. Bioinformatics 29(8), 1072–1075 (2013)
Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2014)
Chen, Y., Ye, W., Zhang, Y., Yuesheng, X.: High speed BLASTN: an accelerated MegaBLAST search tool. Nucleic Acids Res. 43(16), 7762–7768 (2015)
Li, W., Godzik, A.: CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22(13), 1658–1659 (2006)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Braga, M., Pinheiro, K., Araújo, F., Miranda, F., Silva, A., Ramos, R. (2020). Redundancy Treatment of NGS Contigs in Microbial Genome Finishing with Hashing-Based Approach. In: Setubal, J.C., Silva, W.M. (eds) Advances in Bioinformatics and Computational Biology. BSB 2020. Lecture Notes in Computer Science(), vol 12558. Springer, Cham. https://doi.org/10.1007/978-3-030-65775-8_2
Download citation
DOI: https://doi.org/10.1007/978-3-030-65775-8_2
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65774-1
Online ISBN: 978-3-030-65775-8
eBook Packages: Computer ScienceComputer Science (R0)