Specific Substring Problem: an application in bioinformatics

Lucas B. Rocha; Said Sadique Adi; Eloi Araujo

doi:10.5753/bsb_estendido.2018.8801

Lucas B. Rocha UFMS
Said Sadique Adi UFMS
Eloi Araujo UFMS

DOI: https://doi.org/10.5753/bsb_estendido.2018.8801

Resumo

Given two sets of sequences A and B, the Substring Specific problem is to find all minimum substrings in A having distance at least k for each subsequence in B. This work addresses three new implementations for the Maaß algorithm when the Hamming distance is considered: a naive cubic-time algorithm and two quadratic-time algorithms. We run tests to compare the running time of these implementations and another recently described algorithm implementation that uses the edit distance. In addition, we conducted preliminary testing on a large Tara Ocean database, looking for efficient and effective strategies for finding unique sequences in a set of sequences comparing with the other.

Palavras-chave: Sequence Analysis, Motifs, and Pattern Matching

Referências

Bork, P., Bowler, C., de Vargas, C., Gorsky, G., Karsenti, E., and Wincker, P. (2015). Tara oceans studies plankton at planetary scale. Science, 348(6237):873–873.

Dobre, J. A. (2017). O problema da seleção de segmentos específicos: algoritmos e aplicações. Universidade Federal de Mato Grosso do Sul. Master’s thesis.

Feuser, L. and Moreano, N. (2018). Parallel solutions to the k-difference primer problem. In International Conference on Computational Science, pages 506–523. Springer.

Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012). Cd-hit: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23):3150–3152.

Gusfield, D. (1997). Algorithms on strings, trees and sequences: Computer science and computational biology, 1st edition. New York, United States, page 534.

Lanctot, J. K., Li, M., Ma, B., Wang, S., and Zhang, L. (2003). Distinguishing string selection problems. Information and Computation, 185(1):41–55.

Li, W. and Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659.

Maaß, M. G. (2003). A fast algorithm for the inexact characteristic string problem. Technical Report TUM-I0312, Fakultät für Informatik, TUM München.

Montera, L. and Nicoletti, M. C. (2008). The PCR primer design as a metaheuristic search process. In International Conference on Artificial Intelligence and Soft Computing, pages 963–973. Springer.

Pesant, S., Not, F., Picheral, M., Kandels-Lewis, S., Le Bescot, N., Gorsky, G., Iudicone, D., Karsenti, E., Speich, S., Trouble, R., et al. (2015). Open science resources for the discovery and analysis of tara oceans data. Scientific data, 2:150023.

Shigefuku, R., Watanabe, T., Kanno, Y., Ikeda, H., Nakano, H., Hattori, N., Matsunaga, K., Matsumoto, N., Kanno, S.-i., Nosho, K., et al. (2017). Fusobacterium nucleatum detected simultaneously in a pyogenic liver abscess and advanced sigmoid colon cancer. Anaerobe, 48:144–146.

Zitvogel, L., Ma, Y., Raoult, D., Kroemer, G., and Gajewski, T. F. (2018). The microbiome in cancer immunotherapy: Diagnostic tools and therapeutic strategies. Science, 359(6382):1366–1370.