A New Data Modeling Approach for Alignment-free Biological Applications

Diogo Munaro Vieira; Elvismary Molina de Armas; Maria L. G. Jaramillo; Marcos Catanho; Antonio B. Miranda; Edward Hermann Haeusler; Sérgio Lifschitz

doi:10.5753/sbbd.2023.232471

Diogo Munaro Vieira Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio)
Elvismary Molina de Armas Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio) https://orcid.org/0000-0002-0606-5994
Maria L. G. Jaramillo Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio) https://orcid.org/0000-0001-6649-9738
Marcos Catanho Fundação Oswaldo Cruz (Fiocruz)
Antonio B. Miranda Fundação Oswaldo Cruz (Fiocruz)
Edward Hermann Haeusler Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio)
Sérgio Lifschitz Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio)

Abstract

Finding homologous proteins and clustering them are critically important tasks in biology, which currently relies on tools based on information from DNA or from the amino acid sequences of these proteins. These tasks require identifying evolutionary patterns that are difficult to obtain automatically with traditional methods. This work proposes a data modeling approach that leverages evolutionary patterns in homolog search, classification, and clustering tasks through an alignment-free process using image similarity algorithms. The strategy is valuable even for distant homologs and contributes to data privacy and security.

Keywords: Data Modeling, Molecular Biology, Homologous Protein, Feature Representation, Computer Vision, Human Visual System, Machine Learning Explainability, Data Masking, Data Privacy
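
The abstract does not say how a sequence becomes an image or which similarity index is used. As a rough illustration only, and not the authors' method, the sketch below assumes each sequence is encoded as a binary self-comparison dot matrix and that two equal-sized matrices are compared with a structural similarity (SSIM) index computed once over the whole image rather than with a sliding window. The function names `dot_matrix` and `global_ssim`, the encoding, and the example sequences are all hypothetical.

```python
from statistics import fmean


def dot_matrix(seq):
    """Encode a sequence as a square image: pixel (i, j) is 1.0 when
    residues i and j are identical, else 0.0. (Assumed encoding.)"""
    return [[1.0 if a == b else 0.0 for b in seq] for a in seq]


def global_ssim(img_x, img_y, dynamic_range=1.0):
    """SSIM evaluated over the whole image as a single window.

    This is a simplification: the usual SSIM averages the same
    expression over many local windows.
    """
    x = [p for row in img_x for p in row]
    y = [p for row in img_y for p in row]
    if len(x) != len(y):
        raise ValueError("images must have the same number of pixels")

    # Stabilising constants with the conventional 0.01 and 0.03 factors.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2

    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean((p - mu_x) ** 2 for p in x)
    var_y = fmean((q - mu_y) ** 2 for q in y)
    cov = fmean((p - mu_x) * (q - mu_y) for p, q in zip(x, y))

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    # Made-up sequences of equal length, for illustration only.
    query = "MKVLAAGIVG"
    close = "MKVLSAGIVG"    # one substitution relative to query
    distant = "WWPPHHCCEE"  # unrelated composition

    for name, other in [("close", close), ("distant", distant)]:
        score = global_ssim(dot_matrix(query), dot_matrix(other))
        print(f"{name}: {score:.3f}")
```

No alignment is computed at any point: each sequence is turned into an image independently, and only the images are compared. Sequences of different lengths would need their images resized to a common shape first, which this sketch leaves out.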
