Protein Function Annotation Using Machine Learning and Local Alignment
Abstract
Advances in sequencing technologies have determined the sequences of millions of proteins, while experimental annotation of their functions remains limited. Computational protein function prediction has therefore become a central problem in bioinformatics, framed as a large-scale hierarchical multi-label classification task. In this thesis, we propose two machine learning methods that use embeddings from Transformer models, as well as two ensemble approaches that integrate these predictions with local sequence alignment. Evaluated on a dataset derived from CAFA5, the main benchmark in the field, the proposed methods consistently outperformed the leading approaches in the literature, establishing a new state of the art for predicting protein functions exclusively from the amino acid sequence. We also present memory-optimized versions and a public Web server for use by the scientific community.
References
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs. Nucleic Acids Research, 25(17):3389–3402.
Buchfink, B., Reuter, K., and Drost, H.-G. (2021). Sensitive Protein Alignments at Tree-of-Life Scale using DIAMOND. Nature Methods, 18(4):366–368.
Cao, Y. and Shen, Y. (2021). TALE: Transformer-based Protein Function Annotation with Joint Sequence–Label Embedding. Bioinformatics, 37(18):2825–2833.
Chua, Z. M., Rajesh, A., Sinha, S., and Adams, P. D. (2024). PROTGOAT: Improved Automated Protein Function Predictions Using Protein Language Models. bioRxiv, pages 1–15.
Gene Ontology Consortium (2004). The Gene Ontology (GO) Database and Informatics Resource. Nucleic Acids Research, 32(suppl 1):D258–D261.
Dobson, C. M. (1999). Protein Misfolding, Evolution and Disease. Trends in Biochemical Sciences, 24(9):329–332.
Elnaggar, A., Essam, H., Salah-Eldin, W., Moustafa, W., Elkerdawy, M., Rochereau, C., and Rost, B. (2023). Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling. arXiv:2301.06568, pages 1–29.
Elnaggar, A., Heinzinger, M., Dallago, C., Rehawi, G., Wang, Y., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., Bhowmik, D., and Rost, B. (2021). ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7112–7127.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE.
Kulmanov, M. and Hoehndorf, R. (2019). DeepGOPlus: Improved Protein Function Prediction from Sequence. Bioinformatics, 36(2):422–429.
Kulmanov, M., Khan, M. A., and Hoehndorf, R. (2018). DeepGO: Predicting Protein Functions from Sequence and Interactions using a Deep Ontology-Aware Classifier. Bioinformatics, 34(4):660–668.
Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., Costa, A. d. S., Fazel-Zarandi, M., Sercu, T., Candido, S., and Rives, A. (2023). Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. Science, 379(6637):1123–1130.
Liu, Q., Zhang, C., and Freddolino, L. (2024). InterLabelGO+: Unraveling Label Correlations in Protein Function Prediction. Bioinformatics, 40(11):btae655.
Oliveira, G. B., Pedrini, H., and Dias, Z. (2023). TEMPROT: Protein Function Annotation using Transformers Embeddings and Homology Search. BMC Bioinformatics, 24(1):1–16.
Oliveira, G. B., Pedrini, H., and Dias, Z. (2024). Integrating Transformers and AutoML for Protein Function Prediction. In 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 1–5. IEEE.
Radivojac, P. (2013). A (Not So) Quick Introduction to Protein Function Prediction. Indiana University, USA.
Ranjan, A., Fernández-Baca, D., Tripathi, S., and Deepak, A. (2021). An Ensemble Tf-Idf Based Approach to Protein Function Prediction via Sequence Segmentation. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 19(5):2685–2696.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is All You Need. In 30th Advances in Neural Information Processing Systems (NeurIPS), pages 5998–6008.
Xia, W., Zheng, L., Fang, J., Li, F., Zhou, Y., Zeng, Z., Zhang, B., Li, Z., Li, H., and Zhu, F. (2022). PFmulDL: A Novel Strategy Enabling Multi-Class and Multi-Label Protein Function Annotation by Integrating Diverse Deep Learning Methods. Computers in Biology and Medicine, 145:105465.
Zhapa-Camacho, F., Tang, Z., Kulmanov, M., and Hoehndorf, R. (2024). Predicting Protein Functions using Positive-Unlabeled Ranking with Ontology-Based Priors. bioRxiv, pages 1–9.
Zhu, Y.-H., Zhang, C., Yu, D.-J., and Zhang, Y. (2022). Integrating Unsupervised Language Model with Triplet Neural Networks for Protein Gene Ontology Prediction. PLoS Computational Biology, 18(12):e1010793.
Published
19/07/2026
How to Cite
OLIVEIRA, Gabriel Bianchin de; PEDRINI, Hélio; DIAS, Zanoni. Anotações de Funções de Proteínas Utilizando Aprendizado de Máquina e Alinhamento Local. In: CONCURSO DE TESES E DISSERTAÇÕES DA SBC (CTD-SBC), 39., 2026, Gramado/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 11-20. ISSN 2763-8820. DOI: https://doi.org/10.5753/ctd.2026.19540.
