Using Protein Language Models Embeddings to predict O-GlcNAc glycosylation sites

Abstract


O-GlcNAcylation is a post-translational modification (PTM) in which an N-acetylglucosamine (GlcNAc) molecule is covalently attached to serine or threonine residues of nuclear and cytoplasmic proteins. Dysregulation of PTMs has been implicated in a wide range of diseases, including cancer, metabolic syndromes, and neurodegenerative disorders. Precise mapping of O-GlcNAc sites is therefore essential both for fundamental understanding and for the development of targeted therapeutics. However, detecting these sites remains challenging, which has motivated the development of computational tools that predict them with greater accuracy. In this study, we used Protein Language Models (PLMs) to predict which protein residues are O-GlcNAc modification sites. To evaluate our method, we collected data from the O-GlcNAc Atlas. Our results indicate that our model outperformed competing methods on all datasets evaluated. We believe the approach presented here can benefit scientists working on any subject in which protein post-translational modifications play a role.
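
Because O-GlcNAc attaches only to serine or threonine, site prediction can be framed as a per-residue classification over the S/T positions of a sequence. The sketch below illustrates that framing only; it is not the pipeline used in this work. The function name `candidate_sites`, the flank size, and the padding character are hypothetical choices made for illustration. In an embedding-based setup, each candidate position would presumably be paired with the PLM embedding of that residue before being passed to a classifier.

```python
from typing import List, Tuple


def candidate_sites(sequence: str, flank: int = 7, pad: str = "X") -> List[Tuple[int, str, str]]:
    """Return (1-based position, residue, window) for every Ser/Thr.

    The window is centred on the candidate residue and padded with `pad`
    near the termini. `flank` and `pad` are illustrative assumptions,
    not values taken from the paper.
    """
    seq = sequence.upper()
    padded = pad * flank + seq + pad * flank
    sites = []
    for i, residue in enumerate(seq):
        if residue in ("S", "T"):
            # Position i in `seq` sits at index i + flank in `padded`,
            # so the centred window starts at index i.
            window = padded[i : i + 2 * flank + 1]
            sites.append((i + 1, residue, window))
    return sites


# Toy sequence, not a real protein.
example = candidate_sites("MSTAK", flank=2)
```

Each returned tuple identifies one residue that a predictor would label as modified or unmodified; residues other than S and T are never candidates.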

Keywords: O-GlcNAcylation, Machine Learning, Protein Language Models, Embeddings

Published
29/09/2025
ARCANJO, Adenilson; MARIANO, Diego; BASTOS, Luana L.; BASTOS, Ana L. A.; PIROVANI, Milenna; MELO-MINARDI, Raquel C. de. Using Protein Language Models Embeddings to predict O-GlcNAc glycosylation sites. In: SIMPÓSIO BRASILEIRO DE BIOINFORMÁTICA (BSB), 18., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 198-209. ISSN 2316-1248. DOI: https://doi.org/10.5753/bsb.2025.15151.