Prediction of Gain or Loss of Function in Missense Variants

  • Victor Maricato Oliveira UNIRIO
  • Pedro Nuno de Souza Moura UNIRIO

Abstract


Missense variants alter a single amino acid in a protein and can cause loss of function (LOF) or gain of function (GOF). Accurately classifying these effects is essential for understanding genetic diseases and for tailoring precision-medicine approaches. We therefore propose the Gain and Loss of Function Dataset (GLOF), the first expert-annotated dataset containing LOF, GOF, and Neutral variants, curated together with one of the largest genetic diagnostics laboratories in Latin America. Using embeddings from the ESM-1v model, our Random Forest model achieves state-of-the-art results without complex feature engineering or sequence alignment. We expect the GLOF dataset to stimulate further research on LOF/GOF prediction in personalized genomics.

Published
19/05/2025
OLIVEIRA, Victor Maricato; MOURA, Pedro Nuno de Souza. Prediction of Gain or Loss of Function in Missense Variants. In: CONCURSO DE TESES, DISSERTAÇÕES E TCCS EM SISTEMAS DE INFORMAÇÃO - SIMPÓSIO BRASILEIRO DE SISTEMAS DE INFORMAÇÃO (SBSI), 21., 2025, Recife/PE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 132-141. DOI: https://doi.org/10.5753/sbsi_estendido.2025.246764.