DEP: Dual-Path Embeddings for Protein Toxicity Classification
Abstract
Protein toxicity predictors that depend on computationally expensive 3D structures or multiple sequence alignments (MSAs) are difficult to apply to large-scale screening. To overcome this limitation, we propose a lightweight dual-path architecture that operates exclusively on 1D sequence embeddings. The model fuses representations from a Local-Hierarchical Path, designed to capture functional motifs, with those from a Global-Holistic Path, which models long-range dependencies. Evaluated on a benchmark toxicity dataset, the model sets a new state of the art with an AUC-ROC of 0.966, surpassing more complex models that require structural inputs. These results show that a well-designed, sequence-only approach can be a faster, more scalable, and better-performing alternative to structure-based methods.
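The dual-path idea can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The abstract does not specify the layers, so everything here is an assumption made for illustration: the function names, the window size, stride and number of levels, the use of window averaging followed by max pooling for the local path, attention pooling for the global path, and fusion by concatenation. The weights are random and untrained, and a real model would be built in a deep learning framework on top of pretrained protein language model embeddings.

```python
import math
import random


def window_mean(seq, size, stride):
    """Average strided windows of per-residue vectors (seq must be non-empty)."""
    if len(seq) <= size:
        windows = [seq]
    else:
        windows = [seq[i:i + size] for i in range(0, len(seq) - size + 1, stride)]
    return [[sum(col) / len(w) for col in zip(*w)] for w in windows]


def local_path(seq, size=4, stride=2, levels=2):
    """Hypothetical local-hierarchical path: repeated window pooling, then max."""
    for _ in range(levels):
        seq = window_mean(seq, size, stride)
    # Max over the remaining positions keeps the strongest local response.
    return [max(col) for col in zip(*seq)]


def global_path(seq, query):
    """Hypothetical global-holistic path: attention pooling over all residues."""
    scale = math.sqrt(len(query))
    scores = [sum(x * q for x, q in zip(vec, query)) / scale for vec in seq]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * vec[j] for w, vec in zip(weights, seq)) / total
            for j in range(len(query))]


def predict_toxicity(seq, query, head_weights, head_bias=0.0):
    """Fuse both paths by concatenation and apply a logistic head."""
    fused = local_path(seq) + global_path(seq, query)
    logit = sum(w * f for w, f in zip(head_weights, fused)) + head_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy input: 30 residues with 8-dimensional embeddings (random stand-ins).
rng = random.Random(0)
dim, length = 8, 30
embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(length)]
query = [rng.gauss(0, 1) for _ in range(dim)]
head_weights = [rng.gauss(0, 0.1) for _ in range(2 * dim)]
prob = predict_toxicity(embeddings, query, head_weights)
print(f"toxicity probability (untrained weights): {prob:.3f}")
```

In this sketch the local path only ever sees short windows of neighbouring residues, while the global path weights every position in the sequence at once. Concatenating the two pooled vectors gives the classifier access to both views without any structural or alignment input.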
