BioPrediction: Democratizing Machine Learning in the Study of Molecular Interactions
Abstract
With the growing number of biological sequences stored in databases, there is a rich source of information that can benefit many sectors, such as agriculture and health. Machine Learning (ML) algorithms can extract new and useful information from these sequences, yielding benefits and productivity gains. However, the categorical and unstructured nature of the sequences makes this process difficult and requires specialized knowledge. This work proposes an end-to-end framework based on automated ML, called BioPrediction, capable of identifying implicit interactions between sequences, for example pairs of long non-coding RNA and proteins, without requiring specialized ML knowledge at any stage of the pipeline. As a result, a robust model was obtained, with balanced accuracy between 77% and 91% on the datasets used for validation and with results competitive with state-of-the-art tools.
