BioPrediction: Democratizing Machine Learning in the Study of Molecular Interactions

Bruno Rafael Florentino; Natan Henrique Sanches; Robson Parmezan Bonidia; André C. P. L. F. de Carvalho

doi:10.5753/eniac.2023.234271

Bruno Rafael Florentino, Universidade de São Paulo
Natan Henrique Sanches, Universidade de São Paulo
Robson Parmezan Bonidia, Universidade de São Paulo
André C. P. L. F. de Carvalho, Universidade de São Paulo

Abstract

The growing number of biological sequences stored in databases is a large source of information that can benefit sectors such as agriculture and health. Machine Learning (ML) algorithms can extract new and useful information from these data, bringing social and economic benefits as well as productivity gains. However, the categorical and unstructured nature of biological sequences makes this process difficult and demands ML expertise. In this paper, we propose and experimentally evaluate BioPrediction, an end-to-end automated ML framework that identifies implicit interactions between sequences, e.g., long non-coding RNA and protein pairs, without requiring expertise in every stage of the ML pipeline. Our experimental results show that the proposed framework can induce ML models with high predictive accuracy, between 77% and 91%, which is competitive with state-of-the-art tools.

Keywords: Machine Learning, Bioinformatics, Molecular Interactions, Democratizing Machine Learning, Biological Sequences
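
The abstract notes that biological sequences are categorical and unstructured, so they must be turned into numeric vectors before an ML model can use them. The sketch below illustrates one common way to do this for a sequence pair, using normalized k-mer frequencies. It is only an illustration under stated assumptions, not the feature extraction used by BioPrediction: the function names `kmer_frequencies` and `encode_pair`, the choice of k, and the example sequences are all hypothetical.

```python
from collections import Counter
from itertools import product

# Assumed alphabets: the 4 RNA bases and the 20 standard amino acids.
RNA_ALPHABET = "ACGU"
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, alphabet, k):
    """Return normalized k-mer frequencies in a fixed order over `alphabet`."""
    # Enumerate every possible k-mer so the vector length is always len(alphabet) ** k.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count overlapping windows of length k in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing symbols outside the alphabet are ignored.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


def encode_pair(rna, protein, rna_k=2, protein_k=1):
    """Concatenate the k-mer vectors of an RNA and a protein into one feature vector."""
    rna_vec = kmer_frequencies(rna.upper(), RNA_ALPHABET, rna_k)
    protein_vec = kmer_frequencies(protein.upper(), PROTEIN_ALPHABET, protein_k)
    return rna_vec + protein_vec


# Hypothetical example sequences, chosen only to exercise the function.
features = encode_pair("AUGGCUACGUAG", "MKTAYIAKQR")
print(len(features))  # 4**2 + 20 = 36 features
```

A fixed-length vector like this can be passed to any standard classifier; an automated framework would additionally choose the descriptors, the model, and its hyperparameters on the user's behalf.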
