BioAutoML: Democratizing Machine Learning in Life Sciences

  • Robson Parmezan Bonidia USP / UTFPR
  • André Carlos Ponce de Leon Ferreira de Carvalho USP

Abstract


Recent technological advances have driven an exponential expansion of biological sequence data and enabled the extraction of meaningful information from it through Machine Learning (ML) algorithms. This knowledge has improved the understanding of the mechanisms behind several fatal diseases, e.g., cancer and COVID-19, and has helped to develop innovative solutions such as CRISPR-based gene editing, coronavirus vaccines, and precision medicine. These advances benefit society and the economy, directly affecting people's lives in areas such as health care, drug discovery, forensic analysis, and food analysis. Nevertheless, ML approaches to biological data require representative, quantitative, and informative features. Because many ML algorithms can handle only numerical data, sequences must first be translated into feature vectors. This process, known as feature extraction, is a fundamental step in building high-quality ML-based models in bioinformatics, since it enables the feature engineering stage, in which suitable features are designed and selected. Feature engineering, ML algorithm selection, and hyperparameter tuning are often time-consuming processes that require extensive domain knowledge and are performed by a human expert. To address this problem, we developed a new package, BioAutoML, which automatically runs an end-to-end ML pipeline. BioAutoML extracts numerical and informative features from biological sequence databases and uses Automated ML (AutoML) to automate feature selection, ML algorithm recommendation, and hyperparameter tuning. Our experimental results demonstrate the robustness of our proposal across various domains, such as SARS-CoV-2, anticancer peptides, HIV sequences, and non-coding RNAs. BioAutoML has high potential to significantly reduce the expertise required to use ML pipelines, aiding researchers in combating diseases, particularly in low- and middle-income countries. This initiative can give biologists, physicians, epidemiologists, and other stakeholders an opportunity for widespread use of these techniques to enhance the health and well-being of their communities.
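
As an illustration of the feature extraction step described in the abstract, the sketch below maps a DNA sequence to a fixed-length numerical vector of k-mer relative frequencies. It is a minimal example under stated assumptions, not BioAutoML's implementation: the function name `kmer_frequencies`, the choice of k-mer frequencies as the descriptor, and the default alphabet are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGT"):
    """Return the relative frequency of every possible k-mer, in a fixed order.

    Hypothetical helper for illustration only; not part of BioAutoML.
    """
    sequence = sequence.upper()
    # Enumerate all len(alphabet)**k k-mers so every sequence
    # maps to a vector with the same layout and length.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count overlapping windows of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing symbols outside the alphabet (e.g. N) are ignored.
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# A sequence of any length becomes a 16-dimensional vector for k=2.
features = kmer_frequencies("ACGTACGT", k=2)
```

Because every sequence yields a vector of the same length regardless of its own length, the output can be passed directly to ML algorithms that accept only numerical input, which is the requirement the abstract describes.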

Published
25/06/2024
BONIDIA, Robson Parmezan; CARVALHO, André Carlos Ponce de Leon Ferreira de. BioAutoML: Democratizing Machine Learning in Life Sciences. In: PRÊMIO ARTUR ZIVIANI - CONCURSO DE TESES E DISSERTAÇÕES (DOUTORADO) - SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 24., 2024, Goiânia/GO. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 85-90. ISSN 2763-8987. DOI: https://doi.org/10.5753/sbcas_estendido.2024.2184.