Feature Importance Analysis of Non-coding DNA/RNA Sequences Based on Machine Learning Approaches

  • Conference paper
Advances in Bioinformatics and Computational Biology (BSB 2021)

Abstract

Non-coding sequences have gained increasing attention in bioinformatics because of the essential roles they play in different biological processes. Elucidating the function of these non-coding regions is a relevant challenge, which has been addressed by several Machine Learning (ML) studies on various classes of ncRNA, e.g., small non-coding RNAs (sRNAs) and circular RNAs (circRNAs). Identifying these biological sequences is possible through feature engineering techniques, which can help highlight the specifics of different types of ML problems. Accordingly, recent studies have focused on interpretable computational methods, i.e., on finding the best features through feature importance analysis. In this study, we therefore explore different feature descriptors and their degree of importance for the classification task, using two case studies: (1) prediction of sRNAs in bacteria and (2) prediction of circRNAs in humans. We developed a general pipeline using hybrid feature vectors with mathematical and conventional descriptors. These vectors were generated with the MathFeature package and feature selection techniques in both case studies. Finally, our experimental results show high predictive performance and the relevance of combining conventional and mathematical descriptors in different organisms.
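As a rough illustration of what a hybrid feature vector and a feature ranking can look like, the sketch below combines a conventional descriptor (dinucleotide frequencies) with a mathematical one (Shannon entropy) and ranks the features with a two-class Fisher score. This is not the authors' pipeline and does not use the MathFeature package: the function names, the choice of descriptors, the Fisher score as a stand-in for the feature importance analysis, and the four toy sequences are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGT"):
    """Conventional descriptor: relative frequency of every k-mer.

    Assumes an uppercase DNA alphabet; windows with other symbols are ignored.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(sum(counts[m] for m in kmers), 1)  # avoid division by zero
    return {f"kmer_{m}": counts[m] / total for m in kmers}


def shannon_entropy(seq, k=1):
    """Mathematical descriptor: Shannon entropy (bits) of the k-mer distribution."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def hybrid_vector(seq):
    """Concatenate conventional and mathematical descriptors into one feature dict."""
    feats = kmer_frequencies(seq, k=2)
    feats["entropy_k1"] = shannon_entropy(seq, k=1)
    feats["entropy_k2"] = shannon_entropy(seq, k=2)
    return feats


def fisher_scores(rows, labels):
    """Two-class Fisher score per feature: (mu1 - mu0)^2 / (var1 + var0)."""
    scores = {}
    for name in rows[0]:
        pos = [r[name] for r, y in zip(rows, labels) if y == 1]
        neg = [r[name] for r, y in zip(rows, labels) if y == 0]
        mu_p, mu_n = sum(pos) / len(pos), sum(neg) / len(neg)
        var_p = sum((v - mu_p) ** 2 for v in pos) / len(pos)
        var_n = sum((v - mu_n) ** 2 for v in neg) / len(neg)
        denom = var_p + var_n
        if denom > 0:
            scores[name] = (mu_p - mu_n) ** 2 / denom
        else:
            # Zero within-class variance: perfectly separating or constant feature.
            scores[name] = float("inf") if mu_p != mu_n else 0.0
    return scores


# Hypothetical toy data (label 1 = positive class, 0 = negative class).
toy = [
    ("GCGCGGCCGCGA", 1),
    ("CCGGCGCGTGCC", 1),
    ("ATATTAAATTAC", 0),
    ("TTAATATAGATA", 0),
]
rows = [hybrid_vector(seq) for seq, _ in toy]
labels = [y for _, y in toy]
scores = fisher_scores(rows, labels)

# Print the five highest-ranked features.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{name}: {score:.3f}")
```

With real data, the same feature dictionaries could be passed to a classifier and the ranking replaced by whichever importance measure the study at hand uses; the Fisher score is used here only because it needs no external dependencies.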

B. L. S. de Almeida, A. P. Queiroz and A. P. A. Santos—The authors wish it to be known that, in their opinion, the first three authors should be regarded as Joint First Authors.

Acknowledgments

The authors would like to thank ICMC-USP, UTFPR, Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and São Paulo Research Foundation (FAPESP), grant #2021/08561-8, for the financial support given to this research.

Author information

Corresponding author

Correspondence to Anderson Paulo Avila Santos.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

de Almeida, B.L.S. et al. (2021). Feature Importance Analysis of Non-coding DNA/RNA Sequences Based on Machine Learning Approaches. In: Stadler, P.F., Walter, M.E.M.T., Hernandez-Rosales, M., Brigido, M.M. (eds) Advances in Bioinformatics and Computational Biology. BSB 2021. Lecture Notes in Computer Science, vol 13063. Springer, Cham. https://doi.org/10.1007/978-3-030-91814-9_8

  • DOI: https://doi.org/10.1007/978-3-030-91814-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91813-2

  • Online ISBN: 978-3-030-91814-9

  • eBook Packages: Computer Science, Computer Science (R0)
