Abstract
Machine learning applied to omics datasets has been an important tool in cancer research since the advent of high-throughput technologies. However, despite their information richness, these datasets have an intrinsic data complexity that may hinder model development. This work therefore studies the characteristics of different omics data types commonly employed for clinical predictive analysis, using a broad set of data complexity measures tailored for imbalanced domains. We focus on the task of cancer survival prediction in eight tumor types based on four types of omics data (i.e., copy number variation, gene expression, microRNA expression, and DNA methylation) and their combination (i.e., a multi-omics approach). We found that F1-MaxDr, F3_partial, F4_partial, and N3_partial could be used as predictors of performance in this scenario. Furthermore, our experiments suggested that the studied omics data types, including the multi-omics approach, are strongly correlated in terms of data complexity. All eight cancer types appeared to be highly correlated with each other, except for Adrenocortical Carcinoma (ACC), which showed a significantly lower complexity than the others in the analyzed data.
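As a rough illustration of what a feature-based complexity measure quantifies, the sketch below computes the classic maximum Fisher discriminant ratio for a two-class dataset and the derived F1 score in one common formulation, 1 / (1 + max ratio), where lower values indicate an easier problem. This is an assumption-laden example and not the authors' implementation: the function names are hypothetical, and the imbalance-tailored variants named in the abstract are not reproduced here.

```python
import numpy as np


def max_fisher_ratio(X, y):
    """Largest per-feature Fisher discriminant ratio for a binary problem.

    Per feature: (mean_a - mean_b)^2 / (var_a + var_b), using population variance.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    if classes.size != 2:
        raise ValueError("this sketch handles exactly two classes")

    a, b = X[y == classes[0]], X[y == classes[1]]
    num = (a.mean(axis=0) - b.mean(axis=0)) ** 2
    den = a.var(axis=0) + b.var(axis=0)

    # Zero within-class variance: the ratio is infinite if the class means
    # differ (perfect separation on that feature) and 0 if they coincide.
    safe_den = np.where(den > 0, den, 1.0)
    ratios = np.where(den > 0, num / safe_den, np.where(num > 0, np.inf, 0.0))
    return float(ratios.max())


def f1_complexity(X, y):
    """F1 score in the 1 / (1 + max ratio) form; values lie in [0, 1]."""
    return 1.0 / (1.0 + max_fisher_ratio(X, y))
```

Applied to a samples-by-features matrix and a binary survival label, a value close to 0 would suggest that at least one feature separates the classes well, while a value close to 1 would suggest heavy class overlap on every individual feature.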
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and by grants from the Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul (FAPERGS) [21/2551-0002052-0] and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) [308075/2021-8].
Notes
The raw results of our experiments can be found in the project GitHub repository: https://github.com/carlosdanielandrade/complexity-of-omics-data-in-cancer.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Andrade, C.D., Fontanari, T., Recamonde-Mendoza, M. (2022). Study on the Complexity of Omics Data: An Analysis for Cancer Survival Prediction. In: Scherer, N.M., de Melo-Minardi, R.C. (eds) Advances in Bioinformatics and Computational Biology. BSB 2022. Lecture Notes in Computer Science, vol 13523. Springer, Cham. https://doi.org/10.1007/978-3-031-21175-1_6
DOI: https://doi.org/10.1007/978-3-031-21175-1_6
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21174-4
Online ISBN: 978-3-031-21175-1
eBook Packages: Computer Science, Computer Science (R0)