Study on the Complexity of Omics Data: An Analysis for Cancer Survival Prediction

Carlos Daniel Andrade; Thomas Fontanari; Mariana Recamonde-Mendoza

Carlos Daniel Andrade Universidade Federal do Rio Grande do Sul (UFRGS) https://orcid.org/0000-0002-7760-9303
Thomas Fontanari Universidade Federal do Rio Grande do Sul (UFRGS) / Hospital de Clínicas de Porto Alegre (HCPA) http://orcid.org/0000-0002-9054-9093
Mariana Recamonde-Mendoza Universidade Federal do Rio Grande do Sul (UFRGS) / Hospital de Clínicas de Porto Alegre (HCPA) http://orcid.org/0000-0003-2800-1032

Abstract

Machine learning applied to omics datasets has been an important tool in cancer research since the advent of high-throughput technologies. However, despite their information richness, these datasets present an intrinsic data complexity that may hinder model development. This work therefore studies the characteristics of different omics data types commonly employed for clinical predictive analysis, using a broad set of data complexity measures tailored for imbalanced domains. We focus on the task of cancer survival prediction in eight tumor types based on four types of omics data (copy number variation, gene expression, microRNA expression, and DNA methylation) and on their combination (the multi-omics approach). We found that F1-MaxDr, F3 partial, F4 partial, and N3 partial could be used as predictors of performance in this scenario. Furthermore, our experiments suggested that the studied omics data types, including the multi-omics approach, are strongly correlated in terms of data complexity. The cancer types also appeared to be highly correlated with one another, except for Adrenocortical Carcinoma (ACC), which showed significantly lower complexity than the others in the analyzed data.

Keywords: Complexity measures, Omics data, Multi-omics, Cancer