A Correlation-Based Method for Estimating Breast Cancer Gene Expression Values
Abstract
Gene expression data often contain missing values for a variety of experimental reasons. In breast cancer datasets, missing data can severely compromise downstream analysis and subtyping, so addressing the problem is essential. Several approaches for estimating missing values in gene expression data have been developed, but the task remains difficult because of factors such as the presence or absence of a correlation structure in the data and its high dimensionality (number of genes × number of samples). In this work, we develop a method for handling missing values in breast cancer gene expression data. The method addresses high dimensionality by selecting the genes that best characterize breast cancer, using correlation information between genes. It was evaluated using the root mean squared error (RMSE) and mean absolute error (MAE) metrics.
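To make the general idea concrete, the following is a minimal sketch of correlation-guided imputation scored with RMSE and MAE. It is an illustration under stated assumptions, not the method proposed in this work: the function names, the choice of a least-squares fit on the `k` most correlated genes, and the parameter `k` itself are all hypothetical.

```python
import numpy as np

def impute_by_correlation(X, k=3):
    """Fill NaNs in X (samples x genes) from the k most correlated genes.

    Hypothetical illustration only; not the method described in the paper.
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    # Column means over observed entries serve as a fallback estimate.
    col_means = np.nanmean(X, axis=0)
    # A mean-filled copy is used to compute gene-gene correlations
    # and to supply predictor values.
    X_mean = np.where(np.isnan(X), col_means, X)
    corr = np.nan_to_num(np.corrcoef(X_mean, rowvar=False))
    np.fill_diagonal(corr, 0.0)  # a gene must not be its own predictor
    for g in range(X.shape[1]):
        missing = np.isnan(X[:, g])
        if not missing.any():
            continue
        observed = ~missing
        if observed.sum() <= k:
            # Too few observed samples to fit a model: fall back to the mean.
            filled[missing, g] = col_means[g]
            continue
        # Select the k genes with the largest absolute correlation to gene g.
        top = np.argsort(-np.abs(corr[g]))[:k]
        # Least-squares fit of gene g on the selected genes plus an intercept.
        A = np.column_stack([X_mean[observed][:, top], np.ones(observed.sum())])
        coef, *_ = np.linalg.lstsq(A, X[observed, g], rcond=None)
        B = np.column_stack([X_mean[missing][:, top], np.ones(missing.sum())])
        filled[missing, g] = B @ coef
    return filled

def rmse(y_true, y_pred):
    """Root mean squared error between true and estimated values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between true and estimated values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(diff)))
```

A typical evaluation under this sketch would hide a set of known entries, call `impute_by_correlation` on the masked matrix, and pass the true and estimated values at the hidden positions to `rmse` and `mae`.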
