Classification of breast cancer subtypes: A study based on representative genes




Breast Cancer, Gene Expression, Subtypes Classification


Breast cancer is the second most common cancer type and the leading cause of cancer-related deaths among women worldwide. Because it is a heterogeneous disease, identifying the subtype plays an important role in choosing a specific treatment. In this work, we propose an evaluation framework that applies different machine learning techniques to a broader analysis of the PAM50 gene list in the classification of breast cancer subtypes. The experiments show that the best-performing method for classifying breast cancer subtypes is a support vector machine (SVM) with a linear kernel, which achieved an F1 score of 0.98 for the Basal subtype and 0.90 for the HER2 subtype, the two subtypes with the worst prognosis. We also present a gene analysis of the classification methods using SHAP values, which identifies the genes that are important for the classification of each subtype.
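The per-subtype F1 scores reported above can be read as one-vs-rest scores, one per class. The following is a minimal sketch of that computation, not the authors' evaluation code: the function name `per_class_f1` and the toy labels are hypothetical, and the subtype names are used only for illustration.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Hypothetical toy labels, NOT data from the study.
y_true = ["Basal", "Basal", "HER2", "HER2", "LumA", "LumA", "LumB", "LumB"]
y_pred = ["Basal", "Basal", "HER2", "LumB", "LumA", "LumA", "LumB", "LumB"]

scores = per_class_f1(y_true, y_pred)
for subtype, f1 in scores.items():
    print(f"{subtype}: F1 = {f1:.2f}")
```

Reporting F1 per subtype rather than a single accuracy value keeps a rare class from being hidden by the more frequent ones, which is presumably why the abstract quotes the Basal and HER2 scores separately.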








How to Cite

Mendonca-Neto, R., Reis, J., Okimoto, L., Fenyö, D., Silva, C., Nakamura, F., & Nakamura, E. (2022). Classification of breast cancer subtypes: A study based on representative genes. Journal of the Brazilian Computer Society, 28(1), 59–68.