Matched-Pair Analysis Using Machine Learning to Predict 1-year Mortality in Newborn Twins

  • Everton Jesus UFBA
  • Lucas Calais-Ferreira UMelbourne
  • Marcos Barreto UFBA

Abstract


Twin-pair analysis is a valuable tool for assessing familial risk factors for a range of outcomes, including diseases. Machine learning models are standard, powerful tools for prediction, but they are not fully suited to twin-pair analysis because most of them do not account for the correlation between the twins in a pair. In this paper, we assess the suitability of machine learning models for predicting 1-year mortality using twin data extracted from Brazilian healthcare databases. We evaluated five models and also applied a proposed strategy for matched-pair analysis to build an alternative dataset intended to improve classification. Our results showed that (i) Gradient Boosting was the best classification model, and (ii) the matched-pair strategy did not improve the results as expected.

Published
2020-09-15
JESUS, Everton; CALAIS-FERREIRA, Lucas; BARRETO, Marcos. Matched-Pair Analysis Using Machine Learning to Predict 1-year Mortality in Newborn Twins. In: BRAZILIAN SYMPOSIUM ON COMPUTING APPLIED TO HEALTH (SBCAS), 20., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 215-225. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2020.11515.