Classifying longevity profiles of longitudinal databases using Random Forest
Abstract
Population studies on human aging often generate high-dimensional longitudinal databases. The knowledge discovery process must be adapted to the special characteristics of these databases in order to benefit from their temporal aspect. In this work, we present the results of a knowledge discovery in databases process applied to data from the English Longitudinal Study of Ageing (ELSA), a prominent British study that follows thousands of individuals over a long period of time and collects information across several dimensions, such as health, socioeconomic status, and well-being. The purpose of our study is to classify ELSA participants, according to the profile they present, as long-lived (individuals over 82.9 years of age) or non-long-lived. To this end, we used a semi-supervised clustering approach to find clusters of profile representatives, and then used these clusters as the database for a supervised learning algorithm. The resulting classification model performed well, and its interpretation showed that aspects from different dimensions influence the differentiation between the two profiles.
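The class definition stated above can be illustrated with a minimal sketch. Only the 82.9-year threshold comes from the abstract; the `Participant` record, its field names, and the helper functions are hypothetical and are not taken from the ELSA data or from the authors' pipeline. The sketch covers only the labeling rule, not the clustering or Random Forest stages.

```python
from dataclasses import dataclass

# Threshold stated in the abstract: individuals over 82.9 years are long-lived.
LONG_LIVED_AGE = 82.9


@dataclass
class Participant:
    # Hypothetical record layout, used only for this illustration.
    participant_id: str
    age: float  # age in years


def longevity_label(age: float, threshold: float = LONG_LIVED_AGE) -> str:
    """Return the class label for an age, using a strict 'over threshold' rule."""
    return "long-lived" if age > threshold else "non-long-lived"


def label_participants(participants):
    """Map each participant id to its longevity class."""
    return {p.participant_id: longevity_label(p.age) for p in participants}


# Made-up sample values, chosen to show both classes and the boundary case.
sample = [
    Participant("p1", 85.0),
    Participant("p2", 82.9),
    Participant("p3", 70.5),
]
labels = label_participants(sample)
```

Under this reading of "over 82.9 years", a participant aged exactly 82.9 falls in the non-long-lived class; whether the study treats the boundary this way is an assumption.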
Keywords:
data mining, knowledge discovery, random forests, supervised machine learning
Published
2018-10-22
How to Cite
RIQUETI, G. A.; RIBEIRO, C. E.; ZÁRATE, L. E. Classifying longevity profiles of longitudinal databases using Random Forest. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 6., 2018, São Paulo/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 33-40. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2018.27382.
