Classifying longevity profiles from longitudinal databases using Random Forest
Abstract
Population studies of human ageing often produce high-dimensional longitudinal databases. The knowledge discovery process must be adapted to the particular characteristics of these databases so that it can exploit their temporal aspect. In this work, we present the results of a knowledge discovery in databases process applied to data from the English Longitudinal Study of Ageing (ELSA), a prominent British study that follows thousands of individuals over a long period and collects information across several dimensions, such as health, socioeconomic status, and well-being. The goal of our study is to classify ELSA participants, according to the profile they present, as long-lived (individuals aged above 82.9 years) or not long-lived. To do so, we used a semi-supervised clustering approach to find groups of individuals representative of each profile, and then used these groups as the dataset for a supervised learning algorithm. The classification model performed well, and interpreting it showed that aspects from different dimensions contribute to distinguishing the two profiles.
Keywords:
data mining, knowledge discovery, random forests, supervised machine learning
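The two-stage idea in the abstract (label participants by the 82.9-year threshold, then keep only groups that represent one profile before training a classifier) can be sketched as follows. This is a minimal illustration, not the authors' method: the abstract does not say how representative groups are chosen, so the purity rule, the `min_purity` cutoff, the function names, and the toy `ages` and `cluster_ids` values are all assumptions made for the example.

```python
from collections import Counter, defaultdict

LONGEVITY_AGE = 82.9  # age threshold stated in the abstract


def label_profile(age):
    """Return 'long-lived' for ages above the threshold, else 'not long-lived'."""
    return "long-lived" if age > LONGEVITY_AGE else "not long-lived"


def select_representatives(cluster_ids, labels, min_purity=0.8):
    """Return sorted indices of records in clusters dominated by one label.

    min_purity is a hypothetical cutoff, not a value taken from the paper.
    """
    # Group record indices by the cluster they were assigned to.
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    kept = []
    for idxs in members.values():
        counts = Counter(labels[i] for i in idxs)
        majority_label, majority_count = counts.most_common(1)[0]
        if majority_count / len(idxs) >= min_purity:
            # Keep only the members that agree with the cluster's majority label.
            kept.extend(i for i in idxs if labels[i] == majority_label)
    return sorted(kept)


# Toy data: ages and cluster assignments are invented for illustration.
ages = [85.0, 90.1, 84.2, 70.3, 65.0, 68.8, 83.5, 60.2]
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2]  # assumed output of some clustering step
labels = [label_profile(a) for a in ages]
representatives = select_representatives(cluster_ids, labels)
```

In this toy run, clusters 0 and 1 are each pure and are kept, while the mixed cluster 2 is discarded. The retained records would then form the training set for a supervised learner such as a random forest.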
Published
22 October 2018
How to Cite
RIQUETI, G. A.; RIBEIRO, C. E.; ZÁRATE, L. E. Classificando perfis de longevidade de bases de dados longitudinais usando Floresta Aleatória. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 6., 2018, São Paulo/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 33-40. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2018.27382.