Dealing with categorical missing data using CleanerR

  • Rafael Pereira LNCC
  • Fabio Porto LNCC

Abstract


Missing data is a common problem in data analysis. It appears in datasets for many reasons, from data integration to poor data input. When faced with this problem, the analyst must decide what to do with the missing data, since it is not always advisable to discard the affected records from the analysis. In this paper we discuss a method that uses information theory and functional dependencies to impute missing categorical values.
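
To make the idea concrete, the following is a minimal sketch of one way such an imputation could work; it is not the cleanerR API and not necessarily the authors' algorithm. It assumes that the "information theory" step can be approximated by choosing the set of columns with the lowest conditional entropy for the target column (an approximate functional dependency), and that a missing value is then filled with the most frequent target value seen for the same combination of those columns. The names `conditional_entropy`, `best_determinant`, `impute`, and the toy rows are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import log2


def conditional_entropy(rows, given, target):
    """Estimate H(target | given) in bits from rows with no missing value
    in the target column or in any of the `given` columns."""
    groups = defaultdict(Counter)
    for row in rows:
        if row[target] is None or any(row[c] is None for c in given):
            continue
        groups[tuple(row[c] for c in given)][row[target]] += 1
    total = sum(sum(counts.values()) for counts in groups.values())
    if total == 0:
        return float("inf")  # no complete rows: this column set is unusable
    entropy = 0.0
    for counts in groups.values():
        n = sum(counts.values())
        for k in counts.values():
            entropy -= (k / total) * log2(k / n)
    return entropy


def best_determinant(rows, target, max_size=2):
    """Return the column set (up to max_size columns) that minimises
    H(target | columns). Ties go to the smaller, earlier set.
    Caveat (assumption of this sketch): a near-unique column such as an ID
    would trivially reach zero entropy and should be excluded beforehand."""
    columns = [c for c in rows[0] if c != target]
    candidates = [
        combo
        for size in range(1, max_size + 1)
        for combo in combinations(columns, size)
    ]
    return min(candidates, key=lambda combo: conditional_entropy(rows, combo, target))


def impute(rows, target, max_size=2):
    """Return a copy of rows in which missing target values are replaced by
    the most frequent value observed for the same determinant values."""
    given = best_determinant(rows, target, max_size)
    modes = defaultdict(Counter)
    for row in rows:
        if row[target] is not None and all(row[c] is not None for c in given):
            modes[tuple(row[c] for c in given)][row[target]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)
        key = tuple(row[c] for c in given)
        if row[target] is None and key in modes:
            new_row[target] = modes[key].most_common(1)[0][0]
        filled.append(new_row)
    return filled


# Toy data (made up for illustration): city determines state, color does not.
rows = [
    {"city": "Manaus", "state": "AM", "color": "red"},
    {"city": "Manaus", "state": "AM", "color": "blue"},
    {"city": "Recife", "state": "PE", "color": "red"},
    {"city": "Recife", "state": "PE", "color": "blue"},
    {"city": "Manaus", "state": None, "color": "red"},
]

filled = impute(rows, "state")
```

In this toy table the `city` column fully determines `state`, so the sketch selects it and fills the missing entry from the other rows for the same city. Rows whose determinant combination was never observed with a known target are left missing rather than guessed.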

Keywords: categorical data, data imputation

Published
24/06/2019
How to Cite

PEREIRA, Rafael; PORTO, Fabio. Dealing with categorical missing data using CleanerR. In: BRAZILIAN E-SCIENCE WORKSHOP (BRESCI), 13., 2019, Belém. Anais do XIII Brazilian e-Science Workshop. Porto Alegre: Sociedade Brasileira de Computação, June 2019. p. 49-55.