Dealing with categorical missing data using CleanerR

Rafael S. Pereira; Fabio Porto

doi:10.5753/bresci.2019.10032

Abstract

Missing data is a common problem in data analysis. It appears in datasets for many reasons, from data integration to poor data entry. When faced with the problem, the analyst must decide what to do with the missing values, since it is not always advisable to discard them from the analysis. In this paper we discuss a method that uses information theory and functional dependencies to impute missing values.
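
As a rough illustration of the idea, the sketch below picks the columns that best determine a categorical target column by minimizing the conditional entropy H(target | columns) over the rows where the target is present, then fills each missing value with the most frequent target value among rows sharing the same determinant values. This is a minimal sketch under our own assumptions, not the cleanerR API: the function names (`conditional_entropy`, `best_determinants`, `impute`), the `max_size` limit, the tie-break toward smaller column sets, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import log2


def group_counts(rows, cols, target):
    """Count target values per combination of values in cols (known targets only)."""
    groups = defaultdict(Counter)
    for row in rows:
        if row[target] is not None:
            groups[tuple(row[c] for c in cols)][row[target]] += 1
    return groups


def conditional_entropy(rows, cols, target):
    """Estimate H(target | cols) in bits; 0 means cols fully determine target."""
    groups = group_counts(rows, cols, target)
    total = sum(sum(counts.values()) for counts in groups.values())
    h = 0.0
    for counts in groups.values():
        n = sum(counts.values())
        for k in counts.values():
            h -= (k / total) * log2(k / n)
    return h


def best_determinants(rows, target, max_size=2):
    """Return the column subset with the lowest H(target | subset).

    Ties go to the smaller subset (an assumption made here to avoid
    picking large subsets that trivially reach zero entropy).
    """
    candidates = [c for c in rows[0] if c != target]
    subsets = [s for k in range(1, max_size + 1)
               for s in combinations(candidates, k)]
    return min(subsets,
               key=lambda s: (conditional_entropy(rows, s, target), len(s)))


def impute(rows, target, max_size=2):
    """Fill missing target values with the mode of the matching group."""
    cols = best_determinants(rows, target, max_size)
    groups = group_counts(rows, cols, target)
    filled = []
    for row in rows:
        row = dict(row)  # copy, so the input rows are left untouched
        if row[target] is None:
            counts = groups.get(tuple(row[c] for c in cols))
            if counts:  # leave the gap if no matching group was observed
                row[target] = counts.most_common(1)[0][0]
        filled.append(row)
    return cols, filled


# Toy data (hypothetical): "city" determines "state", "plan" does not.
rows = [
    {"city": "Rio", "state": "RJ", "plan": "a"},
    {"city": "Rio", "state": "RJ", "plan": "b"},
    {"city": "Niteroi", "state": "RJ", "plan": "a"},
    {"city": "Santos", "state": "SP", "plan": "b"},
    {"city": "Santos", "state": None, "plan": "a"},
]
cols, filled = impute(rows, "state")
```

In this toy case the sketch selects `city` as the determinant and fills the missing `state` from the other Santos row. Missing values in the determinant columns are treated as their own category here, which is a simplification.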
