Supporting the imputation process with machine learning techniques
Abstract
Imputing missing data is a significant challenge for data scientists, and techniques that improve the quality of the imputed data are essential. Both machine learning techniques and variations of the classical imputation process can contribute to that quality. This article evaluates the impact of using the k-nearest neighbors (k-NN) algorithm, compared with using the mean, in the global imputation process. It also explores the hot-deck imputation technique combined with the k-Means clustering algorithm and with k-NN imputation. The results show a reduction in absolute error in simulations on three databases with different characteristics.
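The comparison between mean imputation and k-NN imputation can be illustrated with a minimal sketch. This is not the article's implementation: the toy data, the function names (`mean_impute`, `knn_impute`, `mean_abs_error`), the choice of k = 2, and the use of Euclidean distance are all assumptions made for illustration. The sketch hides two known values, imputes them with each method, and measures the mean absolute error against the hidden truth.

```python
import math

def mean_impute(rows, col):
    """Replace None in column `col` with the mean of the observed values."""
    observed = [r[col] for r in rows if r[col] is not None]
    fill = sum(observed) / len(observed)
    return [r[col] if r[col] is not None else fill for r in rows]

def knn_impute(rows, col, k=2):
    """Replace None in column `col` with the mean of that column over the
    k nearest complete rows (Euclidean distance on the other columns).
    Assumes the other columns have no missing values."""
    donors = [r for r in rows if r[col] is not None]

    def dist(a, b):
        return math.sqrt(sum((a[j] - b[j]) ** 2
                             for j in range(len(a)) if j != col))

    filled = []
    for r in rows:
        if r[col] is not None:
            filled.append(r[col])
        else:
            nearest = sorted(donors, key=lambda d: dist(r, d))[:k]
            filled.append(sum(d[col] for d in nearest) / len(nearest))
    return filled

def mean_abs_error(imputed, truth, missing_idx):
    """Mean absolute error over the positions that were hidden."""
    return sum(abs(imputed[i] - truth[i]) for i in missing_idx) / len(missing_idx)

# Hypothetical toy data (two loose groups), not one of the article's databases.
complete = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.0],
            [5.0, 50.0], [5.1, 52.0], [4.9, 48.0]]
missing_idx = [2, 5]   # rows whose second attribute is hidden
target_col = 1

# Simulate missingness by hiding known values, so the error can be measured.
incomplete = [list(r) for r in complete]
for i in missing_idx:
    incomplete[i][target_col] = None

truth = [r[target_col] for r in complete]
mae_mean = mean_abs_error(mean_impute(incomplete, target_col), truth, missing_idx)
mae_knn = mean_abs_error(knn_impute(incomplete, target_col, k=2), truth, missing_idx)

print("mean imputation MAE:", mae_mean)
print("k-NN imputation MAE:", mae_knn)
```

On data with this kind of group structure, the global mean falls between the groups, while k-NN draws on similar records only. Whether the same holds on real data depends on the data set and on the choice of k.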
