Evaluating the Influence of Missing Data on Classification Algorithms in Data Mining Applications
Abstract
This paper analyzes how missing data affects traditional classification algorithms in data mining applications. For this purpose, we use ten UCI datasets and manipulate them to hold controlled levels of missing data. Our empirical analysis shows that classification performance decreases in all tested datasets once a significant amount of missing values has been inserted. Among the analyzed algorithms, Naïve Bayes is the least influenced by missing data, followed by SMO. IBK is the most influenced, presenting the lowest accuracy, predominantly in datasets whose independent variables are continuous.
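As an illustration of what "controlled levels of missing data" can mean in practice, the following minimal sketch removes a fixed fraction of attribute values from a dataset. It rests on assumptions the abstract does not confirm: values are removed completely at random, the class attribute is never removed, and the names `inject_missing` and `missing_fraction` are hypothetical helpers introduced only for this example.

```python
import random


def inject_missing(rows, rate, class_index=-1, seed=0):
    """Return a copy of `rows` in which a fraction `rate` of the
    non-class cells has been replaced by None.

    Assumption: cells are chosen uniformly at random; the abstract
    does not state which missingness mechanism was used.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    copy = [list(row) for row in rows]
    if not copy:
        return copy
    n_cols = len(copy[0])
    class_col = class_index % n_cols
    # Candidate cells: every (row, column) pair except the class column.
    cells = [(r, c)
             for r in range(len(copy))
             for c in range(n_cols)
             if c != class_col]
    n_missing = round(rate * len(cells))
    # Sampling without replacement gives an exact count of missing cells.
    for r, c in rng.sample(cells, n_missing):
        copy[r][c] = None
    return copy


def missing_fraction(rows, class_index=-1):
    """Fraction of non-class cells that are None."""
    if not rows:
        return 0.0
    n_cols = len(rows[0])
    class_col = class_index % n_cols
    values = [v for row in rows
              for c, v in enumerate(row) if c != class_col]
    return sum(v is None for v in values) / len(values)


# Toy dataset: four numeric attributes followed by a class label.
data = [[i, i * 2, i * 3, i * 4, "yes"] for i in range(10)]
for level in (0.0, 0.25, 0.5):
    damaged = inject_missing(data, level, seed=42)
    print(level, missing_fraction(damaged))
```

Each damaged copy could then be passed to the classifiers under comparison, so that accuracy is measured as a function of the missing-data level. Fixing the seed keeps the removed cells reproducible across runs.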