Noise detection in classification problems

Luís P. F. Garcia; Ana C. Lorena; André C. P. L. F. de Carvalho

doi:10.5753/ctd.2017.3469

Luís P. F. Garcia, University of Leipzig / USP
Ana C. Lorena, UNIFESP
André C. P. L. F. de Carvalho, USP

Abstract

Large volumes of data are produced in many application domains. When data quality is low, however, the performance of Machine Learning techniques suffers. Real data are frequently affected by noise, and training Machine Learning techniques for predictive tasks on noisy data can result in complex models with long induction times and low predictive performance. Identifying and removing noise can improve data quality and, as a result, the induced model. This thesis proposes new techniques for noise detection and develops a recommendation system, based on meta-learning, that recommends the most suitable noise filter for a new task. Experiments using artificial and real datasets show the relevance of this research.
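
To make the idea of identifying and removing noise concrete, the following is a minimal sketch of a generic label noise filter. It is not one of the techniques proposed in the thesis. It assumes a simple k-nearest-neighbour voting rule: an example is flagged as suspect when its label disagrees with the majority label of its k nearest neighbours. The function name `knn_noise_filter`, the parameter `k`, and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def knn_noise_filter(X, y, k=3):
    """Return indices of examples whose label disagrees with the majority
    label of their k nearest neighbours (Euclidean distance).

    Illustrative assumption only: a plain neighbour-voting rule.
    """
    suspects = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances from example i to every other example, nearest first.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbour_labels = [y[j] for _, j in dists[:k]]
        majority_label, _ = Counter(neighbour_labels).most_common(1)[0]
        if majority_label != yi:
            suspects.append(i)
    return suspects


# Toy data: two well-separated clusters, with one deliberately flipped label.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
     (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.15, 5.15)]
y = ["a", "a", "a", "b",   # index 3 lies in the "a" cluster but is labelled "b"
     "b", "b", "b", "b"]

noisy = knn_noise_filter(X, y, k=3)

# Remove the flagged examples before inducing a model.
X_clean = [x for i, x in enumerate(X) if i not in noisy]
y_clean = [label for i, label in enumerate(y) if i not in noisy]
```

In this toy setting only the flipped example is flagged. On real data such a filter can also discard correctly labelled examples near class boundaries, which is one reason the choice of filter depends on the task.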
