Reducing the Influence of Confounders on Predictive Models

Ricardo Alves Brito

PUC-MG

Abstract

The analysis of Big Data has grown in importance with the steady increase in the information stored in digital media. Extracting value from diverse, unstructured data remains challenging. Predictive models can reveal new patterns and trends that may serve as a basis for innovation, but they must be sufficiently reliable to support decision-making. In this context, this article discusses the influence of confounding variables on predictive models and proposes techniques for identifying and minimizing their effect. Using a database of information collected in a hospital, a predictive model was built, possible confounding variables were identified, a technique was applied to minimize their influence, and the model's accuracy was evaluated with machine learning techniques. The result was an efficient prediction model.

Keywords: Big Data, Predictive Model, Confounders, Multicollinearity, Machine Learning.
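
The abstract does not name the technique used to identify and reduce confounding variables. Since the keywords mention multicollinearity, one plausible reading is a screen based on the variance inflation factor (VIF). The sketch below is an illustration under that assumption, not the article's method: the function names `vif` and `drop_collinear`, the synthetic data, and the threshold of 10 (a common rule of thumb) are all hypothetical. Note that VIF detects collinearity among predictors, which is related to but not the same as confounding.

```python
import numpy as np


def vif(X):
    """Return the variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out


def drop_collinear(X, names, threshold=10.0):
    """Repeatedly drop the column with the largest VIF above threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names


# Hypothetical data: column "c" is almost a copy of column "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.01 * rng.normal(size=200)
X = np.column_stack([a, b, c])

X_reduced, kept = drop_collinear(X, ["a", "b", "c"])
print("VIF before:", vif(X))
print("kept columns:", kept)
```

In this toy example one of the two nearly identical columns is removed and the independent column `b` is kept. A real analysis would still need domain knowledge to decide which of the flagged variables is a genuine confounder.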
