Analysis of Distinct Feature Groups in the Credit Scoring Problem
Keywords: credit scoring, feature groups, machine learning, web crawling
Registration and financial data have traditionally been used for the credit scoring problem. However, even slight improvements in the reliability of the scores positively impact financial companies. Therefore, exploring new features is a strategic task. This work analyzes the importance of new feature groups not commonly employed for the credit scoring task, alongside others already in use. We categorized features from open credit scoring datasets, such as the German and Australian datasets, and compared their groups with those of a company dataset used in this work. Our dataset contains unusual feature groups, such as historical, geolocation, web behavior, and demographic data. In our analyses, we first conducted bivariate tests with each feature pair to assess their individual importance. Second, we ran an XGBoost machine learning model with each feature group to evaluate each group's importance. We also applied feature selection with binary Particle Swarm Optimization to assess the groups' importance when combined. Next, we employed correlation tests to find inner and inter-correlation among the feature groups. Finally, we used the company dataset and employed the AdaBoost, Multilayer Perceptron, and XGBoost algorithms to find the best model for the task. Among our main findings, the unusual features added a slight improvement over registration features. We also detected reasonable inner correlation among some feature groups and found that all groups were relevant for the task, with the Historical Group as the most promising. Lastly, XGBoost outperformed AdaBoost and Multilayer Perceptron on the task.
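To make the group-level feature selection step more concrete, the following is a minimal sketch of binary Particle Swarm Optimization over a bit mask with one bit per feature group. It is not the implementation used in this work: the sigmoid transfer rule, the hyper-parameters (`w`, `c1`, `c2`, `v_max`), the group list, and especially the `toy_fitness` function with its placeholder `HYPOTHETICAL_GAIN` values are assumptions made purely for illustration. In practice the fitness would presumably be a validation score of a classifier trained on the selected groups.

```python
import math
import random

# Group names taken from the abstract; the gain values below are made-up
# placeholders for illustration only, NOT results from this work.
GROUPS = ["registration", "historical", "geolocation", "web_behavior", "demographic"]
HYPOTHETICAL_GAIN = [0.30, 0.40, 0.05, 0.10, 0.08]
PENALTY_PER_GROUP = 0.07  # assumed cost of including one more group


def toy_fitness(mask):
    """Stand-in fitness: summed placeholder gains minus a size penalty.

    A real study would instead train a model on the selected groups and
    return a validation metric.
    """
    gain = sum(g for g, bit in zip(HYPOTHETICAL_GAIN, mask) if bit)
    return gain - PENALTY_PER_GROUP * sum(mask)


def binary_pso(fitness, n_bits, n_particles=20, n_iters=50,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Maximize `fitness` over bit masks using sigmoid-transfer binary PSO."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_particles)]
    vel = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_bits):
                r1, r2 = rng.random(), rng.random()
                # Standard velocity update: inertia + cognitive + social terms.
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                v = max(-v_max, min(v_max, v))  # clamp to avoid saturation
                vel[i][d] = v
                # The sigmoid turns the velocity into the probability of a 1 bit.
                prob_one = 1.0 / (1.0 + math.exp(-v))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


best_mask, best_fit = binary_pso(toy_fitness, n_bits=len(GROUPS))
selected = [name for name, bit in zip(GROUPS, best_mask) if bit]
print("selected groups:", selected, "fitness:", round(best_fit, 3))
```

Searching over groups rather than individual features keeps the mask short, so the swarm explores a small space; with only a handful of groups an exhaustive search would also be feasible, and the sketch is meant only to show the mechanics of the update rule.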