Evaluating Strategies to Predict Student Dropout of a Bachelor's Degree in Computer Science
Abstract
The Brazilian Higher Education Census has revealed that the dropout rate among higher education students in Brazil exceeds 50% from the fifth year onward. This high rate wastes resources invested by both society and the students themselves. Universities therefore need strategies to prevent dropout and reduce this waste. However, predicting dropout requires detecting patterns in a large volume of data collected every year from thousands of students. At this scale, machine learning is a powerful technique for automating the identification of at-risk students. The objective of this paper is to identify students who are prone to dropping out, based on the academic history of students in the Bachelor’s Degree in Computer Science at a tuition-free public university in Brazil. We engineered four datasets based on the semester in which the students are enrolled. Each dataset simulates the academic situation and individual characteristics of the students as available up to the moment of prediction. In addition, we propose three feature models to identify the best scenario. Our method identified the students most likely to drop out and the main features that contributed to each decision. The best feature model used only information about the courses taken by the students. With these features and Gradient Boosting, the F1-score ranged between 69% and 85%, depending on the dataset.
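For readers unfamiliar with the reported metric, the F1-score is the harmonic mean of precision and recall. The sketch below is only an illustration of how it is computed: the labels are hypothetical, the function name is ours, and treating dropout as the positive class is an assumption, since the abstract does not say how the score was averaged.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1-score for the positive class (assumed here: 1 = dropout)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive prediction: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for ten students (1 = dropped out, 0 = did not).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")  # prints "F1 = 0.75"
```

In this toy case three of the four actual dropouts are flagged and one student is flagged wrongly, so precision and recall are both 0.75 and so is their harmonic mean.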