Machine Learning Techniques for Predicting the Length of Stay at Graduation in the Scope of Brazilian Public Higher Education
Abstract
This article applies techniques from the Knowledge Discovery in Databases (KDD) and Cross Industry Standard Process for Data Mining (CRISP-DM) processes to educational databases published by the Instituto Nacional de Estudos e Pesquisas Educacionais Anísio Teixeira (INEP, the National Institute of Educational Studies and Research Anísio Teixeira). The goal is to discover knowledge about how long students remain enrolled in undergraduate courses at Brazilian public higher education institutions. To that end, supervised machine learning methods were used to build models based on the Decision Tree, Random Forest, XGBoost, and Neural Network algorithms. The XGBoost models stood out in all the experiments performed.
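The comparison of model families described above could be set up along the following lines. This is only a minimal sketch under stated assumptions, not the authors' pipeline: it uses synthetic data instead of the INEP microdata, it assumes the length of stay is treated as a class label (the abstract does not say), it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost, and every hyperparameter and the helper name `compare_models` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: NOT the INEP microdata used in the article.
# The three classes play the role of hypothetical length-of-stay bands.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# Hypothetical settings chosen only to keep the example small and fast.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Assumption: a scikit-learn gradient-boosted ensemble stands in for
    # XGBoost so that the sketch has no dependency beyond scikit-learn.
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "neural_network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                                    random_state=0),
}

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def compare_models(models, X, y, cv):
    """Return the mean macro-averaged F1 of each model over the CV folds."""
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean()
        for name, model in models.items()
    }


scores = compare_models(models, X, y, cv)
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

If the `xgboost` package is installed, its `XGBClassifier` could replace the stand-in entry in `models`, since it follows the same scikit-learn estimator interface. Scores on synthetic data say nothing about the results reported in the article.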
Keywords:
University graduate, Knowledge Discovery in Databases, Cross Industry Standard Process for Data Mining, Machine Learning, XGBoost
Published
2021-08-24
How to Cite
RODRIGUES, Ebony M.; GOUVEIA, Roberta M. M. Machine Learning Techniques for Predicting the Length of Stay at Graduation in the Scope of Brazilian Public Higher Education. In: CONGRESS ON TECHNOLOGIES IN EDUCATION (CTRL+E), 6., 2021, Online Event. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 128-137.
