Natural Language Processing Approaches for Accrediting Students on Extracurricular Activities

Abstract


Undergraduate programs at Brazilian universities allow students to include extracurricular activities in their academic transcripts. Because students submit a large number of proof documents (certificates and declarations) that academic staff must then analyze, accrediting extracurricular activities is time-consuming and error-prone. This paper describes a methodology for classifying academic proof documents into the groups pre-defined by the Universidade de Brasília regulations for accrediting extracurricular activities. Experimental results showed that TF-IDF with SVM outperformed BERT, CNN, and BiLSTM, achieving an average Macro F1-Score of 0.94, although the differences in performance were not statistically significant.
Keywords: academic documents, extracurricular activities, classification, machine learning
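
The abstract compares classifiers by their Macro F1-Score, which is the unweighted mean of the per-class F1 scores, so every group counts equally regardless of how many documents it contains. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code. The function name `macro_f1` and the group labels in the example are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical activity groups, not the ones defined by the regulations.
y_true = ["teaching", "teaching", "research", "research", "outreach", "outreach"]
y_pred = ["teaching", "research", "research", "research", "outreach", "teaching"]
print(round(macro_f1(y_true, y_pred), 3))
```

Because each class contributes equally to the mean, a classifier that performs poorly on a rare group is penalized as much as one that performs poorly on a common group.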

Published
4 November 2024
CAVALCANTE, João Pedro F. M.; MARINHO, Mayara C.; BORGES, Vinicius R. P. Natural Language Processing Approaches for Accrediting Students on Extracurricular Activities. In: SIMPÓSIO BRASILEIRO DE INFORMÁTICA NA EDUCAÇÃO (SBIE), 35., 2024, Rio de Janeiro/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 1796-1809. DOI: https://doi.org/10.5753/sbie.2024.242548.