Automatic Grading of Portuguese Short Answers Using a Machine Learning Approach
Abstract
Short answers are routinely used in learning environments for student assessment. Despite their importance, teachers find grading discursive answers very time-consuming. To help address this problem, this work explores the field of Automatic Short Answer Grading (ASAG) using a machine learning approach. The literature was reviewed and 44 papers employing different techniques were analyzed along several dimensions. A Portuguese dataset with more than 7,000 short answers was built. Different approaches were tested and then combined into a final model. The model's effectiveness proved satisfactory, with kappa scores indicating moderate to substantial agreement between the model and human graders.
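The abstract reports human-machine agreement in terms of kappa. As a minimal illustrative sketch (not the authors' evaluation code), the snippet below computes a quadratic weighted kappa with scikit-learn and maps it onto the commonly used Landis and Koch agreement bands; the grade scale and values are hypothetical, not taken from the paper's dataset.

# Illustrative sketch: quadratic weighted kappa between human grades
# and model predictions (hypothetical 0-3 scores), the kind of metric
# used to judge human-machine agreement in ASAG studies.
from sklearn.metrics import cohen_kappa_score

human_grades = [2, 3, 1, 0, 3, 2, 1, 2]   # hypothetical human scores
model_grades = [2, 3, 1, 1, 2, 2, 1, 3]   # hypothetical model predictions

kappa = cohen_kappa_score(human_grades, model_grades, weights="quadratic")

# Conventional Landis and Koch interpretation bands for kappa.
if kappa > 0.80:
    band = "almost perfect"
elif kappa > 0.60:
    band = "substantial"
elif kappa > 0.40:
    band = "moderate"
else:
    band = "fair or worse"

print(f"quadratic weighted kappa = {kappa:.3f} ({band} agreement)")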
Keywords:
Machine learning, short answers, assessments
Published
03/11/2020
How to Cite
GALHARDI, Lucas; DE SOUZA, Rodrigo C. Thom; BRANCHER, Jacques. Automatic Grading of Portuguese Short Answers Using a Machine Learning Approach. In: CONCURSO DE TESES E DISSERTAÇÕES EM SISTEMAS DE INFORMAÇÃO - SIMPÓSIO BRASILEIRO DE SISTEMAS DE INFORMAÇÃO (SBSI), 16., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 109-124. DOI: https://doi.org/10.5753/sbsi.2020.13133.