Prompt Engineering for Automatic Short Answer Grading in Brazilian Portuguese
Abstract
Automatic Short Answer Grading (ASAG) is a prominent area of Artificial Intelligence in Education (AIED). Despite extensive research, developing ASAG systems remains challenging, even for a single subject, mostly because students' answers vary widely in length and content. Recent research has explored Large Language Models (LLMs) to enhance the efficiency of ASAG, but LLM performance depends heavily on prompt design, which makes prompt engineering crucial. However, to the best of our knowledge, no research has systematically investigated prompt engineering in ASAG. This study therefore compares over 128 prompt combinations on a Portuguese dataset using GPT-3.5-Turbo and GPT-4-Turbo. Our findings indicate that specific prompt components play a crucial role in improving GPT results and show that GPT-4 consistently outperformed GPT-3.5 in this domain. These insights can guide prompt design for ASAG in Brazilian Portuguese. We therefore recommend that students, educators, and developers use these findings to optimize prompt design and benefit from state-of-the-art LLMs whenever possible.
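As a rough illustration of what comparing prompt combinations can involve, the sketch below enumerates every on/off combination of a few optional prompt components. The component names and texts, `BASE_INSTRUCTION`, and `build_prompts` are hypothetical and are not taken from the study; whether the study built its combinations this way is not stated here. With n optional components, such a grid contains 2^n prompts.

```python
from itertools import product

# Hypothetical optional prompt components (illustrative only; not the study's list).
OPTIONAL_COMPONENTS = {
    "role": "You are a teacher grading short answers.",
    "rubric": "Use the grading rubric provided below.",
    "examples": "Here are graded example answers.",
}

# Hypothetical instruction that is present in every prompt.
BASE_INSTRUCTION = "Grade the student's answer to the question."


def build_prompts(base, components):
    """Return one (enabled_names, prompt_text) pair per on/off combination."""
    names = list(components)
    prompts = []
    for flags in product([False, True], repeat=len(names)):
        # Keep only the components switched on in this combination.
        enabled = [name for name, on in zip(names, flags) if on]
        text = "\n".join([components[name] for name in enabled] + [base])
        prompts.append((tuple(enabled), text))
    return prompts


prompts = build_prompts(BASE_INSTRUCTION, OPTIONAL_COMPONENTS)
print(len(prompts))  # 2 ** 3 = 8 combinations for three components
```

Each generated prompt could then be sent to the model under evaluation and its grades compared with the human-assigned grades, so that the contribution of each component can be examined separately.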