ABSTRACT
Software professionals face a constant need to improve as they tackle increasingly sophisticated problems. For their study and training to be effective, however, the feedback they receive must be both immediate and accurate. In software companies, where many professionals undergo training but few qualified experts are available to correct their work, delivering effective feedback becomes even more challenging. To address this challenge, this work explores the use of Large Language Models (LLMs) to support the correction of open-ended questions in technical training.
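To make the setup concrete, the sketch below shows one way such LLM-based correction could be invoked, assuming the OpenAI Python client. The prompt wording, scoring scale, model choice, and the grade_answer helper are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of LLM-based grading of an open-ended answer.
# Assumes the OpenAI Python client and an OPENAI_API_KEY in the environment;
# prompt, rubric, and model are hypothetical, not the study's exact setup.
from openai import OpenAI

client = OpenAI()

def grade_answer(question: str, reference: str, answer: str) -> str:
    """Ask the model for a score and written feedback on one response."""
    prompt = (
        "You are grading a trainee's answer to a technical question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Trainee answer: {answer}\n"
        "Give a score from 0 to 10 and a short justification."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep grading as repeatable as possible
    )
    return response.choices[0].message.content

print(grade_answer(
    "What does a unit test verify?",
    "The behavior of a small, isolated piece of code.",
    "It checks that one function works correctly on its own.",
))
```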
In this study, we used ChatGPT to correct open-ended questions answered by 42 industry professionals on two topics. Evaluating the corrections and feedback it produced, we observed that ChatGPT can identify semantic details in responses that conventional text-similarity metrics cannot capture. Moreover, subject-matter experts generally agreed with the corrections and feedback ChatGPT provided.
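As an illustration of what surface-overlap metrics miss, the sketch below scores a semantically equivalent paraphrase with TF-IDF cosine similarity, a common baseline for automatic answer assessment. The example sentences are hypothetical, chosen only to show how low the score falls when wording differs despite matching meaning.

```python
# Sketch of a surface-overlap baseline: TF-IDF cosine similarity scores a
# paraphrase near zero because it shares almost no tokens with the
# reference, even though the meaning is equivalent. Example text is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = "A unit test verifies the behavior of a small, isolated piece of code."
paraphrase = "It checks that one function works correctly on its own."

vectors = TfidfVectorizer().fit_transform([reference, paraphrase])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"cosine similarity: {score:.2f}")  # low despite equivalent meaning
```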