GPT-3.5 for Data Augmentation in Automatic Essay Scoring: A Preliminary Analysis
Abstract
Machine learning models are sensitive to the dataset used during their training. Dealing with limited or imbalanced datasets is challenging, and a commonly adopted approach to mitigate this limitation is data augmentation. For example, expanding the training set in a computer vision problem may involve rotating and resizing images; however, this task is more complex when dealing with textual data. This work investigates the use of GPT-3.5 for data augmentation in a dataset of argumentative essays from the National High School Exam (ENEM), which is used as a selection criterion for entry into public universities in Brazil. More specifically, we adopted traditional Natural Language Processing (NLP) techniques for essay scoring and compared the results with and without data augmentation. Our results show that the long argumentative essays generated by GPT in the data augmentation process did not improve the performance of the NLP models. Moreover, GPT could not adequately classify its own synthetic data, which suggests that the generated data is of poor quality, and it did not outperform the NLP models in classifying real data.
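As a rough illustration of the setup described in the abstract, the sketch below plans an augmentation round for an imbalanced set of essay scores and builds the two training sets to compare (with and without synthetic essays). The function names, the balancing rule (match the largest class), and the example labels are assumptions made for illustration; the abstract does not state how many synthetic essays were generated or how they were distributed across scores.

```python
from collections import Counter


def augmentation_quota(labels):
    """Count how many synthetic essays each score class would need
    to match the size of the largest class (hypothetical balancing rule)."""
    counts = Counter(labels)
    if not counts:
        return {}
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


def build_training_sets(real, synthetic):
    """Return (baseline, augmented) training sets.

    Both models should then be evaluated on the same held-out set of
    real essays, so any difference comes from the synthetic data alone.
    """
    baseline = list(real)
    augmented = list(real) + list(synthetic)
    return baseline, augmented


# Hypothetical score labels for one competency: the top score is rare.
example_labels = [120] * 5 + [160] * 3 + [200] * 1
example_quota = augmentation_quota(example_labels)
```

Here `example_quota` maps each score to the number of synthetic essays that would be requested for it, so the rarest score receives the largest share.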
Published
24/11/2025
How to Cite
CARVALHO, Ruan; MIRANDA, Péricles B. C.; OLIVEIRA, Hilário T. A.; XAVIER, Cleon; RODRIGUES, Luiz; COSTA, Newarney T.; MELLO, Rafael Ferreira. GPT-3.5 for Data Augmentation in Automatic Essay Scoring: A Preliminary Analysis. In: SIMPÓSIO BRASILEIRO DE INFORMÁTICA NA EDUCAÇÃO (SBIE), 36., 2025, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 578-589. DOI: https://doi.org/10.5753/sbie.2025.12536.
