From Zero-shot to Self-generated References: Leveraging LLMs for Scoring ENEM Essays

  • Matheus Yasuo Ribeiro Utino (USP)
  • Paulo Mann (UFRJ)

Abstract


This study investigates the application of Large Language Models (LLMs) to Automated Essay Scoring (AES) in the context of Brazil’s Exame Nacional do Ensino Médio (ENEM). We evaluate five state-of-the-art LLMs across three prompting scenarios: zero-shot, one-shot (with high-score reference essays), and a novel self-generated reference approach, in which the model produces its own ideal reference essay before evaluating the student's text. On the Essay-BR corpus, we assess performance with both classification and regression metrics. Results show that one-shot prompting consistently achieves the best performance, while the self-generated reference method remains a viable alternative when no real reference essays are available. Our findings highlight the promise of LLMs for educational scoring.
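In practice, the self-generated reference setup can be read as a two-call prompting pipeline: the model first drafts its own high-score reference essay for the given theme, then grades the student essay against that draft. The sketch below illustrates this idea with an OpenAI-style chat API; the model name, prompt wording, and helper functions (`generate_reference`, `score_essay`) are illustrative assumptions, not the authors' exact prompts or models.

```python
# Minimal sketch of a self-generated reference pipeline (assumptions, not the
# paper's exact setup): the LLM writes its own "ideal" essay for the theme,
# then scores the student essay against that self-generated reference.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; the paper evaluates five different LLMs


def generate_reference(theme: str) -> str:
    """Step 1: ask the model to draft a top-score reference essay for the theme."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                "Write a model ENEM essay (dissertativo-argumentativo) on the "
                f"theme: '{theme}'. It should merit the maximum score of 1000 "
                "across the five ENEM competencies."
            ),
        }],
    )
    return resp.choices[0].message.content


def score_essay(theme: str, essay: str, reference: str) -> str:
    """Step 2: grade the student essay using the self-generated reference as anchor."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                f"Theme: {theme}\n\n"
                f"Reference essay (score 1000):\n{reference}\n\n"
                f"Student essay:\n{essay}\n\n"
                "Compare the student essay with the reference and return a "
                "total score from 0 to 1000, with a brief justification."
            ),
        }],
    )
    return resp.choices[0].message.content


# Usage (hypothetical):
#   reference = generate_reference(theme)
#   result = score_essay(theme, student_essay, reference)
```

The one-shot scenario would follow the same second call, but with a real high-score essay in place of the generated reference; zero-shot omits the reference entirely.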

Published
2025-09-29
UTINO, Matheus Yasuo Ribeiro; MANN, Paulo. From Zero-shot to Self-generated References: Leveraging LLMs for Scoring ENEM Essays. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 467-477. DOI: https://doi.org/10.5753/stil.2025.37847.