Automatic Scoring of Elementary School Essays in Brazilian Portuguese with LLMs: Comparing Gemini, GPT-4o, Claude, and Mistral
Abstract
Writing is an essential skill for the development of students' critical thinking, communication, and language competencies. However, evaluating written productions efficiently and within an appropriate timeframe remains a challenge, especially in contexts of high teaching demand. This study investigates the use of Large Language Models (LLMs) for the automated analysis of essays, with a focus on achieving accurate and consistent assessments. Four advanced models (Gemini, GPT-4o, Claude 3.7, and Mistral) were examined and applied to the evaluation of narrative texts written in Brazilian Portuguese by elementary school students. The results indicate that the Gemini 2.0 Pro model demonstrated greater accuracy in score assignment, while Claude 3.7 stood out for consistency in alignment with human evaluation. The findings highlight the potential of LLMs to support pedagogical practices by providing consistent assessments and contributing to the development of students' writing skills. The study proposes viable alternatives for the use of artificial intelligence in educational contexts with limited resources.
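The prompt-driven scoring described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's actual implementation: `RUBRIC`, `build_prompt`, and `parse_score` are hypothetical names, the 0–5 scale is assumed for illustration, and the model call is mocked rather than issued to any real API (Gemini, GPT-4o, Claude, or Mistral).

```python
import json

# Hypothetical rubric for illustration; the paper's actual prompts and
# scoring scale are not reproduced here. The rubric asks the model to
# return its score as JSON so the reply can be parsed deterministically.
RUBRIC = (
    "Avalie a redação narrativa a seguir em uma escala de 0 a 5, "
    "considerando coesão, coerência e adequação ao gênero narrativo. "
    'Responda apenas com JSON no formato {"score": <inteiro>}.'
)

def build_prompt(essay: str) -> str:
    """Combine the rubric and the student essay into one scoring prompt."""
    return f"{RUBRIC}\n\nRedação:\n{essay}"

def parse_score(raw_response: str, low: int = 0, high: int = 5) -> int:
    """Extract and validate the integer score from the model's JSON reply."""
    score = int(json.loads(raw_response)["score"])
    if not low <= score <= high:
        raise ValueError(f"score {score} outside rubric range [{low}, {high}]")
    return score

# Example with a mocked model reply (no API call is made here):
mock_reply = '{"score": 4}'
print(parse_score(mock_reply))  # 4
```

Constraining the model to a JSON reply and validating the parsed score against the rubric's range is one common way to make LLM-assigned scores machine-checkable before comparing them with human ratings.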
Keywords:
Automated Essay Scoring, Large Language Models, Artificial Intelligence in Education
References
Aucejo, E. M. and Wong, K. (2024). The effect of feedback on student performance. Journal of Public Economics, 224:105274.
Chase, H. (2023). LangChain. [link].
da Silva, W. A. and de Araujo, C. C. (2025). Automated enem essay scoring and feedbacks: A prompt-driven llm approach. In Proceedings of the ..., Recife, Brazil.
da Silva Filho, M. W., Nascimento, A. C., Miranda, P., Rodrigues, L., Cordeiro, T., Isotani, S., Bittencourt, I. I., and Mello, R. F. (2023). Automated formal register scoring of student narrative essays written in portuguese. In Workshop de Aplicações Práticas de Learning Analytics em Instituições de Ensino no Brasil (WAPLA), pages 1–11. SBC.
Er, E., Dimitriadis, Y., and Gašević, D. (2021). Collaborative peer feedback and learning analytics: Theory-oriented design for supporting class-wide interventions. Assessment & Evaluation in Higher Education, 46(2):169–190.
Ferreira Mello, R., Pereira Junior, C., Rodrigues, L., Pereira, F. D., Cabral, L., Costa, N., Ramalho, G., and Gasevic, D. (2025). Automatic short answer grading in the llm era: Does gpt-4 with prompt engineering beat traditional models? pages 93–103.
Gasparini, S. M., Barreto, S. M., and Assunção, A. A. (2022). O professor, as condições de trabalho e os efeitos sobre sua saúde. Educação & Pesquisa, 48(2):e242423.
Graham, S. and Harris, K. R. (2019). Evidence-based practices in writing. In Best Practices in Writing Instruction.
Hou, Z. J., Ciuba, A., and Li, X. L. (2025). Improve llm-based automatic essay scoring with linguistic features. arXiv preprint arXiv:2404.19064.
Liew, P. Y. and Tan, I. K. T. (2024). On automated essay grading using large language models. In Proceedings of the 2024 8th International Conference on Computer Science and Artificial Intelligence (CSAI), page 8, Beijing, China. ACM.
Liu, H.-C., Wang, C., Keefer, M. W., Kim, S., Glaser, K., van der Wegen, R., and Rus, V. (2023). Evaluating LLMs for grading undergraduate student essays. arXiv preprint arXiv:2502.09497.
Marrs, S. et al. (2016). Exploring elementary student perceptions of writing feedback. Journal on Educational Psychology, 10(1):16–28.
Mello, R. F., de Oliveira, H. T. A., Wenceslau, M., Batista, H., Cordeiro, T., Bittencourt, I. I., and Isotani, S. (2024a). Brazilian portuguese narrative essays dataset. Accessed: May 15, 2025.
Mello, R. F., Oliveira, H., Wenceslau, M., Batista, H., Cordeiro, T., Bittencourt, I. I., and Isotani, S. (2024b). Propor'24 competition on automatic essay scoring of portuguese narrative essays. In Proceedings of the 16th International Conference on Computational Processing of Portuguese-Vol. 2, pages 1–5.
Mello, R. F., Rodrigues, L., Sousa, E., Batista, H., Lins, M., Nascimento, A., and Gasevic, D. (2025). Automatic detection of narrative rhetorical categories and elements on middle school written essays.
Page, E. B. (1966). The imminence of grading essays by computer. The Phi Delta Kappan, 47(5):238–243.
Salim, Y., Stevanus, V., Barlian, E., Sari, A. C., and Suhartono, D. (2019). Automated english digital essay grader using machine learning. In 2019 IEEE International Conference on Engineering, Technology and Education (TALE), pages 1–6. IEEE.
Seßler, K., Fürstenberg, M., Bühler, B., and Kasneci, E. (2025). Can ai grade your essays? a comparative analysis of large language models and teacher ratings in multidimensional essay scoring. pages 462–472.
Singh, A., Tan, D., Hepworth, C., and Seeland, M. (2024). Comparing LLM responses for education across model families. arXiv preprint arXiv:2502.08450.
Wang, D. and Wang, J. (2021). The impact mechanism of aes on improving english writing achievement.
Published
24/11/2025
How to Cite
LOBO, Jamilla et al. Automatic Scoring of Elementary School Essays in Brazilian Portuguese with LLMs: Comparing Gemini, GPT-4o, Claude, and Mistral. In: SIMPÓSIO BRASILEIRO DE INFORMÁTICA NA EDUCAÇÃO (SBIE), 36., 2025, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 167-180. DOI: https://doi.org/10.5753/sbie.2025.12161.
