Using LLMs to grade programming assignments: a comparative analysis in the context of object orientation

Marcel da Silva Melo; Fernando Barbosa Matos; Rodrigo Elias Francisco; Cleon Xavier Pereira Júnior; Rafael Dias Araújo

doi:10.5753/educomp.2026.18664

Marcel da Silva Melo, IF Goiano
Fernando Barbosa Matos, IF Goiano
Rodrigo Elias Francisco, IF Goiano
Cleon Xavier Pereira Júnior, IF Goiano
Rafael Dias Araújo, UFU

Abstract

Grading programming assignments is a laborious task that demands considerable time and effort from instructors, especially in large classes, where the volume of submissions to grade is high. Against this background, this work investigates the potential of LLMs, in their default configuration and with a zero-shot approach, for the automatic assessment of Object-Oriented Programming (OOP) assignments. A comparative analysis based on statistical metrics was carried out between the grades generated by the LLMs and the grade assigned by the course instructor. The study found that the GPT-4.1, GPT-4.1-mini, Grok-3, DeepSeek-V3, and Grok-3-mini models achieved low error rates and maintained a strong level of agreement with the instructor's grades.
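The abstract does not name the statistical metrics behind "error rates" and "agreement". As a minimal sketch, assuming integer grades on a 0 to 10 scale and assuming mean absolute error (MAE), root mean square error (RMSE), and quadratic weighted kappa as the metrics, a comparison between instructor and model grades might look like the following. The function names, the grade scale, and the sample grades are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def mae(teacher, model):
    """Mean absolute error between two equally long grade lists."""
    return sum(abs(t - m) for t, m in zip(teacher, model)) / len(teacher)


def rmse(teacher, model):
    """Root mean square error between two equally long grade lists."""
    squared = sum((t - m) ** 2 for t, m in zip(teacher, model))
    return math.sqrt(squared / len(teacher))


def quadratic_weighted_kappa(teacher, model, min_grade=0, max_grade=10):
    """Weighted kappa with quadratic weights over integer grades.

    Assumption: grades are integers in [min_grade, max_grade].
    """
    n = len(teacher)
    span = max_grade - min_grade          # largest possible disagreement
    observed = Counter(zip(teacher, model))
    teacher_hist = Counter(teacher)
    model_hist = Counter(model)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(min_grade, max_grade + 1):
        for j in range(min_grade, max_grade + 1):
            weight = (i - j) ** 2 / span ** 2
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += (
                weight * (teacher_hist[i] / n) * (model_hist[j] / n)
            )

    if expected_disagreement == 0:
        # Both raters gave one and the same grade to every submission.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement


if __name__ == "__main__":
    # Hypothetical grades, for illustration only.
    teacher_grades = [10, 8, 6, 9, 4]
    model_grades = [9, 8, 7, 9, 5]
    print("MAE:", mae(teacher_grades, model_grades))
    print("RMSE:", rmse(teacher_grades, model_grades))
    print("QWK:", quadratic_weighted_kappa(teacher_grades, model_grades))
```

Lower MAE and RMSE indicate model grades closer to the instructor's, while a kappa near 1 indicates agreement well beyond what chance would produce. Quadratic weights penalize large grade gaps more heavily than small ones, which suits ordinal grade scales.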
