A Comparative Analysis of LLMs with Prompt Engineering Techniques for Automatic Short Answer Grading
Abstract
Automatic Short Answer Grading (ASAG) seeks to reduce human effort in large-scale educational assessment, yet investigations in Brazilian Portuguese remain scarce. This study compares three large language models (GPT-4o-mini, Sabiazinho-3, and Gemini 2.0-Flash) and analyzes the impact of seven prompt engineering elements on model performance. Using a Portuguese-language dataset, we evaluated all possible combinations of these elements. The combination of few-shot examples with an explicit rubric was the most effective, and step-by-step reasoning benefited GPT-4o-mini in particular. Sabiazinho-3 showed the highest agreement with human graders; Gemini 2.0-Flash achieved the lowest mean absolute error but produced more hallucinations; and GPT-4o-mini generated the cleanest numeric outputs.
