Exploring the use of LLMs to label programming strategies
Abstract
In programming courses, a single exercise can be solved in different ways. Understanding the strategies students adopt to solve problems is pedagogically relevant because it lets teachers assess, for example, whether students are assimilating the content covered in class and applying it correctly in their code. To help teachers in this context, this paper investigates the use of LLMs to label students’ code according to the strategy used to solve programming exercises. To this end, a database of labeled code samples was created, and experiments were then carried out with different LLMs. Preliminary results indicate that, given appropriately formulated prompts, LLMs have the potential to perform automatic code labeling.
References
Barbosa, A., Barros Costa, E., and Brito, P. H. (2023). Juízes online são suficientes ou precisamos de um VAR? In Anais do III Simpósio Brasileiro de Educação em Computação, pages 386–394. SBC.
Beh, M. Y., Gottipati, S., Lo, D., and Shankararaman, V. (2016). Semi-automated tool for providing effective feedback on programming assignments.
Brown, T. B. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu, K., Chen, H., Yi, X., Wang, C., Wang, Y., et al. (2024). A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45.
Galvão, L., Fernandes, D., and Gadelha, B. (2016). Juiz online como ferramenta de apoio a uma metodologia de ensino híbrido em programação. In Brazilian Symposium on Computers in Education (Simpósio Brasileiro de Informática na Educação-SBIE), volume 27, page 140.
Glassman, E. L., Scott, J., Singh, R., Guo, P. J., and Miller, R. C. (2015). Overcode: Visualizing variation in student solutions to programming problems at scale. ACM Transactions on Computer-Human Interaction (TOCHI), 22(2):1–35.
Joyner, D., Arrison, R., Ruksana, M., Salguero, E., Wang, Z., Wellington, B., and Yin, K. (2019). From clusters to content: Using code clustering for course improvement. In Proceedings of the 50th ACM Technical Symposium on Computer Science Education, pages 780–786.
Koivisto, T. and Hellas, A. (2022). Evaluating codeclusters for effectively providing feedback on code submissions. In 2022 IEEE Frontiers in Education Conference (FIE), pages 1–9. IEEE.
Mehta, A., Gupta, N., Balachandran, A., Kumar, D., Jalote, P., et al. (2023). Can ChatGPT play the role of a teaching assistant in an introductory programming course? arXiv preprint arXiv:2312.07343.
Melo, R., Pessoa, M., and Fernandes, D. (2024). Clusterização de soluções de exercícios de programação: um mapeamento sistemático da literatura. In Anais do 35º Simpósio Brasileiro de Informática na Educação (SBIE), pages 1715–1729. Sociedade Brasileira de Computação.
Pires, R., Abonizio, H., Almeida, T. S., and Nogueira, R. (2023). Sabiá: Portuguese large language models. In Brazilian Conference on Intelligent Systems, pages 226–240. Springer.
Poldrack, R. A., Lu, T., and Beguš, G. (2023). AI-assisted coding: Experiments with GPT-4. arXiv preprint arXiv:2304.13187.
Reiss, M. V. (2023). Testing the reliability of ChatGPT for text annotation and classification: A cautionary remark. arXiv preprint arXiv:2304.11085.
Wasik, S., Antczak, M., Badura, J., Laskowski, A., and Sternal, T. (2018). A survey on online judge systems and their applications. ACM Computing Surveys (CSUR), 51(1):1–34.
Yin, H., Moghadam, J., and Fox, A. (2015). Clustering student programming assignments to multiply instructor leverage. In Proceedings of the Second (2015) ACM Conference on Learning @ Scale, pages 367–372.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. (2023). A survey of large language models. arXiv preprint arXiv:2303.18223.
Published
2025-04-07
How to Cite
MELO, Rafaela; SOUZA, Tiago; OLIVEIRA, Elaine; GALVÃO, Leandro; PESSOA, Marcela; FERNANDES, David. Exploring the use of LLMs to label programming strategies. In: BRAZILIAN SYMPOSIUM ON COMPUTING EDUCATION (EDUCOMP), 5., 2025, Juiz de Fora/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 178-190. ISSN 3086-0733. DOI: https://doi.org/10.5753/educomp.2025.5335.
