Designing an LLM-based Multiagent System for Generating Activities and their Rubrics: A Study on Data Mining

  • Eryck Silva UNICAMP
  • Julio Cesar dos Reis UNICAMP

Abstract

Assessment is the primary way in which instructors evaluate students’ progress. However, developing high-quality assessments and their corresponding rubrics imposes a significant workload on instructors. In this context, Artificial Intelligence can be explored to assist in co-creating assessments and rubrics. This study proposes MASGAR, a multi-agent system designed to create activities and rubrics. We define the system’s architecture and conduct a simulated test study to assess the viability of MASGAR in a Data Mining course by generating two activities and their rubrics. Results indicate that co-creation is essential for conveying human experience and leveraging LLM-based systems in educational contexts. Feedback from students in the course indicated that the generated activities were coherent and creative, and suggested criteria for improvement.

References

Aguilar-Savén, R. S. (2004). Business process modelling: Review and framework. International Journal of Production Economics, 90(2):129–149.

Alves, N. d. C., von Wangenheim, C. G., Alberto, M., and Martins-Pacheco, L. H. (2020). Uma Proposta de Avaliação da Originalidade do Produto no Ensino de Algoritmos e Programação na Educação Básica. In Simpósio Brasileiro de Informática na Educação (SBIE), pages 41–50. SBC.

Bahroun, Z., Anane, C., Ahmed, V., and Zacca, A. (2023). Transforming Education: A Comprehensive Review of Generative Artificial Intelligence in Educational Settings through Bibliometric and Content Analysis. Sustainability, 15(17):12983.

Becker, J. (2024). Multi-agent large language models for conversational task-solving. arXiv preprint arXiv:2410.22932.

Bloom, B., Hastings, J., and Madaus, G. (1971). Handbook on Formative and Summative Evaluation of Student Learning. McGraw-Hill.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.

Carbonell, J. (1970). AI in CAI: An Artificial-Intelligence Approach to Computer-Assisted Instruction. IEEE Transactions on Man Machine Systems, 11(4):190–202.

Chico, V. J. S., Tessler, J. F., Bonacin, R., and dos Reis, J. C. (2024). BEQuizzer: AI-Based Quiz Automatic Generation in the Portuguese Language. In Rapp, A., Di Caro, L., Meziane, F., and Sugumaran, V., editors, Natural Language Processing and Information Systems, pages 237–248, Cham. Springer Nature Switzerland.

Duong, T. N. B. and Meng, C. Y. (2024). Automatic grading of short answers using large language models in software engineering courses. In 2024 IEEE Global Engineering Education Conference (EDUCON), pages 1–10.

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Chen, Q., Peng, W., Feng, X., Qin, B., and Liu, T. (2023). A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions.

Izu, C. and Mirolo, C. (2024). Towards comprehensive assessment of code quality at CS1-level: Tools, rubrics and refactoring rules. In 2024 IEEE Global Engineering Education Conference (EDUCON), pages 1–10.

Jiang, B., Xie, Y., Wang, X., Yuan, Y., Hao, Z., Bai, X., Su, W. J., Taylor, C. J., and Mallick, T. (2024). Towards rationality in language and multimodal agents: A survey. arXiv preprint arXiv:2406.00252.

Jo, E., Epstein, D. A., Jung, H., and Kim, Y.-H. (2023). Understanding the Benefits and Challenges of Deploying Conversational AI Leveraging Large Language Models for Public Health Intervention. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–16, Hamburg Germany. ACM.

Keuning, H., Heeren, B., and Jeuring, J. (2021). A tutoring system to learn code refactoring. In Proceedings of the 52nd ACM Technical Symposium on Computer Science Education, pages 562–568.

Kinnunen, P. and Simon, B. (2012). My program is ok – am I? Computing freshmen’s experiences of doing programming assignments. Computer Science Education, 22(1):1–28.

Krathwohl, D. R. (2002). A Revision of Bloom’s Taxonomy: An Overview. Theory Into Practice, 41(4):212–218.

Kumar, V. S. and Boulanger, D. (2021). Automated Essay Scoring and the Deep Learning Black Box: How Are Rubric Scores Determined? International Journal of Artificial Intelligence in Education, 31(3):538–584.

Lancaster, T., Robins, A. V., and Fincher, S. A. (2019). Assessment and Plagiarism, page 414–444. Cambridge Handbooks in Psychology. Cambridge University Press.

Lima, M. R., Ferreira, D. J., and Dias, E. S. (2024). Uso de Rubricas em Disciplinas de Programação Introdutória: Uma Revisão Sistemática da Literatura. In Simpósio Brasileiro de Informática na Educação (SBIE), pages 1–14. SBC.

Linnenbrink, E. A. and Pintrich, P. R. (2003). The role of self-efficacy beliefs in student engagement and learning in the classroom. Reading & Writing Quarterly, 19(2):119–137.

Martins, F. L. B., de Oliveira, A. C. A., de Vasconcelos, D. R., and de Menezes, M. V. (2023). Avaliando a habilidade do ChatGPT de realizar provas de Dedução Natural em Lógica Proposicional. In Simpósio Brasileiro de Informática na Educação (SBIE), pages 1282–1292. SBC.

Phung, T., Pădurean, V.-A., Cambronero, J., Gulwani, S., Kohn, T., Majumdar, R., Singla, A., and Soares, G. (2023). Generative AI for programming education: Benchmarking ChatGPT, GPT-4, and human tutors. In Proceedings of the 2023 ACM Conference on International Computing Education Research – Volume 2, pages 41–42.

Rockembach, G. R. and Thom, L. H. (2024). Investigating the Use of Intelligent Tutors Based on Large Language Models: Automated generation of Business Process Management questions using the Revised Bloom’s Taxonomy. In Simpósio Brasileiro de Informática Na Educação (SBIE), pages 1587–1601. SBC.

Russell, S. J. and Norvig, P. (2016). Artificial Intelligence: A Modern Approach. Prentice Hall Series in Artificial Intelligence. Pearson, Boston, third edition, global edition.

Scriven, M. (1967). The methodology of evaluation. In Tyler, R., Gagné, R., and Scriven, M., editors, Perspectives of Curriculum Evaluation, AERA Monograph Series on Curriculum Evaluation, volume 1, pages 39–83. Rand McNally, Chicago.

Villa, J. E. A., Garcia, R., Miranda, A. L. M., Oran, A., Guedes, G. T. A., Santana, B. S., Silva, D. G., Valle, P., and Silva, W. (2024). Perspectiva dos Estudantes sobre um Agente Pedagógico Baseado em Exemplos para a Aprendizagem de Programação: uma análise qualitativa. In Simpósio Brasileiro de Informática na Educação (SBIE), pages 459–473. SBC.

Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., Zhao, W. X., Wei, Z., and Wen, J. (2024). A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345.

Wang, Z., Cai, S., Chen, G., Liu, A., Ma, X., and Liang, Y. (2023). Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560.

Wolber, D., Abelson, H., Spertus, E., and Looney, L. (2011). App Inventor. O’Reilly Media, Inc.

Wu, X., Xiao, L., Sun, Y., Zhang, J., Ma, T., and He, L. (2022). A survey of human-in-the-loop for machine learning. Future Generation Computer Systems, 135:364–381.

Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. (2023). A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2).
Published
24/11/2025
SILVA, Eryck; REIS, Julio Cesar dos. Designing an LLM-based Multiagent System for Generating Activities and their Rubrics: A Study on Data Mining. In: SIMPÓSIO BRASILEIRO DE INFORMÁTICA NA EDUCAÇÃO (SBIE), 36., 2025, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 958-972. DOI: https://doi.org/10.5753/sbie.2025.12723.