Smarter Questions, Smaller Models: RAG-Enhanced Multiple-Choice Question Generation for POSCOMP

  • José Robson da Silva Araujo Junior (UFCG)
  • Leandro Balby Marinho (UFCG)
  • Lívia Sampaio Campos (UFCG)
  • Kemilli Nicole dos Santos Lima (UFCG)
  • David Eduardo Pereira (UFCG)
  • Helen Bento Cavalcanti (UFCG)
  • Ana Luíza Cavalcante Ramos (UFCG)
  • Eliane Cristina de Araújo (UFCG)

Abstract

Generating high-quality multiple-choice questions (MCQs) for specialized exams such as POSCOMP, the Brazilian national exam for admission to graduate programs in Computing, remains a complex and labor-intensive task. This paper proposes a Retrieval-Augmented Generation (RAG) approach to support MCQ creation with Large Language Models (LLMs). We introduce a novel dataset of 1,340 past POSCOMP questions, enriched with LLM-classified themes that show strong agreement with human annotations. The RAG method was compared to a few-shot baseline across five LLMs, producing 120 MCQs evaluated by human experts and by an LLM-as-a-judge following a detailed rubric. Results show that the RAG approach improves question quality on up to half of the evaluated criteria, highlighting its potential for educational assessment tasks.
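The retrieve-then-generate loop the abstract describes can be pictured concretely. Below is a minimal, self-contained sketch assuming a toy bag-of-words similarity in place of a real sentence encoder; the function names (`retrieve`, `build_prompt`, `call_llm`), the stand-in corpus, and the prompt wording are all hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of a RAG pipeline for MCQ generation: retrieve
# thematically similar past exam questions, then prompt an LLM with them
# as grounded few-shot context for writing a new question.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use a sentence encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Rank the question bank by similarity to the target theme, keep top-k.
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(theme: str, examples: list[str]) -> str:
    # Assemble the retrieved questions as few-shot exemplars for the LLM.
    shots = "\n\n".join(f"Example:\n{e}" for e in examples)
    return (
        f"You write POSCOMP-style multiple-choice questions.\n\n{shots}\n\n"
        f"Write one new MCQ on '{theme}' with five options (A-E), "
        f"exactly one correct answer, and plausible distractors."
    )

# Usage: three stand-in questions take the place of the 1,340-question
# dataset; `call_llm` is a placeholder for any chat-completion API.
corpus = [
    "Which sorting algorithm has worst-case time complexity O(n log n)? ...",
    "In a connected graph with n vertices, how many edges does a spanning tree have? ...",
    "Which CPU scheduling policy can lead to starvation? ...",
]
prompt = build_prompt("graph algorithms", retrieve("graph algorithms", corpus))
print(prompt)  # send `prompt` to the LLM of choice, e.g. call_llm(prompt)
```

In the paper's setting, the corpus would be the 1,340-question POSCOMP dataset, and retrieval could key on the LLM-classified themes rather than raw lexical overlap.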

Published
24/11/2025
ARAUJO JUNIOR, José Robson da Silva; MARINHO, Leandro Balby; CAMPOS, Lívia Sampaio; LIMA, Kemilli Nicole dos Santos; PEREIRA, David Eduardo; CAVALCANTI, Helen Bento; RAMOS, Ana Luíza Cavalcante; ARAÚJO, Eliane Cristina de. Smarter Questions, Smaller Models: RAG-Enhanced Multiple-Choice Question Generation for POSCOMP. In: SIMPÓSIO BRASILEIRO DE INFORMÁTICA NA EDUCAÇÃO (SBIE), 36., 2025, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1233-1247. DOI: https://doi.org/10.5753/sbie.2025.12854.