LLM-Based Automatic Generation of Multiple-Choice Questions With Meaningful Distractors
Abstract
Creating effective Multiple-Choice Questions (MCQs) with high-quality distractors that rigorously assess students is challenging. Large Language Models (LLMs) can help by generating such questions automatically. This study presents a novel framework for generating distractors for the Portuguese language. We evaluate the framework using Sabiá-3 (a Portuguese-specific model) and GPT-4o mini (a multilingual model), assess the grammatical and semantic diversity of the generated distractors, and devise a qualitative evaluation through Claude 3 Haiku. Results show that integrating educational principles into prompts enhances the relevance and diversity of distractors, marking progress in automated activity and assessment generation for the Portuguese language.
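For illustration only, the sketch below shows one way the two ideas in the abstract could look in code: a prompt template that embeds item-writing principles into the distractor request, and a Self-BLEU score as a simple lexical proxy for the diversity of the generated options. The prompt wording, example items, and function names are hypothetical; this is not the authors' pipeline, and the framework's actual prompts and metrics may differ.

```python
# Hedged sketch: (1) a prompt template embedding educational principles
# (plausible, grammatically compatible, misconception-based distractors),
# (2) Self-BLEU as a rough diversity proxy (lower = more diverse).
# The prompt and examples are illustrative placeholders, not the paper's own.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical Portuguese prompt; it would be formatted and sent to the
# chosen model (e.g., Sabiá-3 or GPT-4o mini) via its provider's API.
PROMPT = (
    "Você é um professor elaborando uma questão de múltipla escolha.\n"
    "Questão: {stem}\n"
    "Resposta correta: {answer}\n"
    "Gere 3 distratores plausíveis, gramaticalmente compatíveis com a "
    "resposta correta e baseados em equívocos comuns dos estudantes."
)

def self_bleu(distractors: list[str]) -> float:
    """Average BLEU of each distractor against the others (Self-BLEU)."""
    smooth = SmoothingFunction().method1
    tokenized = [d.lower().split() for d in distractors]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = [t for j, t in enumerate(tokenized) if j != i]
        scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Illustrative generated options for a hypothetical biology item.
    candidates = [
        "A fotossíntese ocorre nas mitocôndrias.",
        "A fotossíntese ocorre no núcleo da célula.",
        "A fotossíntese ocorre nos ribossomos.",
    ]
    print(PROMPT.format(stem="Onde ocorre a fotossíntese?",
                        answer="Nos cloroplastos."))
    print(f"Self-BLEU: {self_bleu(candidates):.3f}")  # lower = more diverse
```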
Published
24/11/2025
How to Cite
CHICO, Víctor Jesús Sotelo; REGINO, André Gomes; BONACIN, Rodrigo; REIS, Julio Cesar dos. LLM-Based Automatic Generation of Multiple-Choice Questions With Meaningful Distractors. In: SIMPÓSIO BRASILEIRO DE INFORMÁTICA NA EDUCAÇÃO (SBIE), 36., 2025, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 813-827. DOI: https://doi.org/10.5753/sbie.2025.12675.
