Estimating Item Response Theory Parameters with NLP and Machine Learning

  • William Oliveira da Costa e Silva (UFRR)
  • Filipe Dwan Pereira (UFRR)
  • Rafael Mello (CESAR School)

Abstract


Calibrating Item Response Theory (IRT) items is a costly process that depends on student response data. We propose a methodology to predict the difficulty (b) and discrimination (a) parameters of new items from their text alone, removing this dependency. The method trains a regression model using real IRT parameters as targets and a feature set that combines rubrics generated by a Large Language Model (LLM) with text embeddings. The results indicate that the interpretable rubrics are as predictive as the embeddings, validating a workflow for a priori item calibration that can speed up the construction of assessments.
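A minimal sketch of this workflow, assuming the 2PL IRT model and a random-forest regressor; the feature names, dimensions, and synthetic data below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def two_pl(theta, a, b):
    """2PL IRT model: probability of a correct response given ability theta,
    for an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


# Hypothetical features: LLM-generated rubric scores (e.g., 1-5 judgments of
# complexity, readability, etc.) concatenated with text embeddings of each item.
rng = np.random.default_rng(0)
n_items, n_rubrics, emb_dim = 200, 8, 384
rubric_scores = rng.integers(1, 6, size=(n_items, n_rubrics)).astype(float)
embeddings = rng.normal(size=(n_items, emb_dim))
X = np.hstack([rubric_scores, embeddings])

# Target: the difficulty parameter b, which in practice would come from a
# conventional IRT calibration on real response data. Here it is a synthetic
# stand-in that correlates with the rubric scores.
b = 0.5 * rubric_scores.mean(axis=1) - 1.5 + rng.normal(scale=0.3, size=n_items)

# Regressor that predicts b for new, uncalibrated items from text features
# alone; the same setup would apply to the discrimination parameter a.
model = RandomForestRegressor(n_estimators=200, random_state=0)
print("Mean CV R^2:", cross_val_score(model, X, b, cv=5, scoring="r2").mean())
```

Trained this way, such a model can assign provisional a and b values to a brand-new item before any student has answered it, which is what the a priori calibration described in the abstract amounts to.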

Published
24/11/2025
COSTA E SILVA, William Oliveira da; PEREIRA, Filipe Dwan; MELLO, Rafael. Estimando Parâmetros da Teoria de Resposta ao Item com NLP e Aprendizado de Máquina. In: SIMPÓSIO BRASILEIRO DE INFORMÁTICA NA EDUCAÇÃO (SBIE), 36., 2025, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1042-1055. DOI: https://doi.org/10.5753/sbie.2025.12767.