Evaluating Large Language Models through Multidimensional Item Response Theory: A Comprehensive Case Study on ENEM
Abstract
LLM evaluations on tasks like high-stakes multidisciplinary tests still rely on raw accuracy, a metric that weights easy and difficult questions equally and ignores guessing. To help bridge this methodological gap, we repurpose the official three-parameter logistic (3PL) Item Response Theory (IRT) calibration that the Brazilian education authority (INEP) uses to score humans on the Exame Nacional do Ensino Médio (ENEM), and apply it to LLM responses. We then fit a four-dimensional 3PL model aligned with ENEM's knowledge domains. Results show that similar accuracies can mask proficiency gaps exceeding one standard deviation across domains. Mathematics remains the toughest domain for both humans and models, whereas questions on Human Sciences are systematically easier for both.
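For context, the 3PL calibration described above follows the standard item response function from the IRT literature (Baker, 2001); the form below is the usual textbook notation, not an excerpt from the paper:

\[
P(U_{ij} = 1 \mid \theta_j) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta_j - b_i)}}
\]

where a_i, b_i, and c_i are item i's discrimination, difficulty, and pseudo-guessing (lower-asymptote) parameters, and theta_j is respondent j's proficiency. In a multidimensional fit, theta_j becomes a vector with one coordinate per knowledge domain and the linear predictor becomes a_i' theta_j + d_i.

A minimal fitting sketch in R with the mirt package (Chalmers, 2012) is given below. It assumes a 0/1 response matrix named responses whose columns follow ENEM's usual 45-items-per-domain ordering; the object names and the confirmatory item-to-domain mapping are illustrative assumptions, not code from the paper.

    library(mirt)  # multidimensional IRT estimation (Chalmers, 2012)

    # Illustrative confirmatory structure: one correlated latent trait per ENEM
    # domain, assuming columns 1-45 = Languages and Codes, 46-90 = Human Sciences,
    # 91-135 = Natural Sciences, 136-180 = Mathematics.
    spec <- mirt.model("
      LC = 1-45
      CH = 46-90
      CN = 91-135
      MT = 136-180
      COV = LC*CH, LC*CN, LC*MT, CH*CN, CH*MT, CN*MT
    ")

    # Four-dimensional 3PL fit; QMCEM handles the higher-dimensional integration.
    fit <- mirt(responses, spec, itemtype = "3PL", method = "QMCEM")

    # Expected a posteriori (EAP) proficiency estimates, one column per domain.
    thetas <- fscores(fit, method = "EAP")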
References
Abonizio, H. Q., Almeida, T. S., Laitz, T. S., Junior, R. M., Bonás, G. K., Nogueira, R., and Pires, R. (2024). Sabiá-3 technical report. CoRR, abs/2410.12049.
Baker, F. B. (2001). The Basics of Item Response Theory. Heinemann, second edition. ISBN 1-886047-03-0.
Bassett, R. and Deride, J. (2016). Maximum a posteriori estimators as a limit of Bayes estimators. Mathematical Programming, 174.
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6):1–29.
Chow, J. C., Cheng, T. Y., Chien, T.-W., and Chou, W. (2024). Assessing ChatGPT's capability for multiple-choice questions using RaschOnline: Observational study. JMIR Form Res, 8:e46800.
DeepSeek-AI (2024). DeepSeek-V3 technical report. arXiv:2412.19437. [link].
DeepSeek-AI (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948. [link].
EleutherAI (2024). The language model evaluation harness.
INEP (2021). Accessed 8 May 2025 [link].
INEP (2022). Accessed 8 May 2025 [link].
INEP (2023). Accessed 8 May 2025 [link].
Liu, Y., Bhandari, S., and Pardos, Z. A. (2025). Leveraging llm respondents for item evaluation: A psychometric analysis. British Journal of Educational Technology, (Early View):1–25.
Meta AI (2024a). The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. Accessed 16 Jun 2025.
Meta AI (2024b). The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. Announcement of the Llama 4 family, including Llama 4 Scout (17B activated / 109B total parameters; 10M-token context; knowledge cutoff August 2024). Accessed 16 Jun 2025.
Nunes, D., Primi, R., Pires, R., Lotufo, R. A., and Nogueira, R. (2023). Evaluating GPT-3.5 and GPT-4 models on Brazilian university admission exams. CoRR, abs/2303.17003.
OpenAI. Deep research. Accessed 16 Jun 2025. [link].
OpenAI (2024a). GPT-4o System Card. Accessed 13 Jun 2025. [link].
OpenAI (2024b). GPT-4o mini: Advancing Cost-Efficient Intelligence. Released 18 Jul 2024; accessed 13 Jun 2025. [link].
OpenAI (2024c). OpenAI o1 System Card. Updated 5 Dec 2024; accessed 13 Jun 2025. [link].
OpenAI (2025). OpenAI o3 and o4-mini System Card. Published 16 Apr 2025; accessed 13 Jun 2025. [link].
Pires, R., Abonizio, H., Almeida, T., and Nogueira, R. (2023a). Sabiá: Portuguese large language models. In Anais da XII Brazilian Conference on Intelligent Systems, pages 226–240, Porto Alegre, RS, Brasil. SBC.
Pires, R., Almeida, T. S., Abonizio, H. Q., and Nogueira, R. (2023b). Evaluating GPT-4's vision capabilities on Brazilian university admission exams. CoRR, abs/2311.14169.
R Core Team (2025). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Silveira, I. C. and Mauá, D. D. (2017). University entrance exam as a guiding test for artificial intelligence. In 2017 Brazilian Conference on Intelligent Systems, BRACIS 2017, Uberlândia, Brazil, October 2-5, 2017, pages 426–431. IEEE Computer Society.
Silveira, I. C. and Mauá, D. D. (2018). Advances in automatically solving the ENEM. In 7th Brazilian Conference on Intelligent Systems, BRACIS 2018, São Paulo, Brazil, October 22-25, 2018, pages 43–48. IEEE Computer Society.
Superbi, J., Pinto, H., Santos, E., Lattari, L., and Castro, B. (2024). Enhancing large language model performance on ENEM math questions using retrieval-augmented generation. In Anais do XVIII Brazilian e-Science Workshop, pages 56–63, Porto Alegre, RS, Brasil. SBC.
Taschetto, L. and Fileto, R. (2024). Using retrieval-augmented generation to improve performance of large language models on the brazilian university admission exam. In Anais do XXXIX Simpósio Brasileiro de Bancos de Dados, pages 799–805, Porto Alegre, RS, Brasil. SBC.
Wei, J. et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
Zhang, X., Li, C., Zong, Y., Ying, Z., He, L., and Qiu, X. (2023). Evaluating the performance of large language models on GAOKAO benchmark. arXiv, abs/2305.12474.
Zong, Y. and Qiu, X. (2024). GAOKAO-MM: A Chinese human-level benchmark for multimodal models evaluation. In Ku, L.-W., Martins, A., and Srikumar, V., editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 8817–8825, Bangkok, Thailand. Association for Computational Linguistics.
Published
2025-09-29
How to Cite
TASCHETTO, Leonardo; FILETO, Renato. Evaluating Large Language Models through Multidimensional Item Response Theory: A Comprehensive Case Study on ENEM. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 455-466. DOI: https://doi.org/10.5753/stil.2025.37846.
