Evaluating the Accuracy and Stability of Frontier LLMs on ENADE Computer Science Questions

Lucas de Moura Carvalho; Cleiton Moreira de Carvalho Junior; Nabor C. Mendonça

Lucas de Moura Carvalho (UNIFOR)
Cleiton Moreira de Carvalho Junior (UFC)
Nabor C. Mendonça (UNIFOR)

Abstract

Large Language Models (LLMs) are increasingly evaluated on complex reasoning tasks, yet most benchmarks rely on a single prompt per question and therefore fail to account for output variability. This study investigates both the accuracy and the response stability of four multimodal frontier LLMs (GPT-4o, o1, Gemini 2.0 Flash, and DeepSeek-R1) on the multiple-choice Computer Science section of Brazil’s 2021 ENADE exam. Each model was prompted ten independent times per question, and its performance was evaluated using five complementary accuracy and stability metrics. Our results reveal substantial variation in both accuracy and stability across models and questions. Gemini 2.0 Flash achieved the highest accuracy while also exhibiting strong response stability, followed by o1 and DeepSeek-R1. In contrast, GPT-4o showed lower accuracy and notably lower stability. Higher stability did not always accompany higher accuracy: some models were more stable than others yet less accurate. A per-question analysis shows that questions excluded from the ENADE score due to psychometric anomalies often coincide with high model disagreement or systematic errors, although some excluded questions were answered consistently by most models. This offers insight into the use of LLMs as complementary tools for exam validation. These findings reinforce the importance of multi-run evaluations and stability-focused metrics when benchmarking LLMs, especially in high-stakes educational contexts.