Assessing DeepSeek-R1’s Performance on Brazil’s Defining National Education Benchmark
Abstract
This paper evaluates DeepSeek-R1’s performance on Brazil’s National High School Exam (ENEM), a culturally specific educational benchmark that tests reasoning across multiple domains. Using the pass@k metric on three years (2022–2024) of ENEM questions, we found strong reasoning capabilities, particularly in Human Sciences, where 2023–2024 pass@1 scores exceeded 0.98. Mathematics was the most variable domain, with scores differing across the three years. The model demonstrated sophisticated self-translation capabilities when handling questions in Brazilian Portuguese, without being given explicit translation instructions. Despite strong overall performance, inconsistencies across subject domains persist. Our findings suggest that recent advances in AI reasoning extend beyond typical AI benchmarks to diverse cultural contexts, with implications for AI deployment in global settings. This evaluation contributes to understanding how generative AI systems perform on multidisciplinary, culturally situated reasoning challenges.
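For readers unfamiliar with the pass@k metric named in the abstract, the following is a minimal sketch of the widely used unbiased estimator, 1 - C(n - c, k) / C(n, k), where n answers are sampled per question and c of them are correct. The abstract does not state how many samples were drawn per question or which estimator variant the authors used, so the function names, parameter names, and example counts below are illustrative assumptions rather than the paper's implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k sampled answers is correct.

    n: answers sampled for one question (assumed value, not from the paper)
    c: how many of those n answers were correct
    k: budget of attempts being scored
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of wrong answers;
    # it is 0 when fewer than k wrong answers exist, giving a score of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-question estimate over an exam (hypothetical helper)."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With k = 1 the estimator reduces to c / n, so a pass@1 score above 0.98 means that, averaged over questions, more than 98% of individual sampled answers were correct.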
Published
29/09/2025
How to Cite
BENDER, Alexandre Thurow; GOMES, Gabriel Almeida; CORRÊA, Ulisses Brisolara; ARAUJO, Ricardo Matsumura.
Assessing DeepSeek-R1’s Performance on Brazil’s Defining National Education Benchmark. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE.
Anais [...].
Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 65-77.
ISSN 2643-6264.
