Is Machine-Translation Enough? Understanding Impacts in LLM Benchmarking
Abstract
The Massive Multitask Language Understanding (MMLU) benchmark is widely employed to assess the general capabilities of Large Language Models (LLMs) across diverse domains. The predominance of English in training data and benchmark datasets introduces inherent limitations when extending assessments to a multilingual context. Portuguese results are often grouped within broad "multilingual" categories, preventing meaningful examination of model behavior and language-specific challenges. This study investigates the impact of English-to-Portuguese translation methods on MMLU performance, comparing Google Translate, GPT-3.5, GPT-4o, and professional human translation. We evaluate three multilingual open-weight models and Sabiá 3, a state-of-the-art Portuguese-specialized model. Contrary to expectations, accuracy on machine-translated versions of MMLU is slightly higher than on the human-translated version. At a 95% confidence level, the expected accuracy increase for GPT-4o's translation is the largest but still small, between 0.85% and 1.52%. We further evaluate these translation methods under standardized answer-order perturbations; although a slight decrease in accuracy is observed, the standard deviation among translation methods is only 0.07%. These results support the trustworthiness of machine translation for LLM benchmarking in Portuguese, providing a stepping stone toward better performance evaluation through translated state-of-the-art benchmarks.
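To make the answer-order perturbation concrete, the sketch below shows one plausible reading of such a protocol: the options of a multiple-choice item are cyclically rotated and the gold label is remapped, so accuracy can be compared across option orderings. The example item and the rotate_options helper are illustrative assumptions, not the paper's exact implementation.

    # Hypothetical MMLU-style item: a question, four options, and the gold index.
    item = {
        "question": "Qual é a capital do Brasil?",
        "options": ["Rio de Janeiro", "Brasília", "São Paulo", "Salvador"],
        "answer": 1,  # index of the correct option ("Brasília")
    }

    def rotate_options(item, shift):
        """Standardized answer-order perturbation: cyclically rotate the
        options by `shift` positions and remap the gold-answer index."""
        n = len(item["options"])
        rotated = [item["options"][(i - shift) % n] for i in range(n)]
        return {
            "question": item["question"],
            "options": rotated,
            "answer": (item["answer"] + shift) % n,
        }

    # One perturbed copy per rotation; a model is scored on each copy and
    # accuracy is averaged, exposing any answer-position bias.
    perturbed = [rotate_options(item, s) for s in range(len(item["options"]))]
    for p in perturbed:
        assert p["options"][p["answer"]] == "Brasília"  # gold label preserved

Averaging accuracy over such rotations for each translation method, then comparing the per-method means, yields the kind of spread (0.07% standard deviation) the abstract reports.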
Published
29/09/2025
How to Cite
PEDROSA, Taígo Ítalo de Moraes; COSTA, Evandro de Barros; SANTOS, Robério José Rogério dos. Is Machine-Translation Enough? Understanding Impacts in LLM Benchmarking. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 107-120. ISSN 2643-6264.
