Is Machine-Translation Enough? Understanding Impacts in LLM Benchmarking
Abstract
The Massive Multitask Language Understanding (MMLU) benchmark is widely employed to assess the general capabilities of Large Language Models (LLMs) across diverse domains. The predominance of English in training data and benchmark datasets introduces inherent limitations when extending assessments to a multilingual context. Portuguese results are often grouped within broad "multilingual" categories, preventing meaningful examination of model behavior and language-specific challenges. This study investigates the impact of English-to-Portuguese translation methods on MMLU performance, comparing Google Translate, GPT-3.5, GPT-4o, and professional human translation. We evaluate three multilingual open-weight models and Sabiá 3, a state-of-the-art Portuguese-specialized model. Contrary to expectations, accuracy on machine-translated versions of MMLU is slightly higher than on the human-translated version. At a 95% confidence level, the expected accuracy increase for GPT-4o's translation is the largest but still small, between 0.85% and 1.52%. We further evaluate these translation methods under standardized answer-order perturbations; although a slight decrease in accuracy is observed, the standard deviation among translation methods is only 0.07%. These results support the trustworthiness of machine translation for LLM benchmarking in Portuguese, providing a stepping stone toward better performance evaluation through translated state-of-the-art benchmarks.
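To make the answer-order perturbation concrete, the sketch below shows one plausible reading of such a protocol: the options of a multiple-choice item are cyclically rotated and the gold label is remapped, so accuracy can be compared across option orderings. The example item and the rotate_options helper are illustrative assumptions, not the paper's exact implementation.

    # Hypothetical MMLU-style item: a question, four options, and the gold index.
    item = {
        "question": "Qual é a capital do Brasil?",
        "options": ["Rio de Janeiro", "Brasília", "São Paulo", "Salvador"],
        "answer": 1,  # index of the correct option ("Brasília")
    }

    def rotate_options(item, shift):
        """Standardized answer-order perturbation: cyclically rotate the
        options by `shift` positions and remap the gold-answer index."""
        n = len(item["options"])
        rotated = [item["options"][(i - shift) % n] for i in range(n)]
        return {
            "question": item["question"],
            "options": rotated,
            "answer": (item["answer"] + shift) % n,
        }

    # One perturbed copy per rotation; a model is scored on each copy and
    # accuracy is averaged, exposing any answer-position bias.
    perturbed = [rotate_options(item, s) for s in range(len(item["options"]))]
    for p in perturbed:
        assert p["options"][p["answer"]] == "Brasília"  # gold label preserved

Averaging accuracy over such rotations for each translation method, then comparing the per-method means, yields the kind of spread (0.07% standard deviation) the abstract reports.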
Published
29/09/2025
How to Cite
PEDROSA, Taígo Ítalo de Moraes; COSTA, Evandro de Barros; SANTOS, Robério José Rogério dos. Is Machine-Translation Enough? Understanding Impacts in LLM Benchmarking. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 107-120. ISSN 2643-6264.
