BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning

Abstract


With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including the 2024-2025 exams and automatically generated image captions produced with state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. The captioning strategies increase the number of questions accessible to text-only models by more than 40%, producing 1,422 usable questions, more than double the number available in the original BLUEX. We evaluate commercial and open-source LLMs and assess their ability to leverage visual context through captions.
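The captioning pipeline itself is not reproduced in this record. As a rough sketch of the idea described above, the snippet below captions a question image with a vision-capable model and splices the caption into the question text so that a text-only model can attempt the item. The model name (gpt-4o), the [IMAGE] placeholder, the prompt, and the field layout are illustrative assumptions, not the authors' implementation.

```python
"""Minimal sketch of the captioning idea described in the abstract.

Assumptions (not taken from the paper): the question text marks its figure
with an "[IMAGE]" placeholder, images are local PNG files, and captions are
produced by an OpenAI vision-capable model ("gpt-4o"). The authors' actual
pipeline, prompts, and data fields may differ.
"""
import base64

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def caption_image(image_path: str) -> str:
    """Ask a vision-capable model for a dense, exam-oriented description."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this exam figure in detail, including any "
                         "text, labels, axes, and numerical values."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


def make_text_only(question: str, image_path: str) -> str:
    """Replace the image placeholder with its generated caption so a
    text-only LLM can attempt the question."""
    caption = caption_image(image_path)
    return question.replace("[IMAGE]", f"[Figure description: {caption}]")
```

Under these assumptions, replacing each image with such a description is what makes previously image-dependent questions answerable by text-only models, which drives the increase in usable questions reported above.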

References

Abonizio, H. et al. (2024). Sabiá-3 technical report. arXiv preprint arXiv:2410.12049.

Almeida, T. S. (2025). Revisited bluex benchmark - code repository. [link]. Accessed: 2025-08-07.

Almeida, T. S. et al. (2025). Tiebe: Tracking language model recall of notable worldwide events through time. arXiv preprint arXiv:2501.07482.

Almeida, T. S. et al. (2023). Bluex: A benchmark based on brazilian leading universities entrance exams. In Brazilian Conference on Intelligent Systems, pages 337–347. Springer.

Bianco, S. et al. (2023). Improving image captioning descriptiveness by ranking and llm-based fusion. arXiv preprint arXiv:2306.11593.

Chang, Y. et al. (2023). Booookscore: A systematic exploration of book-length summarization in the era of llms. arXiv preprint arXiv:2310.00785.

Chen, M. et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Portuguese Benchmark Datasets (2025). Bluex: Brazilian undergraduate entrance exams benchmark. [link]. Accessed: 2025-08-07.

Delfino, P. et al. (2017). Passing the brazilian oab exam: data preparation and some experiments. In Legal knowledge and information systems, pages 89–94. IOS Press.

Dubey, A. et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Gao, L. et al. (2024). The language model evaluation harness.

Grattafiori, A. et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Hendrycks, D. et al. (2021). Measuring massive multitask language understanding.

Hu, Y. et al. (2023). Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417.

Lazaridou, A. et al. (2022). Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115.

Li, B. et al. (2024). Seed-bench-2-plus: Benchmarking multimodal large language models with text-rich visual comprehension. arXiv preprint arXiv:2404.16790.

Li, B. et al. (2023). Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125.

Liu, A. et al. (2024a). Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.

Liu, J. et al. (2023). Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36:21558–21572.

Liu, X. et al. (2024b). Mm-safetybench: A benchmark for safety evaluation of multimodal large language models.

Nam, D. et al. (2024). Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13.

OpenAI (2024). Gpt-4o system card. arXiv preprint arXiv:2410.21276.

OpenAI et al. (2024). Gpt-4 technical report.

Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.

Patraucean, V. et al. (2023). Perception test: A diagnostic benchmark for multimodal video models. In Oh, A. et al., editors, Advances in Neural Information Processing Systems, volume 36, pages 42748–42761. Curran Associates, Inc.

Petroni, F. et al. (2019). Language models as knowledge bases? arXiv preprint arXiv:1909.01066.

Pi, R. et al. (2024). Image textualization: An automatic framework for generating rich and detailed image descriptions. Advances in Neural Information Processing Systems, 37:108116–108139.

Pires, R. et al. (2023). Evaluating gpt-4’s vision capabilities on brazilian university admission exams. arXiv preprint arXiv:2311.14169.

Rein, D. et al. (2024). Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling.

ShanghaiRanking Consultancy (2024). Academic ranking of world universities 2024. Accessed: 2025-04-25.

Silveira, I. C. and Mauá, D. D. (2017). University entrance exam as a guiding test for artificial intelligence. In 2017 Brazilian Conference on Intelligent Systems (BRACIS), pages 426–431. IEEE.

Singhal, K. et al. (2025). Toward expert-level medical question answering with large language models. Nature Medicine, pages 1–8.

Team, T. (2024). Falcon 3 family of open foundation models.

Times Higher Education (2024). World university rankings 2024. Accessed: 2025-04-25.

Yang, A. et al. (2024). Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

Zhang, S. et al. (2023). Planning with large language models for code generation. arXiv preprint arXiv:2303.05510.

Zhang, T. et al. (2024a). Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57.

Zhang, Y. et al. (2024b). A comprehensive survey on process-oriented automatic text summarization with exploration of llm-based methods. arXiv preprint arXiv:2403.02901.

Zhong, W. et al. (2023). Agieval: A human-centric benchmark for evaluating foundation models.
Published
2025-09-29
SANTOS, João Guilherme Alves; BONÁS, Giovana Kerche; ALMEIDA, Thales Sales. BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 867-878. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.14256.