Abstract
A common trend in recent studies of language models (LMs) is the use of standardized tests for evaluation. However, despite Portuguese being the fifth most spoken language worldwide, few such evaluations have been conducted in it, mainly because of the lack of high-quality datasets available to the community for this purpose. To address this gap, we introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP. The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects. Furthermore, BLUEX includes a collection of recently administered exams that are unlikely to be included in the training data of many popular LMs as of 2023. The dataset is also annotated to indicate the position of images in each question, providing a valuable resource for advancing the state-of-the-art in multimodal language understanding and reasoning. We describe the creation and characteristics of BLUEX and establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing research on natural language understanding and reasoning in Portuguese. The data and relevant code can be found at https://github.com/Portuguese-Benchmark-Datasets/BLUEX.
Notes
1. The average and cutoff scores are reported by the entities responsible for administering the exams. The results presented in Table 3 are the average over all the exams contained in the BLUEX dataset.
7 Appendix
7.1 Prompt for Evaluation
The prompt used for all the experiments in this paper is shown in Fig. 3.
7.2 Benchmark per Subject
Table 4 provides a detailed report of each model's accuracy by subject. Questions associated with more than one subject contribute to the accuracy of each of them. For example, a question related to both mathematics and English is taken into account when calculating the accuracy for mathematics and for English.
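The per-subject scoring described above can be sketched as follows. This is a minimal illustration, not the paper's actual evaluation code; the record fields (`subjects`, `correct`) are hypothetical stand-ins for the BLUEX annotation schema.

```python
from collections import defaultdict

# Hypothetical mini-sample of annotated questions; field names are
# illustrative, not the actual BLUEX schema.
questions = [
    {"subjects": ["mathematics"], "correct": True},
    {"subjects": ["mathematics", "english"], "correct": False},
    {"subjects": ["english"], "correct": True},
]

def accuracy_per_subject(questions):
    """Compute accuracy per subject, counting multi-subject
    questions toward every subject they are tagged with."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for q in questions:
        for subject in q["subjects"]:
            totals[subject] += 1
            hits[subject] += int(q["correct"])
    return {s: hits[s] / totals[s] for s in totals}

print(accuracy_per_subject(questions))
# mathematics: 1 correct of 2 -> 0.5; english: 1 correct of 2 -> 0.5
```

Note that under this convention the per-subject totals sum to more than the number of questions, since multi-subject questions are counted once per subject.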
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Almeida, T.S., Laitz, T., Bonás, G.K., Nogueira, R. (2023). BLUEX: A Benchmark Based on Brazilian Leading Universities Entrance eXams. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science(), vol 14195. Springer, Cham. https://doi.org/10.1007/978-3-031-45368-7_22