BLUEX: A Benchmark Based on Brazilian Leading Universities Entrance eXams

Almeida, Thales Sales; Laitz, Thiago; Bonás, Giovana K.; Nogueira, Rodrigo

doi:10.1007/978-3-031-45368-7_22

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14195)

Included in the following conference series:

Brazilian Conference on Intelligent Systems

Abstract

One common trend in recent studies of language models (LMs) is the use of standardized tests for evaluation. However, although Portuguese is the fifth most spoken language worldwide, few such evaluations have been conducted in it, mainly because the community lacks high-quality datasets for evaluation in Portuguese. To address this gap, we introduce the Brazilian Leading Universities Entrance eXams (BLUEX), a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP. The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects. Furthermore, BLUEX includes a collection of recently administered exams that are unlikely to be included in the training data of many popular LMs as of 2023. The dataset is also annotated to indicate the position of images in each question, making it a valuable resource for research on multimodal language understanding and reasoning. We describe the creation and characteristics of BLUEX and establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing natural language understanding and reasoning in Portuguese. The data and relevant code can be found at https://github.com/Portuguese-Benchmark-Datasets/BLUEX.

Notes

1. The average and cutoff scores are reported by the entities responsible for administering the exams. The results presented in Table 3 are the average of all the exams contained in the BLUEX dataset.

Author information

Authors and Affiliations

State University of Campinas (UNICAMP), Campinas, Brazil
Thales Sales Almeida, Thiago Laitz, Giovana K. Bonás & Rodrigo Nogueira
Maritaca AI, Campinas, Brazil
Thales Sales Almeida & Rodrigo Nogueira
NeuralMind AI, Campinas, Brazil
Thiago Laitz

Corresponding author

Correspondence to Thales Sales Almeida.

Editor information

Editors and Affiliations

Federal University of São Carlos, São Carlos, Brazil
Murilo C. Naldi
Centro Universitario da FEI, São Bernardo do Campo, Brazil
Reinaldo A. C. Bianchi

7 Appendix

7.1 Prompt for Evaluation

The prompt used for all the experiments in this paper is shown in Fig. 3.

Table 4. Results for each model by subject in BLUEX.

7.2 Benchmark per Subject

Table 4 reports the accuracy achieved by each model on each subject. Questions associated with more than one subject contribute to the accuracy of every subject they are associated with. For example, a question related to both mathematics and English is taken into account when calculating the accuracy of the mathematics subject and of the English subject.
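
The counting rule described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the field names `subjects` and `answer`, the function `accuracy_by_subject`, and the toy questions are all hypothetical and chosen only to show how a multi-subject question is credited to each of its subjects.

```python
from collections import defaultdict


def accuracy_by_subject(questions, predictions):
    """Return {subject: accuracy}; a question counts toward every subject it lists.

    `questions` is a list of dicts with hypothetical keys "subjects" (list of str)
    and "answer" (str). `predictions` holds one predicted answer per question.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        for subject in question["subjects"]:
            # The same question is tallied once per associated subject.
            total[subject] += 1
            if predicted == question["answer"]:
                correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}


# Toy data invented for illustration only (not taken from BLUEX).
questions = [
    {"subjects": ["mathematics", "english"], "answer": "A"},
    {"subjects": ["mathematics"], "answer": "C"},
    {"subjects": ["english"], "answer": "B"},
]
predictions = ["A", "B", "B"]

scores = accuracy_by_subject(questions, predictions)
print(scores)  # {'mathematics': 0.5, 'english': 1.0}
```

In this toy run the first question is answered correctly and belongs to two subjects, so it raises both the mathematics and the English accuracy. One consequence of this rule is that the per-subject question counts can sum to more than the total number of questions.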

About this paper

Cite this paper

Almeida, T.S., Laitz, T., Bonás, G.K., Nogueira, R. (2023). BLUEX: A Benchmark Based on Brazilian Leading Universities Entrance eXams. In: Naldi, M.C., Bianchi, R.A.C. (eds) Intelligent Systems. BRACIS 2023. Lecture Notes in Computer Science, vol 14195. Springer, Cham. https://doi.org/10.1007/978-3-031-45368-7_22

DOI: https://doi.org/10.1007/978-3-031-45368-7_22
Published: 12 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-45367-0
Online ISBN: 978-3-031-45368-7
eBook Packages: Computer Science, Computer Science (R0)
