Evaluating common sense in language models through benchmarks: The Winograd challenge applied to ChatGPT in Brazilian Portuguese

Abstract


Benchmark-based assessment is an effective way to probe the comprehension limits of language models. In this context, the Winograd Schema Challenge, which evaluates common sense through pronoun-disambiguation tasks, has motivated the development of a range of metrics and datasets. Applying a Brazilian Portuguese translation of the Winograd Schema Challenge to ChatGPT, we obtained results comparable to those reported for English. These results must nonetheless be interpreted with caution, given potential biases in the model's training process and the gaps that remain in the reasoning dimensions covered by available evaluation methods.
Keywords: Language models, common sense, benchmarks, Winograd challenge, ChatGPT

Published: 2023-09-25

DO NASCIMENTO, Thiago Gomes; CORTIZ, Diogo. Evaluating common sense in language models through benchmarks: The Winograd challenge applied to ChatGPT in Brazilian Portuguese. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 14., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 193-198. DOI: https://doi.org/10.5753/stil.2023.233957