Evaluating common sense in language models through benchmarks: The Winograd challenge applied to ChatGPT in Brazilian Portuguese
Abstract
Benchmark-based assessment is an effective way to probe the limits of language models' comprehension. In this context, the Winograd Schema Challenge, which evaluates common sense through pronoun disambiguation tasks, has motivated the development of a variety of metrics and datasets. Applying a Brazilian Portuguese translation of the Winograd Schema Challenge to ChatGPT, we obtained results comparable to those reported for English. These results must nevertheless be interpreted with caution, given potential biases in the model's training process and the gaps that remain in the reasoning dimensions covered by the available evaluation methods.
Keywords:
Language models, common sense, benchmarks, Winograd challenge, ChatGPT
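To make the task format concrete, the sketch below shows how a single Winograd-style pronoun-disambiguation item could be posed to ChatGPT through the OpenAI chat API. It is a minimal illustration only: the schema is the classic Levesque example rendered in Portuguese for illustration (not an item from the dataset used in the paper), the model name and the naive string-matching check are assumptions, and the snippet does not reproduce the authors' actual evaluation protocol.

```python
# Minimal sketch: posing one Winograd-style pronoun-disambiguation item to a chat model.
# Illustrative only -- not the authors' dataset, prompt wording, or scoring protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Classic Levesque schema, rendered in Portuguese for illustration (hypothetical item).
schema = {
    "sentence": ("Os vereadores negaram a autorização aos manifestantes "
                 "porque eles temiam a violência."),
    "pronoun": "eles",
    "candidates": ["os vereadores", "os manifestantes"],
    "answer": "os vereadores",
}

prompt = (
    f'Na frase: "{schema["sentence"]}" '
    f'A quem o pronome "{schema["pronoun"]}" se refere: '
    f'"{schema["candidates"][0]}" ou "{schema["candidates"][1]}"? '
    "Responda apenas com a opção escolhida."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model name; the paper evaluates ChatGPT
    messages=[{"role": "user", "content": prompt}],
    temperature=0,          # reduce sampling variance when scoring
)

reply = response.choices[0].message.content.strip().lower()
# Naive scoring: count the answer correct if the expected referent appears in the reply.
print("model reply:", reply)
print("correct" if schema["answer"] in reply else "incorrect")
```

In a Winograd schema, swapping a single word (here, "temiam a violência" versus "defendiam a violência") flips the correct referent, which is what makes the task a probe of common sense rather than of surface statistics.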
References
Amsili, P.; Seminck, O. (2017) “A Google-Proof Collection of French Winograd Schemas”, Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017), p. 24-29. https://doi.org/10.18653/v1/W17-1504
Bender, E. M. et al. (2021) “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”, FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, p. 610-623. https://doi.org/10.1145/3442188.3445922
Bernard, T.; Han, T. (2020) “Mandarinograd: A Chinese Collection of Winograd Schemas”, Proceedings of the Twelfth Language Resources and Evaluation Conference, p. 21-26. https://aclanthology.org/2020.lrec-1.3
Brown, T. B. et al. (2020) “Language models are few-shot learners”, NIPS'20: Proceedings of the 34th International Conference on Neural Information Processing Systems, p. 1877-1901. [link].
Davis, E. (2023) “Benchmarks for Automated Commonsense Reasoning: A Survey”, arXiv:2302.04752v2. https://doi.org/10.48550/arXiv.2302.04752
Emelin, D.; Sennrich, R. (2021) “Wino-X: Multilingual Winograd Schemas for Commonsense Reasoning and Coreference Resolution”, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, p. 8517-8532. https://doi.org/10.18653/v1/2021.emnlp-main.670
Floridi, L. (2023) “AI as Agency Without Intelligence: on ChatGPT, Large Language Models, and Other Generative Models”, Philosophy & Technology 36 (15). https://doi.org/10.1007/s13347-023-00621-y
French, R. M. (2000) “The Turing Test: The First 50 Years”, Trends in Cognitive Sciences 4 (3), p. 115-122. https://doi.org/10.1016/S1364-6613(00)01453-4
He, W. et al. (2021) “WINOLOGIC: A Zero-Shot Logic-based Diagnostic Dataset for Winograd Schema Challenge”, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, p. 3779-3789. https://doi.org/10.18653/v1/2021.emnlp-main.307
Isaak, N.; Michael, L. (2019) “WinoFlexi: A Crowdsourcing Platform for the Development of Winograd Schemas”. In: Liu, J., Bailey, J. (eds) AI 2019: Advances in Artificial Intelligence. Lecture Notes in Computer Science 11919. https://doi.org/10.1007/978-3-030-35288-2_24
Isaak, N.; Michael, L. (2020) “Winventor: A Machine-driven Approach for the Development of Winograd Schemas”, Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, p. 26-35. https://doi.org/10.5220/0008902600260035
Kocijan, V. et al. (2023) “The Defeat of the Winograd Schema Challenge”, arXiv:2201.02387v3. https://doi.org/10.48550/arXiv.2201.02387
Levesque, H. J.; Davis, E.; Morgenstern, L. (2012) “The Winograd Schema Challenge”, Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, p. 552-561. https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf
Melo, G. S. D.; Imaizumi, V. A.; Cozman, F. G. (2019) “Winograd Schemas in Portuguese”, Anais do Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2019), p. 787-798. https://doi.org/10.5753/eniac.2019.9334
OpenAI (2023) “GPT-4 Technical Report”, arXiv:2303.08774v3. https://doi.org/10.48550/arXiv.2303.08774
Petrov, A. et al. (2023) “Language Model Tokenizers Introduce Unfairness Between Languages”, arXiv:2305.15425v1. https://doi.org/10.48550/arXiv.2305.15425
Pires, R. et al. (2023) “Sabiá: Portuguese Large Language Models”, arXiv:2304.07880v2. https://doi.org/10.48550/arXiv.2304.07880
Sakaguchi, K. et al. (2021) “WinoGrande: An Adversarial Winograd Schema Challenge at Scale”, Communications of the ACM 64(9), p. 99-106. https://doi.org/10.1145/3474381
Shavrina, T. et al. (2020) “RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark”, EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, p. 4717-4726. https://doi.org/10.18653/v1/2020.emnlp-main.381
Storks, S.; Gao, Q.; Chai, J. Y. (2019) “Recent Advances in Natural Language Inference: A Survey of Benchmarks, Resources, and Approaches”, arXiv:1904.01172v3. https://doi.org/10.48550/arXiv.1904.01172
Turing, A. M. (1950) “Computing machinery and intelligence”, Mind LIX (236), p. 433-460. https://doi.org/10.1093/mind/LIX.236.433
Vadász, N.; Ligeti-Nagy, N. (2022) “Winograd schemata and other datasets for anaphora resolution in Hungarian”, Acta Linguistica Academica 69 (4), p. 564-580. https://doi.org/10.1556/2062.2022.00575
Wang, A. et al. (2018) “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, p. 353-355. https://doi.org/10.18653/v1/W18-5446
Wang, A. et al. (2019) “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems”, NIPS'19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, p. 3266-3280. https://dl.acm.org/doi/10.5555/3454287.3454581
Winograd, T. (1972) “Understanding natural language”, Cognitive Psychology 3(1), p. 1-191. https://doi.org/10.1016/0010-0285(72)90002-3
Published
2023-09-25
How to Cite
DO NASCIMENTO, Thiago Gomes; CORTIZ, Diogo. Evaluating common sense in language models through benchmarks: The Winograd challenge applied to ChatGPT in Brazilian Portuguese. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 14., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 193-198. DOI: https://doi.org/10.5753/stil.2023.233957.
