Evaluating common sense in language models through benchmarks: the Winograd Schema Challenge applied to ChatGPT in Brazilian Portuguese

Abstract


Benchmark performance is presented as an effective way to assess the limits of language models' comprehension. In this context, the Winograd Schema Challenge, which aims to evaluate common sense through pronoun disambiguation tasks, has given rise to a range of metrics and datasets. Applying the translation of the Winograd Schema Challenge to ChatGPT in Brazilian Portuguese, we found results comparable to those obtained in English. These results should nevertheless be interpreted with caution, given the biases associated with model training and the gaps in the dimensions of reasoning covered by the available evaluation methods.
Keywords: language models, common sense, benchmarks, Winograd Schema Challenge, ChatGPT

Published
25 September 2023
DO NASCIMENTO, Thiago Gomes; CORTIZ, Diogo. Avaliação do senso comum em modelos de linguagem através de benchmarks: Desafio de Winograd aplicado ao ChatGPT em português brasileiro. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 14., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 193-198. DOI: https://doi.org/10.5753/stil.2023.233957.