Assessing common sense in language models through benchmarks: the Winograd challenge applied to ChatGPT in Brazilian Portuguese
Abstract
Benchmark performance is presented as an effective way of assessing the limits of language models' comprehension. In this context, the Winograd Schema Challenge, which aims to evaluate common sense through pronoun-disambiguation tasks, has given rise to a variety of metrics and datasets. Applying a translation of the Winograd challenge to ChatGPT in Brazilian Portuguese, we obtained results comparable to those reported for English. These figures must nevertheless be interpreted with caution, given the biases associated with model training and the gaps in the dimensions of reasoning covered by the available evaluation methods.
Keywords:
Language models, common sense, benchmarks, Winograd challenge, ChatGPT
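The evaluation task described in the abstract can be sketched in a few lines: each Winograd schema is a sentence with an ambiguous pronoun, two candidate referents, and a "special word" whose flip reverses the correct answer; accuracy is the fraction of schemas the model resolves correctly. The schema pair below (a classic example translated to Brazilian Portuguese), the field names, and the `model_answer` stub are illustrative assumptions, not the authors' materials or dataset.

```python
from dataclasses import dataclass

@dataclass
class WinogradSchema:
    sentence: str         # sentence containing the ambiguous pronoun
    pronoun: str          # the pronoun to be disambiguated
    candidates: tuple     # the two possible referents
    answer: str           # the correct referent

# A classic schema pair: flipping the special word ("grande" -> "pequena")
# flips which referent the pronoun points to.
SCHEMAS = [
    WinogradSchema(
        "O troféu não cabe na mala porque ele é muito grande.",
        "ele", ("o troféu", "a mala"), "o troféu"),
    WinogradSchema(
        "O troféu não cabe na mala porque ela é muito pequena.",
        "ela", ("o troféu", "a mala"), "a mala"),
]

def model_answer(schema: WinogradSchema) -> str:
    """Stand-in for querying a language model (e.g. ChatGPT) with the
    disambiguation question; here it just picks the first candidate."""
    return schema.candidates[0]

def accuracy(schemas) -> float:
    """Fraction of schemas whose pronoun the model resolves correctly."""
    correct = sum(model_answer(s) == s.answer for s in schemas)
    return correct / len(schemas)

print(f"accuracy: {accuracy(SCHEMAS):.2f}")
```

Note that a model answering at random, or one with a fixed positional bias like the stub above, lands near 0.5 on a balanced schema pair; this is why chance level, not zero, is the baseline against which benchmark scores are read.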
References
Amsili, P.; Seminck, O. (2017) “A Google-Proof Collection of French Winograd Schemas”, Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017), p. 24-29. https://doi.org/10.18653/v1/W17-1504
Bender, E. M. et al. (2021) “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?”, FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, p. 610-623. https://doi.org/10.1145/3442188.3445922
Bernard, T.; Han, T. (2020) “Mandarinograd: A Chinese Collection of Winograd Schemas”, Proceedings of the Twelfth Language Resources and Evaluation Conference, p. 21-26. https://aclanthology.org/2020.lrec-1.3
Brown, T. B. et al. (2020) “Language models are few-shot learners”, NIPS'20: Proceedings of the 34th International Conference on Neural Information Processing Systems, p. 1877-1901.
Davis, E. (2023) “Benchmarks for Automated Commonsense Reasoning: A Survey”, arXiv:2302.04752v2. https://doi.org/10.48550/arXiv.2302.04752
Emelin, D.; Sennrich, R. (2021) “Wino-X: Multilingual Winograd Schemas for Commonsense Reasoning and Coreference Resolution”, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, p. 8517-8532. https://doi.org/10.18653/v1/2021.emnlp-main.670
Floridi, L. (2023) “AI as Agency Without Intelligence: on ChatGPT, Large Language Models, and Other Generative Models”, Philosophy & Technology 36 (15). https://doi.org/10.1007/s13347-023-00621-y
French, R. M. (2000) “The turing test: The first 50 years”, Trends in Cognitive Sciences 4 (3), p. 115-122. https://doi.org/10.1016/S1364-6613(00)01453-4
He, W. et al. (2021) “WINOLOGIC: A Zero-Shot Logic-based Diagnostic Dataset for Winograd Schema Challenge.”, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, p. 3779–3789. https://doi.org/10.18653/v1/2021.emnlp-main.307
Isaak, N.; Michael, L. (2019) “WinoFlexi: A Crowdsourcing Platform for the Development of Winograd Schemas” In: Liu, J., Bailey, J. (eds) AI 2019: Advances in Artificial Intelligence. AI 2019. Lecture Notes in Computer Science 11919. https://doi.org/10.1007/978-3-030-35288-2_24
Kocijan, V. et al. (2023) “The Defeat of the Winograd Schema Challenge”, arXiv:2201.02387v3. https://doi.org/10.48550/arXiv.2201.02387
Levesque, H. J.; Davis, E.; Morgenstern, L. (2012) “The Winograd Schema Challenge”, Thirteenth international conference on the principles of knowledge representation and reasoning, p. 552-561. https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf
Melo, G. S. D.; Imaizumi, V. A.; Cozman, F. G. (2019), “Winograd Schemas in Portuguese”, Anais do Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2019), p. 787–798. https://doi.org/10.5753/eniac.2019.9334
Isaak, N.; Michael, L. (2020) “Winventor: A Machine-driven Approach for the Development of Winograd Schemas”, Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, p. 26-35. https://doi.org/10.5220/0008902600260035
OpenAI (2023) “GPT-4 Technical Report”, arXiv:2303.08774v3. https://doi.org/10.48550/arXiv.2303.08774
Petrov, A. et al. (2023), “Language Model Tokenizers Introduce Unfairness Between Languages”, arXiv:2305.15425v1. https://doi.org/10.48550/arXiv.2305.15425
Pires, R. et al. (2023) “Sabiá: Portuguese Large Language Models”, arXiv:2304.07880v2. https://doi.org/10.48550/arXiv.2304.07880
Sakaguchi, K. et al. (2021) “WinoGrande: An Adversarial Winograd Schema Challenge at Scale”, Communications of the ACM 64(9), p. 99-106. https://doi.org/10.1145/3474381
Shavrina, T. et al. (2020), “RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark”, EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, p. 4717-4726. https://doi.org/10.18653/v1/2020.emnlp-main.381
Storks, S.; Gao, Q.; Chai, J. Y. (2019) “Recent Advances in Natural Language Inference: A Survey of Benchmarks, Resources, and Approaches”, arXiv:1904.01172v3. https://doi.org/10.48550/arXiv.1904.01172
Turing, A. M. (1950) “Computing machinery and intelligence”, Mind LIX (236), p. 433-460. https://doi.org/10.1093/mind/LIX.236.433
Vadász, N.; Ligeti-Nagy, N. (2022) “Winograd schemata and other datasets for anaphora resolution in Hungarian”, Acta Linguistica Academica 69 (4), p. 564-580. https://doi.org/10.1556/2062.2022.00575
Wang, A. et al. (2018) “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding”, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, p. 353-355. https://doi.org/10.18653/v1/W18-5446
Wang, A. et al. (2019) “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems”, NIPS'19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, p. 3266-3280. https://dl.acm.org/doi/10.5555/3454287.3454581
Winograd, T. (1972) “Understanding natural language”, Cognitive Psychology 3(1), p. 1 – 191. https://doi.org/10.1016/0010-0285(72)90002-3
Published
25/09/2023
How to Cite
DO NASCIMENTO, Thiago Gomes; CORTIZ, Diogo. Avaliação do senso comum em modelos de linguagem através de benchmarks: Desafio de Winograd aplicado ao ChatGPT em português brasileiro. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 14., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 193-198. DOI: https://doi.org/10.5753/stil.2023.233957.