Human or Machine? Evaluating the Performance of Language Models as Judges in the Turing Test

Jailton Junior de Sousa Coelho; Maria Luiza Souza Ferrari; Maria Eloize das Neves; Amanda Vitória Cochinski; Laura Soares Correa de Oliveira; Ane Thayne de Oliveira Teixeira; Estefany da Silva Pedroso; Alan Floriano (all IFPR)

doi:10.5753/latinoware.2025.16399

Abstract

This paper investigates the ability of Large Language Models (LLMs) to act as evaluators in an Inverted Turing Test, in which the machine tries to determine whether its interlocutor is a human or another AI. Experiments were conducted with ChatGPT (OpenAI), LLaMA (Meta), and Claude (Anthropic), totaling 270 short interviews. In a controlled chat environment, the models acted as interviewers, holding informal conversations and classifying the interlocutor as human or AI based solely on the textual content. The study also analyzed each model's ability to evaluate itself in conversations with itself. The results show that distinguishing between human and artificial interlocutors remains a major challenge. ChatGPT achieved the highest accuracy (47.8%), although with recurring errors. LLaMA and Claude performed worse, with 5.6% and 2.2%, respectively. Overall, the interviewer AIs classified their interlocutor correctly in 18.5% of the interviews.

Keywords: Inverted Turing Test, Artificial Intelligence, Language Models
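
The overall rate in the abstract can be cross-checked against the per-model rates. The sketch below is only a consistency check, not the authors' analysis: it assumes the 270 interviews were split evenly across the three interviewer models (90 each), which the abstract does not state, and the variable names are illustrative.

```python
# Per-model accuracy (%) as reported in the abstract.
reported_accuracy = {"ChatGPT": 47.8, "LLaMA": 5.6, "Claude": 2.2}

TOTAL_INTERVIEWS = 270
# ASSUMPTION (not stated in the abstract): interviews are split evenly per model.
interviews_per_model = TOTAL_INTERVIEWS // len(reported_accuracy)

# Recover the implied number of correct classifications for each model.
correct_counts = {
    model: round(pct / 100 * interviews_per_model)
    for model, pct in reported_accuracy.items()
}

# Pool the counts to obtain the overall accuracy across all interviews.
overall_accuracy = 100 * sum(correct_counts.values()) / TOTAL_INTERVIEWS

print(correct_counts)
print(f"overall accuracy: {overall_accuracy:.1f}%")
```

Under that equal-split assumption, the per-model percentages correspond to whole numbers of correct classifications, and pooling them reproduces the 18.5% overall figure reported above.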
