Evaluating the Ability of ChatGPT and DeepSeek to Solve Propositional Logic Proofs Using the Analytic Tableau Deductive System
Abstract
Large Language Models (LLMs) have been widely applied in educational contexts, but they face challenges in tasks that require rigorous logical reasoning. This paper evaluates the performance of ChatGPT-4o and DeepSeek R1 (DeepThink) in solving Propositional Logic exercises using the Analytic Tableau deductive system. The models' answers were analyzed based on the correct application of the system's rules. The results show that, although DeepSeek outperformed ChatGPT in the number of correct answers, both models still exhibit significant limitations, especially in proofs that require multiple rule applications.
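For readers unfamiliar with the method, the snippet below is a minimal, hypothetical illustration of a closed analytic tableau for a simple propositional argument (modus ponens: from p and p → q, conclude q), typeset with the qtree LaTeX package cited in the references. The specific argument and the layout are illustrative choices, not one of the exercises evaluated in the paper.

\documentclass{article}
\usepackage{qtree}
\begin{document}
% Closed tableau for p, p -> q |- q:
% list the premises and the negated conclusion on the trunk,
% then apply the branching (beta) rule to p -> q;
% both branches close (marked with x).
\Tree [.{$p$}
        [.{$p \to q$}
          [.{$\neg q$}
            [.{$\neg p$} {$\times$} ]
            [.{$q$}      {$\times$} ]
          ]
        ]
      ]
\end{document}

Since every branch closes, the set {p, p → q, ¬q} is unsatisfiable, which establishes the validity of the argument under the usual tableau rules (e.g., as presented in Huth 2004).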
References
Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. (2023). PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
Aydin, O., Karaarslan, E., Erenay, F. S., and Bacanin, N. (2025). Generative AI in academic writing: A comparison of DeepSeek, Qwen, ChatGPT, Gemini, Llama, Mistral, and Gemma. arXiv preprint arXiv:2503.04765.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
Carbonell, J. G., Michalski, R. S., and Mitchell, T. M. (1983). An overview of machine learning. Machine learning, pages 3–23.
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. (2023). PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. (2024). Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53.
Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
Huth, M. (2004). Logic in Computer Science: Modelling and Reasoning about Systems. Cambridge University Press.
Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274.
Koubaa, A. (2023). GPT-4 vs. GPT-3.5: A concise showdown.
Lalwani, A., Chopra, L., Hahn, C., Trippel, C., Jin, Z., and Sachan, M. (2024). NL2FOL: Translating natural language to first-order logic for logical fallacy detection. arXiv preprint arXiv:2405.02318.
Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. (2024). DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.
Liu, H., Ning, R., Teng, Z., Liu, J., Zhou, Q., and Zhang, Y. (2023). Evaluating the logical reasoning ability of ChatGPT and GPT-4. arXiv preprint arXiv:2304.03439.
Martins, F. L. B., Oliveira, A. C. A., Vasconcelos, D. R., and de Menezes, M. V. (2025). Avaliando a habilidade do ChatGPT de realizar provas de dedução natural em lógica proposicional e lógica de predicados. Revista Brasileira de Informática na Educação, 33:244–278.
Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N., and Mian, A. (2023). A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435.
OpenAI (2021). ChatGPT. [link]. Accessed: August 3, 2024.
Saparov, A., Pang, R. Y., Padmakumar, V., Joshi, N., Kazemi, M., Kim, N., and He, H. (2023). Testing the general deductive reasoning capacity of large language models using OOD examples. Advances in Neural Information Processing Systems, 36:3083–3105.
Sarzynska-Wawer, J., Wawer, A., Pawlak, A., Szymanowska, J., Stefaniak, I., Jarkiewicz, M., and Okruszek, L. (2021). Detecting formal thought disorder by deep contextualized word representations. Psychiatry Research, 304:114135.
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
Siskind, J. M. and Dimitriadis, A. (2008). Qtree, a LaTeX tree-drawing package.
Tlili, A., Shehata, B., Adarkwah, M. A., Bozkurt, A., Hickey, D. T., Huang, R., and Agyemang, B. (2023). What if the devil is my guardian angel: ChatGPT as a case study of using chatbots in education. Smart Learning Environments, 10(1):15.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Vasconcelos, D. R. (2023). ANITA: Analytic tableau proof assistant. arXiv preprint arXiv:2303.05864.
Viegas, C. V. et al. (2024). Avaliando a capacidade de LLMs na resolução de questões do POSCOMP.
Zhang, M. and Li, J. (2021). A commentary of GPT-3 in MIT Technology Review 2021. Fundamental Research, 1(6):831–833.
Published
29/09/2025
How to Cite
SANDES, Taís Rodrigues; VASCONCELOS, Davi Romero de; MENEZES, Maria Viviane de; LIMA, Victória de Oliveira. Evaluating the Ability of ChatGPT and DeepSeek to Solve Propositional Logic Proofs Using the Analytic Tableau Deductive System. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1069-1080. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.14333.
