Are LLMs Good Mathematicians? Evaluating Performance on Exercise Solving
Abstract
This work analyzes the use of two commercial LLMs (Google Gemini 2.5 Pro and OpenAI ChatGPT 4o) for solving high-school-level Mathematics exercises across five topics, including Functions, Geometry, Combinatorics, and others. In total, the models solved 50 questions prepared by the FGV (Fundação Getúlio Vargas) examination board. Both models achieved similar results, with Gemini performing slightly better, both in its reasoning and in selecting the correct alternative. For the questions that ChatGPT answered incorrectly, its o3-pro version was able to solve them correctly. These results can support decisions by individuals and organizations involved in the teaching and learning of Mathematics regarding the use of this technology.
Keywords:
LLMs, Exercise solving, Mathematics, LLMs in education, LLMs for mathematics teaching
Published
29/09/2025
How to Cite
ELEUTÉRIO, Igor Alberte R.; OLIVEIRA, Israel Efraim de; CAZZOLATO, Mirela T. LLMs São Bons Matemáticos? Avaliando o Desempenho em Resolução de Exercícios. In: LLMS, ANÁLISE DE GRAFOS E ONTOLOGIAS (LAGO) - SIMPÓSIO BRASILEIRO DE BANCO DE DADOS (SBBD), 40., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 438-448. DOI: https://doi.org/10.5753/sbbd_estendido.2025.247991.
