Benchmarking Large Language Models for Text-to-SQL in Brazilian Portuguese and English

  • Luís Felippe C. de Carvalho (IFES)
  • Paulo Sérgio dos S. Júnior (IFES)
  • Hilário Tomaz Alves de Oliveira (IFES)

Abstract


This work assessed the performance of seventeen large language models, including both open-source and proprietary models, on the Text-to-SQL task in Brazilian Portuguese and English. Two schema representation strategies were considered: textual descriptions and data definition language (DDL) statements. Experimental results on the Spider dataset demonstrated the superior performance of proprietary models, particularly Gemini-2.5-Flash-preview, as measured by the Execution Accuracy and Exact Match metrics. Among the open-source models, Qwen-2.5-Coder-14B achieved the highest performance. An error analysis of the best-performing model revealed strong proficiency in handling clauses such as SELECT and AND/OR, while considerable challenges persisted in generating more complex constructs, including GROUP BY with HAVING and set operators such as UNION and INTERSECT.
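To make the abstract's two schema representation strategies and evaluation metrics concrete, the following is a minimal Python sketch. The table names (singer, concert), the SCHEMA_TEXT and SCHEMA_DDL strings, and the execution_accuracy helper are illustrative assumptions for a small Spider-style schema; they are not the paper's actual prompt templates, and the comparison shown is a simplification of the official Spider evaluation script.

```python
import sqlite3
from collections import Counter

# Strategy 1: the schema as a plain-text description (illustrative wording).
SCHEMA_TEXT = (
    "Table singer has columns singer_id (primary key), name, and country. "
    "Table concert has columns concert_id (primary key), singer_id "
    "(foreign key to singer.singer_id), and year."
)

# Strategy 2: the same schema expressed as data definition language (DDL).
SCHEMA_DDL = """
CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE concert (concert_id INTEGER PRIMARY KEY,
                      singer_id INTEGER REFERENCES singer(singer_id),
                      year INTEGER);
"""

def execution_accuracy(predicted_sql: str, gold_sql: str, db_path: str) -> bool:
    """Execution Accuracy: run the predicted and gold queries against the
    same database and compare their result sets. Exact Match, by contrast,
    compares the structure of the queries themselves, so it can reject a
    prediction that is written differently but returns the correct rows."""
    with sqlite3.connect(db_path) as conn:
        pred_rows = conn.execute(predicted_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    # Order-insensitive, multiset comparison of result rows; the official
    # Spider evaluator applies further normalization and is authoritative.
    return Counter(pred_rows) == Counter(gold_rows)
```

For example, a predicted query such as `SELECT country FROM singer GROUP BY country HAVING count(*) > 1` would be scored correct under Execution Accuracy whenever its rows match the gold query's output, even if the two queries differ syntactically, which is precisely why constructs like GROUP BY with HAVING are evaluated by execution rather than by string comparison alone.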

References

Affolter, K., Stockinger, K., and Bernstein, A. (2019). A comparative survey of recent natural language interfaces for databases. The VLDB Journal, 28(5):793–819.

Anthropic (2025). Claude 3.7 Sonnet system card. [link]. Accessed on: 27/05/2025.

Google Inc. (2025a). Gemini 2.0 Flash. [link]. Accessed on: 27/05/2025.

Google Inc. (2025b). Gemini 2.5 Flash Preview. [link]. Accessed on: 27/05/2025.

Google Inc. (2025c). Gemini 2.5 Pro Preview. [link]. Accessed on: 27/05/2025.

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al. (2024). Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186.

Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. (2024). GPT-4o system card. arXiv preprint arXiv:2410.21276.

José, M. A. and Cozman, F. G. (2021). mRAT-SQL+GAP: A Portuguese Text-to-SQL transformer. In Intelligent Systems: 10th Brazilian Conference, BRACIS 2021, Virtual Event, November 29–December 3, 2021, Proceedings, Part II, pages 511–525. Springer.

Katsogiannis-Meimarakis, G. and Koutrika, G. (2023). A survey on deep learning approaches for text-to-SQL. The VLDB Journal, 32(4):905–936.

Kim, H., So, B.-H., Han, W.-S., and Lee, H. (2020). Natural language to SQL: Where are we today? Proceedings of the VLDB Endowment, 13(10):1737–1750.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J., editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Pedroso, B., Pereira, M., and Pereira, D. (2025). Performance evaluation of LLMs in the text-to-SQL task in Portuguese. In Anais do XXI Simpósio Brasileiro de Sistemas de Informação, pages 260–269, Porto Alegre, RS, Brasil. SBC.

Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Sauvestre, R., Remez, T., et al. (2023). Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Shi, L., Tang, Z., Zhang, N., Zhang, X., and Yang, Z. (2024). A survey on employing large language models for text-to-SQL tasks. ACM Computing Surveys.

CodeGemma Team, Zhao, H., Hui, J., Howland, J., Nguyen, N., Zuo, S., Hu, A., Choquette-Choo, C. A., Shen, J., Kelley, J., et al. (2024). CodeGemma: Open code models based on Gemma. arXiv preprint arXiv:2406.11409.

Gemma Team, Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al. (2025). Gemma 3 technical report. arXiv preprint arXiv:2503.19786.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Wang, B., Shin, R., Liu, X., Polozov, O., and Richardson, M. (2019). RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. arXiv preprint arXiv:1911.04942.

Xue, S., Jiang, C., Shi, W., Cheng, F., Chen, K., Yang, H., Zhang, Z., He, J., Zhang, H., Wei, G., et al. (2023). DB-GPT: Empowering database interactions with private large language models. arXiv preprint arXiv:2312.17449.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.

Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., Ma, J., Li, I., Yao, Q., Roman, S., Zhang, Z., and Radev, D. (2018). Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Riloff, E., Chiang, D., Hockenmaier, J., and Tsujii, J., editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, Brussels, Belgium. Association for Computational Linguistics.
Published

2025-09-29

CARVALHO, Luís Felippe C. de; S. JÚNIOR, Paulo Sérgio dos; OLIVEIRA, Hilário Tomaz Alves de. Benchmarking Large Language Models for Text-to-SQL in Brazilian Portuguese and English. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 101-112. DOI: https://doi.org/10.5753/stil.2025.37817.