DBVinci – towards the usage of GPT engine for processing SQL Queries

Vanessa Câmara; Rayol Mendonca-Neto; André Silva; Luiz Cordovil-Jr

Vanessa Câmara, Sidia R&D Institute
Rayol Mendonca-Neto, Sidia R&D Institute
André Silva, Sidia R&D Institute
Luiz Cordovil-Jr, Sidia R&D Institute

Abstract

One goal of Natural Language Processing (NLP) is to transform sentences into information that is relevant in a given context. Chatbots, translation systems, and sentiment analysis classifiers all work this way. Advances in NLP techniques have made it possible to automate complex tasks such as converting natural language questions into queries over tabular data, specifically SQL, so that contextualized data can be returned. Because interpreting data to obtain information is crucial in many areas, and because text-to-SQL parsers have their own particularities, we propose a SQL processing engine whose internals are customized with natural language instructions. DBVinci, our proposed processing model, is based on OpenAI's GPT-3.5 text-davinci-003 engine, which can perform language tasks such as text-to-SQL, follows instructions consistently, and supports inserting completions within text. Our framework is built on top of GPT-3.5 and decomposes complex SQL queries into a series of simple processing steps described in natural language. DBVinci outperforms well-known text-to-SQL methods (e.g., RAT-SQL and SQLOVA), reaching 89.7% execution accuracy on the WikiSQL benchmark. It also performs well without a large-scale annotated dataset for fine-tuning on the downstream task, achieving 90% accuracy in the zero-shot setting. We therefore conclude that obtaining competitive results with a Pre-trained Language Model (PLM) does not require the "pre-training + fine-tuning" paradigm, and that the proposed method achieves promising results in the zero-shot setting.
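The two ideas the abstract relies on, zero-shot prompting and execution accuracy, can be illustrated with a minimal sketch. It is not DBVinci's implementation: the prompt template, the function names (`build_zero_shot_prompt`, `execution_match`, `execution_accuracy`), and the toy `players` table are all assumptions made for illustration, and no model is called. Execution accuracy is taken here to mean that a predicted query counts as correct when it returns the same rows as the reference query.

```python
import sqlite3


def build_zero_shot_prompt(table_name, columns, question):
    """Assemble a zero-shot text-to-SQL prompt (hypothetical template)."""
    schema = f"Table {table_name}, columns = [{', '.join(columns)}]"
    return (
        "### SQLite table with its properties:\n"
        f"# {schema}\n"
        f"### Question: {question}\n"
        "### SQL query:\nSELECT"
    )


def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries yield the same rows (order ignored)."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # an invalid prediction counts as a miss
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted, key=repr) == sorted(gold, key=repr)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose results agree."""
    hits = sum(execution_match(conn, p, g) for p, g in pairs)
    return hits / len(pairs)


# Toy in-memory table, invented for this example only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE players (name TEXT, team TEXT, points INTEGER)")
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [("Ana", "Red", 30), ("Bo", "Blue", 12), ("Cy", "Red", 21)],
)

prompt = build_zero_shot_prompt(
    "players", ["name", "team", "points"], "Which Red players scored over 20?"
)

pairs = [
    # Same result, different query text: counted as correct.
    ("SELECT name FROM players WHERE points > 20 AND team = 'Red'",
     "SELECT name FROM players WHERE team = 'Red' AND points > 20"),
    # Wrong filter: counted as incorrect.
    ("SELECT name FROM players WHERE team = 'Blue'",
     "SELECT name FROM players WHERE team = 'Red' AND points > 20"),
]
accuracy = execution_accuracy(conn, pairs)
```

Comparing result sets rather than query strings is what lets two differently worded but equivalent queries both count as correct.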

Keywords: Text-to-SQL, GPT-3.5, PLM, zero-shot, SQL processing

References

Katrin Affolter, Kurt Stockinger, and Abraham Bernstein. 2019. A comparative survey of recent natural language interfaces for databases. The VLDB Journal 28 (2019), 793–819.

Ruichu Cai, Boyan Xu, Xiaoyan Yang, Zhenjie Zhang, Zijian Li, and Zhihao Liang. 2018. An Encoder-Decoder Framework Translating Natural Language to Database Queries. arXiv:1711.06061 [cs.CL]

Amir Erfan Eshratifar, David Eigen, Michael Gormish, and Massoud Pedram. 2021. Coarse2Fine: a two-stage training method for fine-grained visual classification. Machine Vision and Applications 32, 2 (2021), 49.

Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. 2019. X-SQL: reinforce schema representation with context. arXiv:1908.08113 [cs.CL]

Wonseok Hwang, Jinyeong Yim, Seunghyun Park, and Minjoon Seo. 2019. A Comprehensive Exploration on WikiSQL with Table-Aware Word Contextualization. arXiv:1902.01069 [cs.CL]

Qin Lyu, Kaushik Chakrabarti, Shobhit Hathi, Souvik Kundu, Jianwen Zhang, and Zheng Chen. 2020. Hybrid Ranking Network for Text-to-SQL. arXiv:2008.04759 [cs.CL]

Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the Text-to-SQL Capabilities of Large Language Models. arXiv:2204.00498 [cs.CL]

Jaydeep Sen, Chuan Lei, Abdul Quamar, Fatma Özcan, Vasilis Efthymiou, Ayushi Dalmia, Greg Stager, Ashish Mittal, Diptikalyan Saha, and Karthik Sankaranarayanan. 2020. Athena++: natural language querying for complex nested SQL queries. Proceedings of the VLDB Endowment 13, 12 (2020), 2747–2759.

Tianze Shi, Kedar Tatwawadi, Kaushik Chakrabarti, Yi Mao, Oleksandr Polozov, and Weizhu Chen. 2018. IncSQL: Training Incremental Text-to-SQL Parsers with Non-Deterministic Oracles. arXiv:1809.05054 [cs.CL]

Immanuel Trummer. 2022. CodexDB: Synthesizing code for query processing from natural language instructions using GPT-3 Codex. Proceedings of the VLDB Endowment 15, 11 (2022), 2921–2928.

Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2021. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. arXiv:1911.04942 [cs.CL]

Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2022. UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models. arXiv:2201.05966 [cs.CL]

Prateek Yadav, Qing Sun, Hantian Ding, Xiaopeng Li, Dejiao Zhang, Ming Tan, Xiaofei Ma, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, et al. 2023. Exploring Continual Learning for Code Generation Models. arXiv:2307.02435

Junjie Ye, Xuanting Chen, Nuo Xu, Can Zu, Zekai Shao, Shichun Liu, Yuhan Cui, Zeyang Zhou, Chao Gong, Yang Shen, et al. 2023. A comprehensive capability analysis of GPT-3 and GPT-3.5 series models. arXiv:2303.10420

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Brussels, Belgium, 3911–3921.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv:1709.00103 [cs.CL]