TableRAG: A Novel Approach for Augmenting LLMs with Information from Retrieved Tables

Abstract


We present TableRAG, a novel pipeline designed to integrate tabular data into traditional Retrieval-Augmented Generation (RAG) systems. Our approach comprises three main parts: (i) generating textual representations of tables; (ii) indexing these representations in vector databases for retrieval; and (iii) employing large language models to generate SQL or Python code for data manipulation over a given table. We assessed the effectiveness of TableRAG by comparing retrieval and re-ranking accuracies on the OTT-QA benchmark and by using both open- and closed-source LLMs to generate code for answering questions from the WikiTableQuestions benchmark. Our best results show 86.7% HITS@5 for retrieval and 74% accuracy for Q&A, demonstrating the feasibility of integrating tabular data into RAG systems with high accuracy.
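
For concreteness, the sketch below illustrates the three-stage pipeline summarized above in plain Python. The helper names (table_to_text, retrieve, answer) and the embed_text and generate_code callables are hypothetical stand-ins for an embedding model and a code-generating LLM; this is a minimal illustration of the idea, not the authors' implementation.

    # Hypothetical sketch of a three-stage table-RAG pipeline:
    # (i) serialize tables to text, (ii) retrieve by embedding similarity,
    # (iii) let an LLM write SQL that is executed over the retrieved table.
    import math
    import sqlite3
    from typing import Callable

    def table_to_text(name: str, header: list[str], rows: list[tuple]) -> str:
        """(i) Build a textual representation of a table: name, columns, sample rows."""
        sample = "; ".join(", ".join(map(str, r)) for r in rows[:3])
        return f"Table {name} with columns {', '.join(header)}. Sample rows: {sample}"

    def cosine(a: list[float], b: list[float]) -> float:
        """Cosine similarity between two embedding vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / (norm + 1e-9)

    def retrieve(question: str, index: dict[str, list[float]],
                 embed_text: Callable[[str], list[float]], k: int = 5) -> list[str]:
        """(ii) Rank indexed table representations by similarity to the question."""
        q = embed_text(question)
        return sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)[:k]

    def answer(question: str, table_name: str, conn: sqlite3.Connection,
               generate_code: Callable[[str, str], str]) -> list[tuple]:
        """(iii) Ask the LLM for SQL over the retrieved table, then execute it."""
        sql = generate_code(question, table_name)  # e.g. "SELECT ... FROM <table_name> ..."
        return conn.execute(sql).fetchall()

In such a setup, the embeddings of all table descriptions would be computed once at indexing time, and only the question embedding and the generated query are produced at answering time.
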

Keywords: retrieval augmented generation, tabular data, information retrieval, generative AI, large language models

Published
17/11/2024
SOUZA, Elvis A. de; SILVA, Patricia F. da; GOMES, Diogo; BATISTA, Vitor; BATISTA, Evelyn; PACHECO, Marco. TableRAG: A Novel Approach for Augmenting LLMs with Information from Retrieved Tables. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 15., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 182-191. DOI: https://doi.org/10.5753/stil.2024.245371.