Ad-hoc vs. LLM-based System for Information Retrieval in Large Tabular Data: A Comparative Study in Public Medicine Procurement Audits
Abstract
Background: Auditing is essential to the oversight of public spending. Despite its importance, auditing efforts must often prioritize a few targets because of limited human resources. A system that automatically processes large documents is a feasible way to scale the auditing process.
Problem: The Information Retrieval (IR) problem considered in this work has two components: (i) the text to be searched for and (ii) the data source where the required information is expected to be found. The first component is not standardized, which is a challenge for an automated solution. The second component is structured, but it resides in a large data source, which may be an obstacle for some automated IR methods. Specifically, given a drug specification, our system must find all available products that match this description in a large data source.
Solution: This work investigates two information retrieval solutions. The first approach relies on a priori knowledge of the problem to preprocess the text and compute word similarity. The second approach uses a large language model (LLM) to search the same data source.
IS Theory: Information Processing Theory.
Research Method: Proof of Concept.
Experimental Results: The proposed ad-hoc method reaches accuracies from 72.4% to 86.9%, while the LLM-based approach struggles to produce satisfactory results, mainly because of its non-deterministic behavior and the hallucination problem.
Contribution: For industry, the developed system has the potential to significantly improve the quality and scale of auditing processes. For academia, this work reveals limitations of LLM-based approaches for searching large structured tabular data (approximately 25,000 rows).
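To make the ad-hoc idea concrete, the following is a minimal sketch of matching a free-text drug specification against structured product descriptions. The abstract does not detail the preprocessing steps or the similarity measure, so everything here is an assumption made for illustration: the accent stripping and digit/letter tokenization, the choice of Jaccard similarity, the 0.6 threshold, the function names, and the three product strings are all hypothetical and are not taken from the authors' system or data.

```python
from __future__ import annotations

import re
import unicodedata


def normalize(text: str) -> set[str]:
    """Lowercase, strip accents, and split into letter-only and digit-only tokens."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Separate runs of letters from runs of digits so "500mg" becomes {"500", "mg"}.
    return set(re.findall(r"[a-z]+|[0-9]+", no_accents))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (assumed measure, not the paper's)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_matches(spec: str, products: list[str], threshold: float = 0.6) -> list[tuple[float, str]]:
    """Return (score, product) pairs scoring at or above the threshold, best first."""
    spec_tokens = normalize(spec)
    scored = [(jaccard(spec_tokens, normalize(p)), p) for p in products]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


# Hypothetical rows standing in for the large structured data source.
PRODUCTS = [
    "DIPIRONA SODICA 500 MG COMPRIMIDO",
    "DIPIRONA SODICA 500 MG/ML SOLUCAO ORAL",
    "PARACETAMOL 750 MG COMPRIMIDO",
]

# Hypothetical non-standardized specification, as it might appear in a procurement notice.
matches = find_matches("Dipirona sódica 500mg comprimido", PRODUCTS)
for score, product in matches:
    print(f"{score:.2f}  {product}")
```

A deterministic scorer of this kind returns the same ranking for the same input on every run, which is the property the abstract contrasts with the non-deterministic behavior of the LLM-based approach.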