Evaluation of retrieval-augmented applications by implicit feedback

  • Alessandro Marinho de Albuquerque Federal University of Santa Catarina (UFSC) http://orcid.org/0000-0003-0217-9705
  • Igor May Wensing Court of Accounts of Santa Catarina (TCE-SC)
  • Nelson Luiz Joppi Filho Federal University of Santa Catarina (UFSC)
  • Carina Dorneles Federal University of Santa Catarina (UFSC)

Abstract


With the rapid evolution of large language models in recent years and the emergence of a dedicated market niche, corporate applications have begun to appear and become strategic. In corporate settings, however, evaluating the results of these applications becomes critical. How can one tell whether one model is better than another? How can one tell whether the prompt or question can be improved? How should errors be diagnosed? This article presents a new proposal for implicit feedback in Retrieval-Augmented Generation (RAG) architectures. The results demonstrate the potential of the proposal when applied to VIGIA, a RAG application that detects irregularities in public documents.

Keywords: Large Language Models, Fraud Detection, Public procurement

Published
2024-10-14
ALBUQUERQUE, Alessandro Marinho de; WENSING, Igor May; JOPPI FILHO, Nelson Luiz; DORNELES, Carina. Evaluation of retrieve-augmented applications by implict feedback. In: WORKSHOP ON DATA SCIENCE AGAINST CORRUPTION IN THE PUBLIC SECTOR (DS-COPS) - BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 39. , 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024 . p. 253-259. DOI: https://doi.org/10.5753/sbbd_estendido.2024.243903.