Evaluation of retrieval-augmented applications by implicit feedback
Abstract
As large language models have evolved in recent years and a specific market niche has formed around them, corporate applications have emerged and become strategic. In corporate settings, however, evaluating the results of these applications becomes critical. How do you know whether one model is better than another? How do you know whether a prompt or question can be improved? How do you diagnose errors? This article presents a new proposal for implicit feedback in Retrieval-Augmented Generation (RAG) architectures. The results demonstrate the potential of the proposal when applied to VIGIA, a RAG application that detects irregularities in public documents.
