Evaluating RAG-based QA Systems: A Comparative Analysis of LLM as a Judge, Traditional Metrics, and Human Alignment
Abstract
Evaluating RAG-based Question Answering systems presents ongoing challenges, as traditional NLP metrics often inadequately capture nuanced answer quality and the reliability of LLM as a Judge paradigms requires further validation. This study comprehensively compares two distinct RAG QA systems on a domain-specific dataset from a consulting company. We investigate the efficacy and human alignment of LLM as a Judge configurations (Fine-Tuning and In-Context Learning), benchmarking them against NLP metrics and human evaluations. Results indicate that BERTScore is more indicative of semantic similarity than lexical-based metrics. For LLM as a Judge evaluations, Prometheus 2 using Pairwise Comparison demonstrated the strongest human alignment.
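To make the contrast between lexical overlap and semantic similarity concrete, the minimal sketch below scores a paraphrased answer against a reference with both ROUGE-L and BERTScore. It is illustrative only: it assumes the bert_score and rouge_score Python packages and a toy answer pair, not the paper's dataset or evaluation pipeline.

```python
# Illustrative sketch (not the paper's pipeline): contrast a lexical metric
# (ROUGE-L) with BERTScore on a paraphrased answer.
# Assumes: pip install bert-score rouge-score
from bert_score import score as bert_score
from rouge_score import rouge_scorer

reference = "The project kickoff meeting is scheduled for the first week of March."
candidate = "Kickoff for the project will take place in early March."  # same meaning, little word overlap

# Lexical overlap: ROUGE-L rewards shared word sequences, so paraphrases score low.
lexical = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l_f1 = lexical.score(reference, candidate)["rougeL"].fmeasure

# Semantic similarity: BERTScore aligns contextual token embeddings instead of surface forms.
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"ROUGE-L F1:   {rouge_l_f1:.3f}")
print(f"BERTScore F1: {f1.item():.3f}")
```

On a pair like this, the lexical score is low because few word sequences overlap, while BERTScore remains comparatively high; exact numbers depend on the checkpoint that bert_score downloads.
References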
Bai, Y., Ying, J., Cao, Y., Lv, X., He, Y., Wang, X., Yu, J., Zeng, K., Xiao, Y., Lyu, H., Zhang, J., Li, J., and Hou, L. (2023). Benchmarking foundation models with language model-as-an-examiner. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA. Curran Associates Inc.
Banerjee, S. and Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Goldstein, J., Lavie, A., Lin, C.-Y., and Voss, C., editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
Chroma (2025). Chroma. Available at: [link]. Accessed on 18 May 2025.
Google (2025). Gemini 2.0 Flash. Available at: [link]. Accessed on 18 May 2025.
Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, S., Zhang, K., Wang, Y., Gao, W., Ni, L., and Guo, J. (2024). A survey on LLM-as-a-judge. arXiv preprint.
Ho, X., Huang, J., Boudin, F., and Aizawa, A. (2025). LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA. arXiv preprint arXiv:2504.11972.
Kamalloo, E., Dziri, N., Clarke, C., and Rafiei, D. (2023). Evaluating open-domain question answering in the era of large language models. In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5591–5606, Toronto, Canada. Association for Computational Linguistics.
Kim, S., Suk, J., Longpre, S., Lin, B. Y., Shin, J., Welleck, S., Neubig, G., Lee, M., Lee, K., and Seo, M. (2024). Prometheus 2: An open source language model specialized in evaluating other language models. In Al-Onaizan, Y., Bansal, M., and Chen, Y.-N., editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353, Miami, Florida, USA. Association for Computational Linguistics.
Ko, M., Lee, J., Kim, H., Kim, G., and Kang, J. (2020). Look at the first sentence: Position bias in question answering. In Webber, B., Cohn, T., He, Y., and Liu, Y., editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1109–1121, Online. Association for Computational Linguistics.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
OpenAI (2025). GPT-4o mini: Advancing cost-efficient intelligence. Available at: [link]. Accessed on 18 May 2025.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, USA. Association for Computational Linguistics.
Qin, Z., Jagerman, R., Hui, K., Zhuang, H., Wu, J., Yan, L., Shen, J., Liu, T., Liu, J., Metzler, D., Wang, X., and Bendersky, M. (2024). Large language models are effective text rankers with pairwise ranking prompting. In Duh, K., Gomez, H., and Bethard, S., editors, Findings of the Association for Computational Linguistics: NAACL 2024, pages 1504–1518, Mexico City, Mexico. Association for Computational Linguistics.
Sai, A. B., Mohankumar, A. K., and Khapra, M. M. (2022). A survey of evaluation metrics used for NLG systems. ACM Comput. Surv., 55(2).
Schluter, N. (2017). The limits of automatic summarisation according to ROUGE. In Lapata, M., Blunsom, P., and Koller, A., editors, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 41–45, Valencia, Spain. Association for Computational Linguistics.
Wang, Y., Yu, Z., Zeng, Z., Yang, L., Wang, C., Chen, H., Jiang, C., Xie, R., Wang, J., Xie, X., Ye, W., Zhang, S.-B., and Zhang, Y. (2023). PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. arXiv preprint arXiv:2306.05087.
Yu, Q., Zheng, Z., Song, S., Li, Z., Xiong, F., Tang, B., and Chen, D. (2024). xFinder: Large language models as automated evaluators for reliable evaluation. In International Conference on Learning Representations.
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. (2020). BERTScore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA. Curran Associates Inc.
Zhu, L., Wang, X., and Wang, X. (2025). JudgeLM: Fine-tuned large language models are scalable judges. In International Conference on Learning Representations.
Published
2025-09-29
How to Cite
MIYAJI, Renato; MOULIN, Renato; MONÇÃO, Samuel; MACHADO, Leonardo. Evaluating RAG-based QA Systems: A Comparative Analysis of LLM as a Judge, Traditional Metrics, and Human Alignment. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 247-258. DOI: https://doi.org/10.5753/stil.2025.37829.
