LLM Agents for Search via Reinforcement Learning with Trajectory-Level Self-Evaluation

  • Leandro Yamachita da Costa UFRJ
  • João Baptista de Oliveira e Souza Filho UFRJ

Abstract


Agentic search enables language models to iteratively query and reason over retrieved information to answer complex questions. While reinforcement learning is used to train such agents, most approaches rely solely on final-answer accuracy as the reward signal, which can limit learning. We analyze a reward strategy that combines final-answer correctness with trajectory-level feedback, guiding the agent to improve its search behavior. The agent is trained with QLoRA for efficient fine-tuning and performs self-evaluation by comparing answers to references and scoring search trajectories using predefined rubrics, without external supervision. Experiments on multi-hop QA tasks show gains of up to 5.7 points over a standard accuracy-only reward setup.
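The abstract describes a reward that mixes final-answer correctness with a rubric-based, trajectory-level self-evaluation score. The sketch below illustrates one way such a combined reward could be computed; all function names, rubric categories, and the weighting scheme are illustrative assumptions, not the authors' implementation.

```python
def answer_reward(predicted: str, reference: str) -> float:
    """1.0 if the normalized prediction matches the reference, else 0.0
    (a simple exact-match check standing in for answer verification)."""
    norm = lambda s: " ".join(s.lower().split())
    return 1.0 if norm(predicted) == norm(reference) else 0.0

def trajectory_reward(rubric_scores: dict[str, float]) -> float:
    """Average of per-rubric self-evaluation scores in [0, 1],
    e.g. query relevance, evidence use, search efficiency."""
    return sum(rubric_scores.values()) / len(rubric_scores)

def combined_reward(predicted: str, reference: str,
                    rubric_scores: dict[str, float],
                    alpha: float = 0.8) -> float:
    """Weighted mix of outcome-level and trajectory-level feedback."""
    return (alpha * answer_reward(predicted, reference)
            + (1.0 - alpha) * trajectory_reward(rubric_scores))

# Example: a correct answer with a partially efficient search trajectory
# still receives a graded (rather than purely binary) reward signal.
r = combined_reward(
    "Paris", "paris",
    {"query_relevance": 1.0, "evidence_use": 0.5, "efficiency": 0.5},
)
print(round(r, 2))
```

In a PPO-style setup, a graded signal like this gives the policy gradient information even on rollouts whose final answer is wrong, which is the limitation of accuracy-only rewards the abstract points to.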

References

Chen, Junying, et al. "Huatuogpt-o1, towards medical complex reasoning with llms." arXiv preprint arXiv:2412.18925 (2024).

Chen, Mingyang, et al. "Learning to reason with search for llms via reinforcement learning." arXiv preprint arXiv:2503.19470 (2025).

Dettmers, Tim, et al. "Qlora: Efficient finetuning of quantized llms." Advances in neural information processing systems 36 (2023): 10088-10115.

El-Kishky, Ahmed, et al. "Competitive programming with large reasoning models." arXiv preprint arXiv:2502.06807 (2025).

Feng, Jiazhan, et al. "Retool: Reinforcement learning for strategic tool use in llms." arXiv preprint arXiv:2504.11536 (2025).

Google DeepMind. Gemini Deep Research. Google, 11 Dec. 2024, [link].

Gu, Jiawei, et al. "A survey on llm-as-a-judge." arXiv preprint arXiv:2411.15594 (2024).

Guo, Daya, et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." arXiv preprint arXiv:2501.12948 (2025).

Hu, Jingcheng, et al. "Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model." arXiv preprint arXiv:2503.24290 (2025).

Jaech, Aaron, et al. "Openai o1 system card." arXiv preprint arXiv:2412.16720 (2024).

Jin, Bowen, et al. "Search-r1: Training llms to reason and leverage search engines with reinforcement learning." arXiv preprint arXiv:2503.09516 (2025).

Kandpal, Nikhil, et al. "Large language models struggle to learn long-tail knowledge." International Conference on Machine Learning. PMLR, 2023.

Karpukhin, Vladimir, et al. "Dense Passage Retrieval for Open-Domain Question Answering." EMNLP (1). 2020.

Lewis, Patrick, et al. "Retrieval-augmented generation for knowledge-intensive nlp tasks." Advances in neural information processing systems 33 (2020): 9459-9474.

Lightman, Hunter, et al. "Let's verify step by step." The Twelfth International Conference on Learning Representations. 2023.

Liu, Zhaowei, et al. "Fin-r1: A large language model for financial reasoning through reinforcement learning." arXiv preprint arXiv:2503.16252 (2025).

Ma, Peixian, et al. "Sql-r1: Training natural language to sql reasoning model by reinforcement learning." arXiv preprint arXiv:2504.08600 (2025).

Ma, Xinbei, et al. "Query rewriting in retrieval-augmented large language models." Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023.

OpenAI. Deep Research System Card. OpenAI, 25 Feb. 2025, [link].

Paschoal, André FA, et al. "Pirá: A bilingual portuguese-english dataset for question-answering about the ocean." Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2021.

Qian, Cheng, et al. "Toolrl: Reward is all tool learning needs." arXiv preprint arXiv:2504.13958 (2025).

Schnitzler, Julian, et al. "Morehopqa: More than multi-hop reasoning." arXiv preprint arXiv:2406.13397 (2024).

Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).

Shao, Zhihong, et al. "Deepseekmath: Pushing the limits of mathematical reasoning in open language models." arXiv preprint arXiv:2402.03300 (2024).

Singh, Joykirat, et al. "Agentic reasoning and tool integration for llms via reinforcement learning." arXiv preprint arXiv:2505.01441 (2025).

Song, Huatong, et al. "R1-searcher: Incentivizing the search capability in llms via reinforcement learning." arXiv preprint arXiv:2503.05592 (2025).

Su, Yi, et al. "Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains." arXiv preprint arXiv:2503.23829 (2025).

Sun, Hao, et al. "Zerosearch: Incentivize the search capability of llms without searching." arXiv preprint arXiv:2505.04588 (2025).

Team, Qwen. "Qwen2.5 technical report." arXiv preprint arXiv:2412.15115 (2024).

Trivedi, Harsh, et al. "MuSiQue: Multihop Questions via Single-hop Question Composition." Transactions of the Association for Computational Linguistics 10 (2022): 539-554.

Wang, Liang, et al. "Text embeddings by weakly-supervised contrastive pre-training." arXiv preprint arXiv:2212.03533 (2022).

Wang, Xingyao, et al. "Openhands: An open platform for ai software developers as generalist agents." arXiv preprint arXiv:2407.16741 (2024).

Xie, Tian, et al. "Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning." arXiv preprint arXiv:2502.14768 (2025).

Xu, Lingling, et al. "Parameter-efficient fine-tuning methods for pretrained language models: A critical review and assessment." arXiv preprint arXiv:2312.12148 (2023).

Yan, Shi-Qi, et al. "Corrective retrieval augmented generation." (2024).

Yao, Shunyu, et al. "tau-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains." arXiv preprint arXiv:2406.12045 (2024).

Zhao, Andrew, et al. "Absolute zero: Reinforced self-play reasoning with zero data." arXiv preprint arXiv:2505.03335 (2025).

Zhao, Xuandong, et al. "Learning to reason without external rewards." arXiv preprint arXiv:2505.19590 (2025).

Zheng, Yuxiang, et al. "Deepresearcher: Scaling deep research via reinforcement learning in real-world environments." arXiv preprint arXiv:2504.03160 (2025).
Published
29/09/2025
COSTA, Leandro Yamachita da; OLIVEIRA E SOUZA FILHO, João Baptista de. LLM Agents for Search via Reinforcement Learning with Trajectory-Level Self-Evaluation. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1221-1232. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.14460.