LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent

Wanli Li1,2,*, Bince Qu1,2,*, Bo Pan1, Jianyu Zhang1, Zheng Liu3,†, Pan Zhang2, Wei Chen1, Bo Zhang1,2,†
1 Zhejiang University  •  2 Simplex AI  •  3 The Hong Kong Polytechnic University
* Equal contribution. Work done during internship at Simplex AI.  † Corresponding authors
✉ {wanli_li@zju.edu.cn, tonyzhang@simplexai.com}

TL;DR

LiteResearcher-4B is a 4B deep research agent trained with zero marginal RL API cost, outperforming 30B open-source deep research agents and surpassing Claude-4.5-Sonnet and GPT-5-high on selected benchmarks. Its RL stage runs entirely in a local search/browse environment, enabling 73.2M tool calls without live search or browse API consumption.

Key Results

GAIA / Xbench-DS: 71.3 / 78.0 (open-source SOTA; beats 30B agents)
GAIA points from RL: +15.7 (SFT 55.6 → RL 71.3; AgentCPM +3.8)
Local RL tool calls: 73.2M ($0 marginal API cost vs. $59K–$243K live web)
Figure: Performance of LiteResearcher. Left: Accuracy comparison on the Xbench DeepSearch benchmark across models of various scales. Right: Average rollout latency and cost per turn.
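
The headline numbers above can be cross-checked with a few lines of arithmetic. The sketch below is only illustrative: it divides the quoted live-web cost range by the quoted number of tool calls to get an implied average price per call. The page does not say how the $59K–$243K estimate was derived, so the per-call figures are an averaged assumption, not a published price.

```python
# Figures quoted in the Key Results above.
TOOL_CALLS = 73_200_000          # local RL tool calls
LIVE_COST_LOW = 59_000           # USD, lower live-web estimate
LIVE_COST_HIGH = 243_000         # USD, upper live-web estimate
GAIA_SFT, GAIA_RL = 55.6, 71.3   # GAIA accuracy before and after RL


def implied_price_per_call(total_usd: float, calls: int) -> float:
    """Average USD per tool call implied by a total bill (an assumption:
    the real estimate may price search and browse calls differently)."""
    return total_usd / calls


low = implied_price_per_call(LIVE_COST_LOW, TOOL_CALLS)
high = implied_price_per_call(LIVE_COST_HIGH, TOOL_CALLS)
rl_gain = round(GAIA_RL - GAIA_SFT, 1)

print(f"implied live price per call: ${low:.5f} to ${high:.5f}")
print(f"GAIA gain from RL: +{rl_gain} points")
```

Under this averaging assumption, the live-web range corresponds to roughly a tenth to a third of a cent per call, which is the cost the local environment removes from the RL stage.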

Abstract

Reinforcement Learning (RL) has emerged as a powerful training paradigm for LLM-based agents. However, scaling agentic RL for deep research remains constrained by two coupled challenges: hand-crafted synthetic data fail to elicit genuine real-world search capabilities, and dependence on real-world search during RL training introduces instability and high cost. Together, these limit the scalability of agentic RL.

LiteResearcher is a training framework that makes agentic RL scalable and low-cost. By constructing a lite virtual world that mirrors real-world search dynamics, we enable a continuously improving training recipe that lets a small search agent outperform large-scale open-source and commercial models (e.g., Tongyi DeepResearch and Claude-4.5-Sonnet). The RL stage runs entirely in a local search/browse environment, removing external API consumption during RL while preserving realistic tool-use dynamics. On widely used benchmarks such as GAIA and Xbench, LiteResearcher-4B achieves open-source state-of-the-art results of 71.3% and 78.0%, respectively, showing that scalable RL training is essential for deep research agents.

Method Overview

LiteResearcher constructs a virtual world that shares the architecture of the real web but executes in isolation from it. The framework consists of three key components:

(1) Co-constructed Training Data & Corpus: We scale up information sources (32M+ webpages, 1M+ domains) and identify five atomic search capabilities — direct retrieval, aggregation, enumeration, cross-verification, and statistics — to generate diverse, realistic training tasks.

(2) Stable Local Tool Environment: A local search engine (BGE-M3 + Milvus, ~0.15s/query) and a local browse tool (PostgreSQL, ~0.17s/page) serve all 73.2M training tool calls locally, with no external API consumption during RL and zero marginal tool cost.

(3) Difficulty-Aware Curriculum RL: Multi-stage training progressively increases task difficulty and context length, retaining only partially solvable instances to maintain a consistent training signal.

Figure: Overview of the LiteResearcher training framework.
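
To make component (2) concrete, here is a minimal sketch of what an offline search/browse tool pair could look like from the agent's side. It is not the LiteResearcher implementation: the `LocalEnvironment` class, its method names, and the `local://` URLs are hypothetical, a toy bag-of-words similarity stands in for BGE-M3 embeddings indexed in Milvus, and an in-memory dict stands in for the PostgreSQL page store. The point is only that both tools read from a fixed local corpus, so every call is deterministic and free of external API usage.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a dense encoder such as BGE-M3."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LocalEnvironment:
    """Offline search and browse tools over a fixed corpus (hypothetical interface)."""

    def __init__(self, pages: dict[str, str]) -> None:
        self.pages = pages                      # url -> text; stand-in for the page store
        self.index = {url: embed(text) for url, text in pages.items()}
        self.tool_calls = 0                     # counts every search or browse call

    def search(self, query: str, top_k: int = 3) -> list[str]:
        """Return the URLs of the top_k pages most similar to the query."""
        self.tool_calls += 1
        q = embed(query)
        ranked = sorted(self.index, key=lambda u: cosine(q, self.index[u]), reverse=True)
        return ranked[:top_k]

    def browse(self, url: str) -> str:
        """Return the stored text of a page, or an empty string if unknown."""
        self.tool_calls += 1
        return self.pages.get(url, "")


env = LocalEnvironment({
    "local://milvus": "milvus is a vector database for similarity search",
    "local://postgres": "postgresql is a relational database",
})
hits = env.search("vector database", top_k=1)
page = env.browse(hits[0])
print(hits, env.tool_calls)
```

Because the corpus is frozen, repeated rollouts see the same search results for the same query, which is one plausible reading of the "stable" in the component name.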

Main Results

LiteResearcher-4B consistently outperforms open-source models up to 8× larger and matches or exceeds proprietary systems across eight benchmarks, while remaining a low-cost 4B agent trained with fully local RL tool calls.

Models                  GAIA-Text   Browsecomp  Browse.(ZH) HLE         Frames      Webwalker   Seal-0      Xbench-DS
Commercial Models
Claude-4-Sonnet         68.3        12.2        29.1        20.3        80.7        61.7        -           64.6
Claude-4.5-Sonnet       71.2        19.6        40.8        24.5        85.0        -           53.4        66.0
Deepseek-V3.2           63.5        67.6        65.0        40.8        80.2        -           38.5        71.0
DeepSeek-V3.1           63.1        30.0        49.2        29.8        83.7        61.2        -           71.0
Minimax-M2              75.7        44.0        48.5        31.8        -           -           -           72.0
OpenAI-GPT-5-high       76.4        54.9        65.0        35.2        -           -           51.4        77.8
GLM-4.6                 71.9        45.1        49.5        30.4        -           -           -           70.0
Kimi-Researcher         -           -           -           26.9        78.8        -           36.0        69.0
Kimi-K2-0905            60.2        7.4         22.2        21.7        58.1        -           25.2        61.0
Open-Source Models
Mirothinker 8B          66.4        31.1        40.2        21.5        80.6        60.6        40.4        60.6
Tongyi Deepsearch 30B   70.9        43.4        46.7        32.9        90.6        72.2        -           75.0
ASearcher QWQ v2 32B    58.7        -           -           -           74.5        -           -           51.1
WebSailor 30B           53.2        -           -           -           -           -           -           53.3
WebDancer 32B (QwQ)     51.5        3.8         18.0        -           -           47.9        -           38.3
WebExplorer 8B          50.0        15.7        32.0        17.3        75.7        62.7        -           53.7
DeepMiner 32B           58.7        33.5        40.1        -           -           -           -           62.0
AFM-RL 32B              55.3        11.1        -           18.0        -           63.0        -           -
SFR-DeepResearch 20B    66.0        -           -           28.7        82.8        -           -           -
AgentCPM-Explore 4B     63.9        24.1        29.1        19.1        82.7        68.1        40.5        70.0
LiteResearcher-4B       71.3        27.5*       32.5*       22.0        83.1        72.7        41.8        78.0

Best open-source results in bold. Results with * use a 64k context window with a memory mechanism.

Training Dynamics

Our difficulty-aware curriculum learning prevents training saturation. After Stage 1 plateaus, Stage 2 with adjusted difficulty yields a further +3.6% GAIA accuracy, demonstrating the importance of progressive curriculum design.

Figure: GAIA accuracy across training stages, showing continued improvement with curriculum learning.
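
The "retain only partially solvable instances" rule from the curriculum can be sketched as a simple filter over rollout outcomes. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `low`/`high` thresholds, and the idea of estimating difficulty from a handful of sampled rollouts per task are all hypothetical. The motivation is a general one: if the current policy solves a task every time or never, all rollouts receive the same reward, and group-normalized RL objectives (an assumption, since the page does not name the RL algorithm) get no learning signal from that task.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of sampled rollouts that reached the correct answer."""
    return sum(outcomes) / len(outcomes)


def keep_partially_solvable(
    rollouts: dict[str, list[bool]],
    low: float = 0.0,
    high: float = 1.0,
) -> list[str]:
    """Keep tasks whose pass rate lies strictly between low and high.

    With the defaults this drops tasks that are always solved or never
    solved; a later stage could tighten the band (hypothetical knob).
    """
    return [
        task
        for task, outcomes in rollouts.items()
        if outcomes and low < pass_rate(outcomes) < high
    ]


# Hypothetical outcomes of four rollouts per task under the current policy.
sampled = {
    "easy": [True, True, True, True],
    "mixed": [True, False, False, True],
    "hard": [False, False, False, False],
}
kept = keep_partially_solvable(sampled)
print(kept)
```

Re-running such a filter with the improved policy at the start of each stage would shift the retained set toward harder tasks, which matches the stage-wise difficulty adjustment described above.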

Trajectory Cases

We present 15 hand-audited rollout trajectories from LiteResearcher-4B across 8 deep-research benchmarks. Every case is judged correct and leak-free, and is verified by 4 independent Opus-4.7 (1M-context) subagents to confirm that the answer is derived from cited evidence.

BibTeX

@article{li2026literesearcher,
  title={LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent},
  author={Li, Wanli and Qu, Bince and Pan, Bo and Zhang, Jianyu and Liu, Zheng and Zhang, Pan and Chen, Wei and Zhang, Bo},
  journal={arXiv preprint arXiv:2604.17931},
  year={2026}
}