<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Zoey Li's Personal Webpage</title>
    <description>PhD Candidate @UIUC</description>
    <link>https://zoeyli.com</link>
    <atom:link href="https://zoeyli.com/feed.xml" rel="self" type="application/rss+xml" />
    
      <item>
        <title>A Reflection on Multi-Agent Role-Playing</title>
        <description>&lt;p&gt;Role-playing was the earliest multi-agent pattern: assign distinct roles via system prompts — one agent as the program manager, another as the systems architect, a third as the engineer. The motivation is intuitive — a prompt that says “you are a security expert reviewing this code” elicits more security-focused analysis than a generic “review this code” — and the wave of influential multi-agent work in 2023-2024 (CAMEL, ChatDev, MetaGPT, Multi-Agent Debate) was built on this idea.&lt;/p&gt;

&lt;p&gt;This post surveys three levels of multi-agent role-playing and examines what holds up. Using multiple personas for inference (Level 1) is now largely superseded by stronger models that adopt perspectives without explicit role assignment. Multi-role task decomposition (Level 2) produced real value, but the value came from the structural decomposition — parallelism, context isolation, tool specialization — not from the personas themselves. Multi-agent simulation (Level 3) sounds like the natural next step in system complexity, but has deep fidelity problems that remain unaddressed.&lt;/p&gt;

&lt;p&gt;A note on scope: role-playing is also a &lt;strong&gt;product category in its own right&lt;/strong&gt;, not just an engineering technique. Character.ai and Minimax’s Hailuo built consumer-scale businesses on models fine-tuned to maintain persistent character personas in conversation. In that context, optimizing for role alignment and persona consistency is the core objective — not a means to better task performance. This post focuses on role-playing as a &lt;em&gt;technique for building capable agent systems&lt;/em&gt;, not as an end-user application.&lt;/p&gt;

&lt;h2 id=&quot;persona-driven-generation-the-starting-point&quot;&gt;Persona-Driven Generation: The Starting Point&lt;/h2&gt;

&lt;p&gt;Before examining multi-agent role-playing, it’s worth isolating the value of role assignment itself. At its simplest, a persona is just a system prompt that biases a single LLM’s generation — no agent interaction, no environment, no multi-turn coordination. The persona acts as a key that unlocks a different slice of the model’s knowledge — or as Reynolds &amp;amp; McDonell [1] put it in the earliest academic treatment of the technique, it helps the model “access a portion of its memory that holds higher quality examples of the task at hand.”&lt;/p&gt;

&lt;p&gt;The idea became enormously popular in 2022-2023. Riley Goodside’s &lt;a href=&quot;https://simonwillison.net/2022/Oct/17/you-are-gpt-3/&quot;&gt;GPT-3 prompt experiments&lt;/a&gt; on Twitter demonstrated that identity assignment could radically change model behavior. The &lt;a href=&quot;https://github.com/f/awesome-chatgpt-prompts&quot;&gt;Awesome ChatGPT Prompts&lt;/a&gt; repository (143k+ stars) popularized the “Act As” pattern — and today’s system prompts still follow the same structure, just with more detail:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;You are a top-tier Web Product Architect, Full-Stack System Design Expert, and Enterprise Website Template System Consultant. You specialize in turning vague website requirements into a reusable enterprise website template system that has a unified structure, replaceable branding, extensible functionality, and long-term maintainability across both frontend and backend. Your task is not to design a single website page, and not merely to provide visual suggestions. Your task is to produce a reusable website template system design…&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The template is consistent: assign a role with specific expertise, then constrain the scope and output expectations. The persona steers the model into a behavioral mode that generic prompting wouldn’t reliably produce.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persona Hub&lt;/strong&gt; [2] demonstrates this at scale: 1 billion diverse personas curated from web data, each used to prompt LLMs from varied perspectives — generating training prompts, math problems, instructions, and knowledge-rich text. Prompting through varied personas yields far more diverse synthetic data than repeatedly prompting with a single generic instruction.&lt;/p&gt;
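&lt;p&gt;As a minimal sketch of what persona-driven generation looks like in code (the &lt;code&gt;generate&lt;/code&gt; stub and the example personas are illustrative placeholders, not Persona Hub’s actual pipeline):&lt;/p&gt;

```python
# Minimal sketch of persona-driven synthetic data generation in the
# spirit of Persona Hub. generate() and the personas below are
# illustrative stand-ins, not the paper's actual pipeline.

def generate(prompt: str) -> str:
    """Stub for an LLM call; swap in a real client in practice."""
    return f'[synthetic output for: {prompt[:48]}...]'

PERSONAS = [
    'a pediatric nurse in a rural clinic',
    'a freight logistics dispatcher',
    'a competitive speedcubing coach',
]

def persona_prompts(task: str, personas: list[str]) -> list[str]:
    # Each persona reframes the same task from a different perspective;
    # the diversity of the resulting data comes from this reframing.
    return [f'You are {p}. {task}' for p in personas]

prompts = persona_prompts(
    'Write a challenging math word problem grounded in your daily work.',
    PERSONAS,
)
samples = [generate(p) for p in prompts]
```

&lt;p&gt;Scaled up, only the persona list varies (Persona Hub curates a billion of them); the task template stays fixed.&lt;/p&gt;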

&lt;p&gt;Persona-driven generation is a single-agent technique, but it’s the foundation that multi-agent role-playing builds on. The question is whether &lt;em&gt;interaction between personas&lt;/em&gt; adds value beyond what individual role prompting provides. The evidence is nuanced — Kong et al. [3] (NAACL 2024) systematically evaluate role-play prompting across reasoning benchmarks and find that gains are task-dependent and diminish with stronger models.&lt;/p&gt;

&lt;h2 id=&quot;three-levels-of-multi-agent-role-playing&quot;&gt;Three Levels of Multi-Agent Role-Playing&lt;/h2&gt;

&lt;h3 id=&quot;level-1-multi-persona-inference&quot;&gt;Level 1: Multi-Persona Inference&lt;/h3&gt;

&lt;p&gt;Multiple roles improve the quality of a single output. The roles debate, critique, or synthesize perspectives to arrive at one answer.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Multi-Agent Debate&lt;/strong&gt; [4]: Multiple LLM instances propose individual answers, then debate over multiple rounds to converge. Improves factuality and reasoning by having agents challenge each other’s claims.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Solo Performance Prompting (SPP)&lt;/strong&gt; [5]: A single LLM dynamically identifies and simulates multiple personas internally, then has them collaborate to solve a problem. The effect only emerges in strong models (GPT-4), not in weaker ones — directly illustrating the “bounded by model capability” limitation.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;ChatEval&lt;/strong&gt; [6]: Applies multi-agent debate to &lt;strong&gt;evaluation&lt;/strong&gt; rather than problem-solving. Multiple LLM agents deliberate on text quality, mimicking human annotation panels. Produces more reliable evaluations than single-model scoring.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Chain of Agents (CoA)&lt;/strong&gt; [7] (NeurIPS 2024): Multi-agent collaboration for &lt;strong&gt;long-context tasks&lt;/strong&gt;. Worker agents sequentially process different chunks of a long input, each communicating findings to the next, followed by a manager agent that synthesizes. Each agent sees only a short context, sidestepping long-context degradation. Up to 10% improvement over RAG and full-context baselines.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this level, multi-agent is essentially an &lt;strong&gt;inference-time technique&lt;/strong&gt; — analogous to self-consistency or best-of-N sampling, but with structured role-based interaction rather than independent samples. A natural question arises: how much of the benefit comes from the role structure versus simply spending more compute? The structured role interaction adds overhead (prompt engineering, communication protocol) that independent sampling doesn’t require. Whether role-based debate outperforms equivalent-compute single-model test-time scaling remains an open empirical question — and the answer likely depends on the task and model capability.&lt;/p&gt;
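&lt;p&gt;The debate pattern at this level can be sketched in a few lines. Here &lt;code&gt;ask&lt;/code&gt; is a stub for a role-prompted model call, and the convergence check (exact string agreement, with a majority-vote fallback) is deliberately naive:&lt;/p&gt;

```python
# Minimal sketch of a multi-agent debate loop. ask() is a stub for a
# role-prompted LLM call; a real implementation would pass the role as
# a system prompt and the peers' latest answers in context.

def ask(role: str, question: str, peer_answers: list[str]) -> str:
    return f'{role} answer given {len(peer_answers)} peer views'

def debate(question: str, roles: list[str], rounds: int = 3) -> str:
    answers = [ask(r, question, []) for r in roles]
    for _ in range(rounds):
        # Each agent revises its answer after reading the others'.
        answers = [
            ask(r, question, [a for j, a in enumerate(answers) if j != i])
            for i, r in enumerate(roles)
        ]
        if len(set(answers)) == 1:  # all agents agree: stop early
            return answers[0]
    # No consensus within budget: fall back to majority vote.
    return max(set(answers), key=answers.count)

final = debate('Is 17077 prime?', ['debater A', 'debater B', 'critic'])
```

&lt;p&gt;The equivalent-compute comparison from the paragraph above is easy to state in this frame: the same token budget could instead buy independent samples plus a vote, with no communication protocol at all.&lt;/p&gt;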

&lt;h3 id=&quot;level-2-multi-role-task-decomposition&quot;&gt;Level 2: Multi-Role Task Decomposition&lt;/h3&gt;

&lt;p&gt;Multiple roles break down a large task into phases or components, each handled by a specialist agent. The output is a complex artifact (software, research report), not a single answer.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;CAMEL&lt;/strong&gt; [8] (NeurIPS 2023): The first explicit role-playing framework. Two agents converse via inception prompting to complete a collaborative task. Limited — no tool use or code execution. Key finding: pure role-playing conversation without tool access is insufficient for most real tasks.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;ChatDev&lt;/strong&gt; [9] (ACL 2024): A virtual software company where agents fill roles across a waterfall-style development process — CEO, CTO, programmer, reviewer, tester. The key contribution over CAMEL: agents actually execute code and run tests, closing the loop between dialogue and concrete artifacts.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;MetaGPT&lt;/strong&gt; [10] (ICLR 2024): Similar setup but agents communicate through &lt;strong&gt;structured documents&lt;/strong&gt; (PRDs, system designs, API specs) rather than free-form dialogue. Incorporates standardized operating procedures (SOPs), making coordination explicit rather than emergent. Reduces hallucination compared to chat-based communication.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;AgentVerse&lt;/strong&gt; [11] (ICLR 2024): Addresses rigidity of fixed-role systems with &lt;strong&gt;dynamic expert recruitment&lt;/strong&gt; — a recruiter agent generates expert descriptions on the fly, and group composition is adjusted based on feedback. This moves toward runtime role assignment rather than pre-designed role taxonomies.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;AutoGen&lt;/strong&gt; [12]: A general-purpose multi-agent conversation framework from Microsoft. Domain-agnostic — agents are defined by capabilities rather than fixed roles.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The evolution from CAMEL to ChatDev to MetaGPT shows increasing structure in inter-agent communication: free-form chat, then phased chat, then structured documents with SOPs. Each iteration constrains communication further to reduce hallucination — mirroring how human organizations evolve from informal team chat to formal engineering processes.&lt;/p&gt;
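&lt;p&gt;In code, the structured-document idea reduces to agents exchanging typed artifacts instead of chat transcripts. A sketch, with field names that are illustrative rather than MetaGPT’s actual schemas:&lt;/p&gt;

```python
# Sketch of structured inter-agent communication: agents hand off
# typed documents, not free-form chat. The schema below is
# illustrative, not MetaGPT's actual document format.

from dataclasses import dataclass, field

@dataclass
class DesignDoc:
    goal: str
    api_endpoints: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

def architect(requirements: str) -> DesignDoc:
    # Stub: a real system would prompt a model to fill this schema
    # (e.g. via JSON-constrained decoding), then validate it.
    return DesignDoc(
        goal=requirements,
        api_endpoints=['GET /items', 'POST /items'],
    )

def engineer(doc: DesignDoc) -> str:
    # The engineer consumes typed fields, not a chat transcript,
    # leaving far less room for hallucinated requirements.
    return '\n'.join(f'# TODO implement {e}' for e in doc.api_endpoints)

doc = architect('a minimal inventory service')
stub_code = engineer(doc)
```

&lt;p&gt;The constraint is the point: a schema can be validated mechanically before the handoff, which free-form dialogue cannot.&lt;/p&gt;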

&lt;p&gt;&lt;strong&gt;The open question: do we need roles at all?&lt;/strong&gt; These early systems have hard-coded workflows — the PM always talks to the architect who always talks to the engineer. The roles and their communication graph are designed by the system builder for a specific task type. This raises the question of whether &lt;em&gt;named persistent roles&lt;/em&gt; add value over simpler alternatives:&lt;/p&gt;

&lt;pre class=&quot;mermaid&quot;&gt;
graph LR
    subgraph s1 [&quot;Fixed Roles&quot;]
        direction LR
        PM[&quot;🏷️ PM&quot;] --&amp;gt; Arch[&quot;🏷️ Architect&quot;] --&amp;gt; Eng[&quot;🏷️ Engineer&quot;] --&amp;gt; Test[&quot;🏷️ Tester&quot;]
    end

    subgraph s2 [&quot;Ad-hoc Sub-agents&quot;]
        direction TB
        C[&quot;Coordinator&quot;] --&amp;gt; S1[&quot;subtask 1&quot;]
        C --&amp;gt; S2[&quot;subtask 2&quot;]
        C --&amp;gt; S3[&quot;subtask 3&quot;]
    end

    subgraph s3 [&quot;Parallel Agents&quot;]
        direction TB
        A1[&quot;agent&quot;] &amp;lt;--&amp;gt; A2[&quot;agent&quot;]
        A2 &amp;lt;--&amp;gt; A3[&quot;agent&quot;]
        A3 &amp;lt;--&amp;gt; A1
    end

    s1 ~~~ s2 ~~~ s3
&lt;/pre&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Ad-hoc sub-agents&lt;/strong&gt;: The coordinator spawns a sub-agent with a task-specific prompt on the fly — no pre-defined role, no persistent identity. The sub-agent exists for one subtask and is discarded. More flexible because the coordinator decides the decomposition at runtime based on the actual problem.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Parallel agents&lt;/strong&gt;: Multiple agents with no fixed roles work on the same problem, self-organizing through communication. No hierarchy, no predefined workflow.&lt;/li&gt;
&lt;/ul&gt;
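&lt;p&gt;A sketch of the ad-hoc sub-agent pattern, with &lt;code&gt;run_subagent&lt;/code&gt; as a stub for an isolated LLM call and a hard-coded decomposition standing in for one the coordinator would produce at runtime:&lt;/p&gt;

```python
# Minimal sketch of the ad-hoc sub-agent pattern: the coordinator
# writes a task-specific prompt at runtime and discards the sub-agent
# afterwards. run_subagent() is an illustrative stub for an LLM call.

def run_subagent(prompt: str) -> str:
    return f'[result for: {prompt}]'

def coordinator(task: str) -> list[str]:
    # Runtime decomposition: in a real system the subtasks would come
    # from the model itself, based on the actual problem.
    subtasks = [
        f'Research prior art for: {task}',
        f'Draft an implementation plan for: {task}',
        f'List the main risks of: {task}',
    ]
    # Each sub-agent sees only its own prompt (context isolation) and
    # could run in parallel; no persistent role, no named identity.
    return [run_subagent(s) for s in subtasks]

results = coordinator('migrating the search index')
```

&lt;p&gt;Note what is absent: no role taxonomy, no communication graph. The structural benefits — parallelism and context isolation — survive without either.&lt;/p&gt;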

&lt;p&gt;The case for named roles is strongest when the task has a &lt;strong&gt;recurring, well-understood structure&lt;/strong&gt; (software development has a stable decomposition into planning/coding/testing). The case weakens for novel or variable tasks where the right decomposition isn’t known in advance. As models become more capable at runtime planning, the value shifts from pre-defined role taxonomies to dynamic task decomposition.&lt;/p&gt;

&lt;h3 id=&quot;level-3-multi-agent-simulation&quot;&gt;Level 3: Multi-Agent Simulation&lt;/h3&gt;

&lt;p&gt;Multiple agents operate in a shared environment to simulate social dynamics. The goal is emergent collective behavior rather than a concrete output.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Generative Agents&lt;/strong&gt; [13] (UIST 2023): 25 LLM-powered agents inhabit a sandbox environment (reminiscent of The Sims), each with memory, reflection, and planning modules. Agents autonomously plan their days, share news, form relationships, and coordinate group activities.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Generative Agent Simulations of 1,000 People&lt;/strong&gt; [14]: Scales to &lt;strong&gt;1,052 agents grounded in real individuals&lt;/strong&gt; constructed from qualitative interview data. Agents replicate participants’ responses on the General Social Survey 85% as accurately as participants replicate their own answers two weeks later. Critical finding: grounding agents in actual interview data rather than demographic descriptions reduces accuracy biases across racial and ideological groups.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;StableAlignment&lt;/strong&gt; [15]: Uses multi-agent simulation for &lt;strong&gt;generating alignment training data&lt;/strong&gt;. Agents interact in a simulated social environment where they learn value consensus through dialogue, mirroring how humans develop social norms through interaction rather than memorization.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this level, multi-agent systems are closer to &lt;strong&gt;social science tools&lt;/strong&gt; than engineering tools — they study emergent phenomena rather than optimize for task completion.&lt;/p&gt;

&lt;h2 id=&quot;the-simulation-fidelity-problem&quot;&gt;The Simulation Fidelity Problem&lt;/h2&gt;

&lt;p&gt;Level 3 is uniquely vulnerable to fidelity failures because agents are &lt;em&gt;models of people&lt;/em&gt;, not tools for task completion. Two failure modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bias and stereotyping.&lt;/strong&gt; Even when grounded in interview data, agents can hallucinate behaviors or fall back on LLM stereotypes when the interview doesn’t cover a specific scenario. Agents specified only by demographics are worse — the LLM fills in missing detail with statistical stereotypes. This is especially dangerous when simulations inform policy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lack of diversity and long-tail coverage.&lt;/strong&gt; LLMs are trained on majority-dominated data, so simulated agents tend to converge toward mainstream behaviors. A simulation of 1,000 agents may look diverse on surface demographics but behave with suspiciously uniform reasoning patterns underneath.&lt;/p&gt;

&lt;p&gt;OmniBehavior [16], a benchmark built on real user interaction logs from Kuaishou, provides systematic empirical evidence of these failure modes. It identifies four specific biases:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Hyper-activity bias&lt;/strong&gt;: Real users show positive interaction rates below 10%; LLM simulators overestimate by 40-60%&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Emotional suppression&lt;/strong&gt;: LLMs cluster around neutral/positive sentiment while real users frequently express strong negative emotions&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Language homogenization&lt;/strong&gt;: Simulated utterances are more polite, hedged, and face-saving than real users’ direct communication&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Personality erasure&lt;/strong&gt;: Real users show large inter-user variation and small intra-user variation (consistent personalities); LLM-generated users show heavily overlapping distributions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The best-performing model scored only 44.55 overall on the benchmark, and most models fell below 40% F1 on binary behavior prediction. These results point to systematic gaps rather than minor calibration issues.&lt;/p&gt;

&lt;p&gt;Two applications illustrate the practical stakes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Simulating users&lt;/strong&gt; (for product testing, UX research): LLM-simulated users tend to be more cooperative, articulate, and predictable than real users. They rarely exhibit the chaotic, irrational, or creative misuse patterns that define real user behavior. A product tested against simulated users will appear to work well, then fail on real edge cases. The simulation creates a false sense of validation.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Simulating financial markets&lt;/strong&gt;: Market behavior emerges from the interaction of heterogeneous agents with private information, irrational biases, and adversarial strategies. LLM agents lack genuine information asymmetry, don’t experience fear or greed, and can’t model reflexive dynamics where beliefs about the market change the market. Simulated markets will miss the fat-tailed distributions and flash crashes that matter most for risk management. (Note the distinction: &lt;em&gt;simulating a market with LLM personas&lt;/em&gt; is the problematic pattern here. Placing an agent in a market environment and optimizing for investment outcomes — following the AlphaZero principle — could be legitimate. The bottleneck shifts from persona fidelity to environment fidelity: how accurately does the market simulator reproduce real market dynamics?)&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both cases share a common tension: &lt;strong&gt;the simulation is most useful for understanding the behaviors the LLM is least equipped to produce.&lt;/strong&gt;&lt;/p&gt;

&lt;h3 id=&quot;the-tail-risk-paradox&quot;&gt;The Tail-Risk Paradox&lt;/h3&gt;

&lt;div class=&quot;notice--info&quot;&gt;
  &lt;p&gt;The scenarios where simulation is most &lt;em&gt;valuable&lt;/em&gt; are exactly the scenarios where it is least &lt;em&gt;reliable&lt;/em&gt;. Simulation’s value proposition is to explore situations you can’t observe directly — natural disasters, financial crises. But these rare events are where: (1) LLM training data is sparsest, (2) human behavior deviates most dramatically from the average, and (3) the cost of getting the simulation wrong is highest.&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;People under the pressure of a real natural disaster exhibit behaviors — irrational evacuation decisions, resource hoarding, spontaneous cooperation with strangers — that are qualitatively different from what an LLM would extrapolate from everyday behavior patterns. A simulation that captures normal-condition behavior but misses tail-risk dynamics risks providing false confidence — potentially more misleading than having no simulation at all.&lt;/p&gt;

&lt;h2 id=&quot;simulate-the-decision-process-not-the-persona&quot;&gt;Simulate the Decision-Process, Not the Persona&lt;/h2&gt;

&lt;p&gt;The most successful multi-agent simulation is arguably &lt;strong&gt;AlphaZero&lt;/strong&gt; [19]: two models simulate two Go players. Their objective is clear (win the game), the environment enforces strict rules, and the agents’ “persona” — playstyle, strategy, even creativity — &lt;strong&gt;emerges from the optimization process&lt;/strong&gt; rather than being prescribed.&lt;/p&gt;

&lt;p&gt;This points to a fundamental design principle: &lt;strong&gt;simulate the decision-making process, not the persona&lt;/strong&gt;.&lt;/p&gt;

&lt;div class=&quot;notice--info&quot;&gt;
  &lt;p&gt;AlphaZero works because it has:&lt;/p&gt;
  &lt;ol&gt;
    &lt;li&gt;&lt;strong&gt;A well-defined objective&lt;/strong&gt; — the reward signal is unambiguous&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Environment-enforced constraints&lt;/strong&gt; — illegal moves are impossible, not just discouraged&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Ground-truth feedback&lt;/strong&gt; — win/loss is verifiable, not judged by another LLM&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Emergent identity&lt;/strong&gt; — the agent’s “personality” is a byproduct of optimization, not a prompt&lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;

&lt;p&gt;Level 3 persona simulation operates in a very different regime. The objective is open-ended (“act like this person”), constraints are soft, feedback requires human judgment or proxy LLMs, and identity is prescribed rather than emergent.&lt;/p&gt;

&lt;p&gt;This also explains why &lt;strong&gt;Level 2 training is more tractable&lt;/strong&gt; than Level 3: task decomposition has verifiable sub-objectives (did the code compile? did the search find the answer?), making it structurally closer to the AlphaZero setup. The “roles” in Level 2 are instrumental — they exist because they’re useful for the task, not because they’re simulating someone.&lt;/p&gt;
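&lt;p&gt;What a “verifiable sub-objective” means concretely: the reward for an engineer-style agent can be grounded in a compile check rather than another model’s judgment. A minimal sketch:&lt;/p&gt;

```python
# Sketch of a verifiable sub-objective for an engineer-style agent:
# the signal is whether candidate code actually byte-compiles, not
# what another LLM thinks of it.

import py_compile
import tempfile

def compiles(source: str) -> bool:
    # Write the candidate to a temp file and try to byte-compile it.
    with tempfile.NamedTemporaryFile('w', suffix='.py',
                                     delete=False) as f:
        f.write(source)
        path = f.name
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError:
        return False

ok = compiles('def double(x):\n    return x * 2')   # valid Python
bad = compiles('def broken(:')                      # syntax error
```

&lt;p&gt;Running tests, type checks, or a search-hit criterion slot into the same shape; the common property is that the feedback is ground truth, which is exactly what Level 3 persona simulation lacks.&lt;/p&gt;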

&lt;p&gt;Two concrete applications where decision-process simulation works:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Red-teaming for LLM safety&lt;/strong&gt;: The attacker agent’s objective is to jailbreak the target model. The target’s refusal or compliance provides ground-truth feedback, and attack strategies emerge from optimization rather than being hand-designed. The attacker doesn’t need a persona (“act like a hacker”); it needs an objective (elicit harmful output) and an environment that scores success. This is why learned red-teaming [17] discovers attack vectors that manual prompt engineering misses.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Scientific discovery and optimization.&lt;/strong&gt; Karpathy’s &lt;a href=&quot;https://github.com/karpathy/autoresearch&quot;&gt;autoresearch&lt;/a&gt; is the minimal version: a single agent in a modify-train-evaluate loop with a clear objective and ground-truth feedback. CORAL [18] scales this to 4 agents working asynchronously — no roles, no personas, just shared filesystem-based memory where agents build on each other’s discoveries. Results are verifiable by execution, and 4-agent co-evolution outperforms best-of-4 independent runs, showing that coordination adds value beyond raw compute scaling.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
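&lt;p&gt;A toy version of the red-teaming loop makes the contrast with persona prompting concrete. Both functions below are stubs: the attacker conditions on past (attack, success) pairs, and success is judged by the environment, not by a persona:&lt;/p&gt;

```python
# Toy sketch of objective-driven red-teaming: no hacker persona, just
# an attack generator and an environment that scores success. Both
# functions are illustrative stubs.

def propose_attack(history: list[tuple[str, bool]]) -> str:
    # Stub: a trained attacker would condition on past (attack,
    # success) pairs; here we simply cycle through canned candidates.
    canned = ['prompt A', 'prompt B', 'prompt C']
    return canned[len(history) % len(canned)]

def target_complies(attack: str) -> bool:
    # Ground-truth feedback: did the target produce the disallowed
    # output? Stubbed so that only 'prompt C' succeeds.
    return attack == 'prompt C'

def red_team(budget: int = 10) -> list[str]:
    history: list[tuple[str, bool]] = []
    successes = []
    for _ in range(budget):
        attack = propose_attack(history)
        ok = target_complies(attack)
        history.append((attack, ok))
        if ok:
            successes.append(attack)
    return successes

found = red_team()
```

&lt;p&gt;The objective and the scoring environment fully specify the agent; whatever “attacker style” emerges is a byproduct of optimization, mirroring the AlphaZero setup.&lt;/p&gt;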

&lt;h2 id=&quot;the-catch-22-of-simulating-people&quot;&gt;The Catch-22 of Simulating People&lt;/h2&gt;

&lt;p&gt;The AlphaZero analogy has a fundamental limitation when applied to simulating people: human decision processes are not fully rational. Humans satisfice (bounded rationality), exhibit systematic biases (loss aversion, anchoring, hyperbolic discounting), and are influenced by emotion, social pressure, and cognitive load. These deviations from optimality aren’t noise — they &lt;em&gt;are&lt;/em&gt; the phenomena that human simulation needs to capture.&lt;/p&gt;

&lt;p&gt;This creates a catch-22:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Optimize the agent&lt;/strong&gt; (AlphaZero-style) and it converges toward rational behavior, becoming &lt;em&gt;too optimal&lt;/em&gt; to simulate real humans. A perfectly rational financial agent misses panic selling, FOMO, and herd behavior.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Don’t optimize&lt;/strong&gt; and you’re back to persona prompting with all its fidelity problems.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Optimize for irrationality&lt;/strong&gt; and you need a reward signal that captures &lt;em&gt;how&lt;/em&gt; humans deviate from rationality — which requires the same kind of behavioral data that the 1,000-person study used. The reward becomes “match this distribution of irrational behavior,” a much harder specification problem than “win the game.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An LLM can &lt;em&gt;describe&lt;/em&gt; cognitive biases fluently — it knows what loss aversion is — but it doesn’t &lt;em&gt;exhibit&lt;/em&gt; them under optimization pressure. The knowledge is declarative, not procedural. Training an agent to behave irrationally in the &lt;em&gt;right ways&lt;/em&gt; rather than &lt;em&gt;wrong ways&lt;/em&gt; is an open problem.&lt;/p&gt;

&lt;h2 id=&quot;limitations-by-level&quot;&gt;Limitations by Level&lt;/h2&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Limitation&lt;/th&gt;
      &lt;th&gt;Level 1 (Inference)&lt;/th&gt;
      &lt;th&gt;Level 2 (Task Decomposition)&lt;/th&gt;
      &lt;th&gt;Level 3 (Simulation)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Bounded by model capability&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Core issue — role-prompting gains shrink as models improve&lt;/td&gt;
      &lt;td&gt;Less relevant — structural benefits persist regardless&lt;/td&gt;
      &lt;td&gt;Irrelevant — the goal is fidelity, not performance&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Roles are harness-dependent&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Minor — roles are simple (debater, critic)&lt;/td&gt;
      &lt;td&gt;Core issue — workflows must be designed per domain&lt;/td&gt;
      &lt;td&gt;Minor — agent identities are the point&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Communication overhead&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Moderate — debate rounds cost tokens&lt;/td&gt;
      &lt;td&gt;Core issue — inter-role communication consumes context&lt;/td&gt;
      &lt;td&gt;Moderate — interactions &lt;em&gt;are&lt;/em&gt; the simulation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Coordination failures&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Low risk — converging to an answer is well-defined&lt;/td&gt;
      &lt;td&gt;High risk — conflicting outputs without coordination training&lt;/td&gt;
      &lt;td&gt;N/A — “failures” may be realistic behavior&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Simulation fidelity&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;N/A&lt;/td&gt;
      &lt;td&gt;Core issue — systematic biases undermine the value proposition&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;reflections&quot;&gt;Reflections&lt;/h2&gt;

&lt;p&gt;Having worked on multi-agent systems that follow both the task-decomposition and simulation patterns, I write this reflection partly from my own experience: the patterns were effective for the problems and models we had at the time, but the reasons they worked were often more structural than we initially recognized.&lt;/p&gt;

&lt;p&gt;It’s worth noting that the influential multi-agent role-playing work — CAMEL, ChatDev, MetaGPT, AgentVerse, Multi-Agent Debate — was overwhelmingly a &lt;strong&gt;2023-2024 phenomenon&lt;/strong&gt;. The context matters: models at that time were more sensitive to role prompts (explicit persona assignment had a measurable effect on output quality), and their individual capabilities were limited enough that task decomposition across multiple agents was often &lt;em&gt;necessary&lt;/em&gt; to complete complex tasks that a single model couldn’t handle alone. Role-playing was a pragmatic response to the models available at the time — a &lt;strong&gt;circumstantial design pattern&lt;/strong&gt; rather than a fundamental architectural principle.&lt;/p&gt;

&lt;p&gt;As models have grown more capable — better instruction following, longer context, stronger reasoning — the conditions that made role-playing effective have partially dissolved. Frontier models already adopt appropriate perspectives without explicit role assignment, and a single capable model can often handle tasks that previously required a virtual software company of specialized agents. The three core motivations for multi-agent systems — parallelism, context isolation, and tool specialization — are all &lt;strong&gt;structural&lt;/strong&gt; benefits that don’t inherently require personas. Systems like AlphaZero, learned red-teaming, and CORAL succeed with clear objectives, verifiable feedback, and emergent rather than prescribed identity. A more promising direction is to move beyond prompt-level role assignment and &lt;strong&gt;train models to collaborate better&lt;/strong&gt; — learning coordination, task decomposition, and result integration through optimization rather than scaffold design. Kimi K2.5’s Agent Swarm [20] (PARL) is an early example: an RL-trained orchestrator learns to dispatch and coordinate sub-agents, with auxiliary rewards that shape exploration away from degenerate coordination patterns. The interesting frontier is not “how do we assign better roles?” but “how do we train agents to decompose and coordinate dynamically?”&lt;/p&gt;

&lt;p&gt;This makes recent work that attempts to scale up multi-agent role-playing — more agents, more elaborate role taxonomies, more complex simulated organizations — worth scrutinizing carefully. These systems can produce impressive demos, but the gap between demo and real-world applicability tends to widen as the role-playing scaffold grows more elaborate. Adding more named roles and richer communication protocols increases engineering complexity without addressing the underlying question: are the roles doing the work, or would the structural decomposition underneath suffice with simpler (or no) persona assignment? The risk is that scaling up a circumstantial pattern produces diminishing returns — more moving parts, but not more capability.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] Reynolds, L., &amp;amp; McDonell, K. (2021). &lt;em&gt;Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm.&lt;/em&gt; CHI 2021. &lt;a href=&quot;https://arxiv.org/abs/2102.07350&quot;&gt;arXiv:2102.07350&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[2] Ge, T., Hu, X., Wang, L., Chen, S., Tao, C., Wang, Z., … &amp;amp; Wei, F. (2024). &lt;em&gt;Scaling Synthetic Data Creation with 1,000,000,000 Personas.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2406.20094&quot;&gt;arXiv:2406.20094&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[3] Kong, A., Zhao, S., Chen, H., Li, Q., Qin, Y., Sun, R., &amp;amp; Zhou, X. (2024). &lt;em&gt;Better Zero-Shot Reasoning with Role-Play Prompting.&lt;/em&gt; NAACL 2024. &lt;a href=&quot;https://arxiv.org/abs/2308.07702&quot;&gt;arXiv:2308.07702&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[4] Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., &amp;amp; Mordatch, I. (2023). &lt;em&gt;Improving Factuality and Reasoning in Language Models through Multiagent Debate.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2305.14325&quot;&gt;arXiv:2305.14325&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[5] Wang, Z., Peng, S., Dong, D., Ma, J., &amp;amp; Lam, W. (2023). &lt;em&gt;Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2307.05300&quot;&gt;arXiv:2307.05300&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[6] Chan, C. M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., … &amp;amp; Liu, Z. (2023). &lt;em&gt;ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2308.07201&quot;&gt;arXiv:2308.07201&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[7] Zhang, Y., Chen, Y., Jiang, C., Liu, J., Choi, Y., &amp;amp; Oh, J. (2024). &lt;em&gt;Chain of Agents: Large Language Models Collaborating on Long-Context Tasks.&lt;/em&gt; NeurIPS 2024. &lt;a href=&quot;https://arxiv.org/abs/2406.02818&quot;&gt;arXiv:2406.02818&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[8] Li, G., Hammoud, H. A. A. K., Itani, H., Khizbullin, D., &amp;amp; Ghanem, B. (2023). &lt;em&gt;CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society.&lt;/em&gt; NeurIPS 2023. &lt;a href=&quot;https://arxiv.org/abs/2303.17760&quot;&gt;arXiv:2303.17760&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[9] Qian, C., Cong, X., Yang, C., Chen, W., Su, Y., Xu, J., … &amp;amp; Sun, M. (2024). &lt;em&gt;Communicative Agents for Software Development.&lt;/em&gt; ACL 2024. &lt;a href=&quot;https://arxiv.org/abs/2307.07924&quot;&gt;arXiv:2307.07924&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[10] Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Zhang, C., … &amp;amp; Schmidhuber, J. (2024). &lt;em&gt;MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework.&lt;/em&gt; ICLR 2024. &lt;a href=&quot;https://arxiv.org/abs/2308.00352&quot;&gt;arXiv:2308.00352&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[11] Chen, W., Su, Y., Zuo, J., Yang, C., Yuan, C., Chan, C. M., … &amp;amp; Liu, Z. (2024). &lt;em&gt;AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors.&lt;/em&gt; ICLR 2024. &lt;a href=&quot;https://arxiv.org/abs/2308.10848&quot;&gt;arXiv:2308.10848&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[12] Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., … &amp;amp; Wang, C. (2023). &lt;em&gt;AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2308.08155&quot;&gt;arXiv:2308.08155&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[13] Park, J. S., O’Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., &amp;amp; Bernstein, M. S. (2023). &lt;em&gt;Generative Agents: Interactive Simulacra of Human Behavior.&lt;/em&gt; UIST 2023. &lt;a href=&quot;https://arxiv.org/abs/2304.03442&quot;&gt;arXiv:2304.03442&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[14] Park, J. S., Zou, C., Shaw, A., Hill, B. M., Cai, C., Morris, M. R., … &amp;amp; Bernstein, M. S. (2024). &lt;em&gt;Generative Agent Simulations of 1,000 People.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2411.10109&quot;&gt;arXiv:2411.10109&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[15] Liu, R., Yang, R., Jia, C., Zhang, G., Zhou, D., Dai, A. M., … &amp;amp; Vosoughi, S. (2023). &lt;em&gt;Training Socially Aligned Language Models on Simulated Social Interactions.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2305.16960&quot;&gt;arXiv:2305.16960&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[16] OmniBehavior Benchmark. &lt;a href=&quot;https://omnibehavior.github.io/homepage.html&quot;&gt;omnibehavior.github.io&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[17] Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides, J., … &amp;amp; Irving, G. (2022). &lt;em&gt;Red Teaming Language Models with Language Models.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2202.03286&quot;&gt;arXiv:2202.03286&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[18] Li, J., et al. (2025). &lt;em&gt;CORAL: Co-Evolving LLM Agents for Autonomous Problem Solving.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2604.01658&quot;&gt;arXiv:2604.01658&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[19] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., … &amp;amp; Hassabis, D. (2018). &lt;em&gt;A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go through Self-Play.&lt;/em&gt; Science, 362(6419). &lt;a href=&quot;https://doi.org/10.1126/science.aar6404&quot;&gt;DOI:10.1126/science.aar6404&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[20] Kimi Team. (2026). &lt;em&gt;Kimi K2.5: Scaling Reinforcement Learning with LLMs.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2602.02276&quot;&gt;arXiv:2602.02276&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Mon, 20 Apr 2026 00:00:00 -0500</pubDate>
        <link>https://zoeyli.com/ai%20agents/The-Myth-of-Multi-Agent-as-Role-Playing/</link>
        <guid isPermaLink="true">https://zoeyli.com/ai%20agents/The-Myth-of-Multi-Agent-as-Role-Playing/</guid>
      </item>
    
      <item>
        <title>Context Management for LLM Agents: A Memory Hierarchy View</title>
        <description>&lt;p&gt;Context is what the model sees at each forward pass: system prompt, tool definitions, conversation history, tool call results, and any retrieved or injected information. As LLM agents take on longer, more complex tasks, context management becomes the central bottleneck — not because models lack context &lt;em&gt;length&lt;/em&gt;, but because they lack strategies for using context &lt;em&gt;well&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This post surveys how agents learn to manage their own context, from harness-driven compaction through memory tools to sub-agent delegation.&lt;/p&gt;

&lt;h2 id=&quot;why-context-management-matters&quot;&gt;Why Context Management Matters&lt;/h2&gt;

&lt;p&gt;Four distinct pressures push agents toward context limits, each with different characteristics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Deep reasoning exceeds context length.&lt;/strong&gt; On truly difficult problems, test-time scaling means the model’s chain of thought can grow to tens or hundreds of thousands of tokens — potentially exceeding the context window during a single reasoning pass. The bottleneck is the model’s own generation, not external input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Tool outputs are long and off-policy.&lt;/strong&gt; Search agents ingest web pages, code execution outputs, and API responses that can be tens of thousands of tokens each. A single screenshot in a GUI agent can consume thousands of tokens. These tokens consume context budget &lt;em&gt;and&lt;/em&gt; push the model into out-of-distribution conditioning states.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Long-horizon tasks accumulate history.&lt;/strong&gt; Tasks such as SWE agent runs or deep research sessions span hundreds of turns. Even with short individual turns, the accumulated history grows linearly and eventually saturates the window. Information relevant to the current step may have been generated many turns earlier, and naive FIFO truncation loses it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. User personalization requires cross-session state.&lt;/strong&gt; Agents that serve the same user repeatedly need to remember preferences and prior decisions across separate conversations. The context window resets each session; personalization requires persistent storage.&lt;/p&gt;

&lt;p&gt;These pressures often co-occur: a research agent (pressure 3) ingests long web pages (pressure 2) while reasoning deeply about their content (pressure 1) for a user whose preferences shape the research direction (pressure 4).&lt;/p&gt;

&lt;h3 id=&quot;long-context-vs-long-horizon&quot;&gt;Long Context vs. Long Horizon&lt;/h3&gt;

&lt;p&gt;Two capabilities, often conflated, are largely distinct in practice:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Long-context capability&lt;/strong&gt;: Can the model attend to and retrieve from information at large context lengths? This is an architecture + pretraining problem (RoPE scaling, context extension, attention patterns), and is typically measured by &lt;em&gt;input&lt;/em&gt;-heavy retrieval benchmarks (RULER, needle-in-a-haystack).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Long-horizon reasoning&lt;/strong&gt;: Can the model make productive use of additional thinking tokens or interaction turns? This is a training distribution + reasoning strategy problem, and the bottleneck is on the &lt;em&gt;output&lt;/em&gt; side — sustaining coherent generation over tens or hundreds of thousands of tokens.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These look independent partly because we measure them differently. Long-context benchmarks test retrieval from long &lt;em&gt;inputs&lt;/em&gt; but rarely test long &lt;em&gt;generation&lt;/em&gt;. A model that aces needle-in-a-haystack at 1M context may still degrade past ~128k of its own reasoning, filling the space with repetition and circular logic — but we wouldn’t know from the retrieval benchmark alone. The capabilities may share more infrastructure than the benchmarks suggest (attention quality at long range matters for both), but current evaluation treats them as separate.&lt;/p&gt;

&lt;p&gt;What’s clearer is the asymmetry in the other direction: context management can provide long-horizon reasoning &lt;em&gt;without&lt;/em&gt; long context. A model with a 32k window can sustain an effective reasoning depth of 500k tokens if it periodically compresses its own reasoning into summaries. &lt;strong&gt;Context management can substitute for context length.&lt;/strong&gt;&lt;/p&gt;

&lt;h3 id=&quot;the-llm-memory-hierarchy&quot;&gt;The LLM Memory Hierarchy&lt;/h3&gt;

&lt;p&gt;Just as computer architecture organizes storage into a hierarchy (registers → RAM → disk → network) trading speed for capacity, LLM agents operate across an analogous hierarchy:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Level&lt;/th&gt;
      &lt;th&gt;Computer analogy&lt;/th&gt;
      &lt;th&gt;Speed&lt;/th&gt;
      &lt;th&gt;Capacity&lt;/th&gt;
      &lt;th&gt;Persistence&lt;/th&gt;
      &lt;th&gt;Management&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;KV cache&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;CPU cache/registers&lt;/td&gt;
      &lt;td&gt;Hardware-speed&lt;/td&gt;
      &lt;td&gt;Bounded by GPU memory&lt;/td&gt;
      &lt;td&gt;Per-generation&lt;/td&gt;
      &lt;td&gt;Hardware/framework&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Context window&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;RAM&lt;/td&gt;
      &lt;td&gt;Single forward pass&lt;/td&gt;
      &lt;td&gt;32k-1M tokens&lt;/td&gt;
      &lt;td&gt;Per-session&lt;/td&gt;
      &lt;td&gt;Harness or model&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Files&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Disk/SSD&lt;/td&gt;
      &lt;td&gt;Tool call round-trip&lt;/td&gt;
      &lt;td&gt;Large&lt;/td&gt;
      &lt;td&gt;Cross-session&lt;/td&gt;
      &lt;td&gt;Model (via tools)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Databases / vector stores&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Network storage&lt;/td&gt;
      &lt;td&gt;Search + retrieval&lt;/td&gt;
      &lt;td&gt;Unlimited&lt;/td&gt;
      &lt;td&gt;Permanent&lt;/td&gt;
      &lt;td&gt;Model (via tools)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The core design challenge at each boundary is the &lt;strong&gt;caching policy&lt;/strong&gt; — what to keep at the fast level, when to evict to a slower level, and when to promote back. This analogy is not merely illustrative — it has already driven real system designs. PagedAttention (vLLM) directly implements virtual memory for KV cache: fixed-size KV blocks, block tables for non-contiguous allocation, copy-on-write for shared sequences. KV cache compression methods (H2O, SnapKV) are literally page eviction policies.&lt;/p&gt;
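&lt;p&gt;As a concrete illustration of the “page eviction” framing, here is a minimal sketch of an H2O-style heavy-hitter policy: keep a recent window of positions plus the positions with the highest accumulated attention mass, up to a fixed budget. The one-shot ranking and the scoring details are simplifying assumptions, not the actual H2O implementation.&lt;/p&gt;

```python
def h2o_evict(attn_scores, window, budget):
    """H2O-style KV eviction sketch: keep the most recent `window` positions
    plus the highest cumulative-attention "heavy hitter" positions, up to
    `budget` kept positions total. attn_scores[i] is the accumulated
    attention mass received by token i."""
    n = len(attn_scores)
    recent = set(range(max(0, n - window), n))
    # Rank all older positions by accumulated attention, descending.
    older = sorted((i for i in range(n) if i not in recent),
                   key=lambda i: attn_scores[i], reverse=True)
    keep = recent | set(older[:max(0, budget - len(recent))])
    return sorted(keep)
```

&lt;p&gt;Real implementations score per attention head and evict incrementally during decoding; this version ranks positions once for clarity.&lt;/p&gt;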

&lt;p&gt;A notable disconnect: KV cache management is typically treated as a &lt;strong&gt;systems&lt;/strong&gt; problem (optimize throughput, minimize memory) while context management is treated as an &lt;strong&gt;algorithm&lt;/strong&gt; problem (optimize reasoning quality). But the two levels are tightly coupled — heuristic KV eviction can destroy reasoning by evicting semantically critical attention states, while algorithmic context editing operates without awareness of KV cache costs. The co-design opportunity: context editing decisions informed by KV cache pressure, and KV eviction informed by the model’s semantic understanding of what matters.&lt;/p&gt;

&lt;h3 id=&quot;who-acts-as-the-operating-system&quot;&gt;Who Acts as the “Operating System”?&lt;/h3&gt;

&lt;p&gt;Each pressure maps naturally to a boundary in the hierarchy:&lt;/p&gt;

&lt;div class=&quot;notice--info&quot;&gt;
  &lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Pressure 1&lt;/strong&gt; (deep reasoning) → &lt;strong&gt;KV cache ↔ context&lt;/strong&gt;: the model’s own reasoning outgrows the window&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Pressure 2&lt;/strong&gt; (tool outputs) → &lt;strong&gt;external data entering context&lt;/strong&gt;: long, off-policy tokens injected from the environment&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Pressure 3&lt;/strong&gt; (long-horizon accumulation) → &lt;strong&gt;context ↔ files&lt;/strong&gt;: history that must survive beyond the session window&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Pressure 4&lt;/strong&gt; (personalization) → &lt;strong&gt;files ↔ databases&lt;/strong&gt;: cross-session state that must persist permanently&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;p&gt;The question at each boundary is the same one an operating system faces: &lt;strong&gt;who decides the caching policy?&lt;/strong&gt; The “LLM as OS” analogy is not new — Karpathy’s LLM OS vision, MemGPT [1], and AIOS [2] all explored it. MemGPT in particular proposed a two-tier virtual memory system (context window as RAM, external storage as disk) with interrupt-driven swapping. We extend this to a four-level hierarchy with explicit caching policies at each boundary, and connect each context pressure to a specific level transition.&lt;/p&gt;

&lt;p&gt;In computer architecture, the OS has privileged access to cheap hardware signals (dirty bits, TLB misses, access counters). In LLM systems, no component has this privilege — the “hardware signals” (attention weights, token probabilities) are expensive to extract and noisy to interpret. This creates an evolving design spectrum for who makes memory management decisions — and this spectrum directly corresponds to the context management strategies we’ll survey:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Phase&lt;/th&gt;
      &lt;th&gt;Who manages memory&lt;/th&gt;
      &lt;th&gt;Strategy&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Phase 1 — Harness-as-system&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;External harness, using heuristics. Cheap but semantically blind.&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Harness-driven compaction&lt;/strong&gt; (Sec. 1)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Phase 2 — Model-as-system&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Model manages memory via tool calls. More accurate but costly in tokens.&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Memory as a tool&lt;/strong&gt; (Sec. 2)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Phase 3 — Model-informed system&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;System uses signals &lt;em&gt;from&lt;/em&gt; the model (e.g. Free()LM’s [3] LoRA-switched cleaning mode), but the model doesn’t bear the full cost.&lt;/td&gt;
      &lt;td&gt;Trained compaction (Sec. 1)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Phase 4 — Agent-as-OS&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Main agent schedules sub-agents and allocates fresh contexts. Sub-agents return condensed results.&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Sub-agents&lt;/strong&gt; (Sec. 3)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;context-management-strategies&quot;&gt;Context Management Strategies&lt;/h2&gt;

&lt;p&gt;The strategies below are ordered by &lt;strong&gt;increasing model autonomy&lt;/strong&gt; — from the harness making all decisions, to the model managing its own memory, to the model delegating entire subtasks. They are not mutually exclusive; production systems typically combine all three.&lt;/p&gt;

&lt;h3 id=&quot;1-harness-driven-compaction&quot;&gt;1. Harness-Driven Compaction&lt;/h3&gt;

&lt;p&gt;The base case: an external harness monitors context length and triggers compression when it exceeds a threshold. The model has no control over &lt;em&gt;when&lt;/em&gt; or &lt;em&gt;what&lt;/em&gt; gets compressed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;History compaction&lt;/strong&gt;: When context reaches N tokens, the harness passes the conversation history to a summarizer (often the same model) which compresses it. The agent continues with the summary plus a small window of recent state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool output preprocessing&lt;/strong&gt;: Rather than passing raw tool outputs into context, the harness intercepts them and runs a summarizer first — condensing a full web page into the relevant paragraphs before the agent ever sees it.&lt;/p&gt;
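&lt;p&gt;The threshold-triggered loop can be sketched in a few lines. The message format and the word-count token proxy are illustrative assumptions, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;summarize&lt;/code&gt; stands in for whatever summarizer the harness calls.&lt;/p&gt;

```python
def count_tokens(messages):
    # Crude stand-in for a real tokenizer: whitespace word count.
    return sum(len(m["content"].split()) for m in messages)

def maybe_compact(messages, summarize, threshold, keep_recent=4):
    """Harness-driven compaction sketch: when the history exceeds
    `threshold` tokens, replace everything but the last `keep_recent`
    messages with a single summary message. `summarize` is any callable
    taking a list of messages and returning a string."""
    if count_tokens(messages) > threshold:
        old, recent = messages[:-keep_recent], messages[-keep_recent:]
        summary = {"role": "system", "content": summarize(old)}
        return [summary] + recent
    return messages
```

&lt;p&gt;Note the key property: the model never sees this decision being made, which is exactly the limitation discussed below.&lt;/p&gt;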

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;: The model cannot protect information it knows will be needed later — the harness compresses uniformly. Critical details may be lost if they don’t appear “important” to the summarizer at compression time.&lt;/p&gt;

&lt;p&gt;Training can improve harness-driven compaction along two dimensions: training the model to &lt;strong&gt;compress better&lt;/strong&gt; (learned compaction), or training the model to &lt;strong&gt;reason better from compressed state&lt;/strong&gt; (compaction-adapted training). These are complementary.&lt;/p&gt;

&lt;h4 id=&quot;learned-compaction&quot;&gt;Learned Compaction&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;AgentFold&lt;/strong&gt; [4] (SFT): Trains the model to maintain multi-scale state summaries — granular condensation of recent turns and deep consolidation of completed sub-tasks. Achieves 92% context reduction (7K vs 91K tokens after 100 turns). Key insight: delay consolidation until outcomes are clear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MEM1&lt;/strong&gt; [5] (RL with sparse rewards): Trains the model to maintain a compact internal state (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;IS&amp;gt;...&amp;lt;/IS&amp;gt;&lt;/code&gt; tags) via PPO, learning purely from task success. At each turn, the model rewrites its internal state and discards all previous context — achieving constant memory usage regardless of task length. Results: 3.5x better performance than 2x larger models, 3.7x memory reduction. Exhibits emergent behaviors: concurrent question management, adaptive focus shifting, self-verification.&lt;/p&gt;
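&lt;p&gt;Stripped of the RL machinery, the MEM1 interaction pattern is an overwrite-only loop. The helper names and word-count budget below are illustrative, not from the paper:&lt;/p&gt;

```python
def run_constant_memory(rewrite, observations, max_state_words=50):
    """MEM1-style constant-memory loop (sketch): each turn, the model
    rewrites a single bounded internal state from (old state, new
    observation), and the raw observation is then discarded, so context
    stays constant regardless of horizon. `rewrite` stands in for the
    model call; word count is a crude token proxy."""
    state = ""
    for obs in observations:
        state = rewrite(state, obs)       # overwrite, never append
        words = state.split()
        if len(words) > max_state_words:  # enforce the fixed budget
            state = " ".join(words[:max_state_words])
    return state
```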

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;AgentFold&lt;/th&gt;
      &lt;th&gt;MEM1&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Training&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;SFT on synthetic data&lt;/td&gt;
      &lt;td&gt;RL (PPO) with sparse task rewards&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Memory structure&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Multi-scale summaries (granular + deep)&lt;/td&gt;
      &lt;td&gt;Single internal state block&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Memory growth&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Sub-linear (~7K at 100 turns)&lt;/td&gt;
      &lt;td&gt;Constant (~500-1K tokens)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Compression&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;92% vs ReAct&lt;/td&gt;
      &lt;td&gt;73% (27% of baseline)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;That MEM1 succeeds with sparse terminal rewards while more complex context behaviors (see ContextFold [6] below) require dense process rewards reflects a recurring theme in agent RL: the right reward density depends on the complexity of the behavior being shaped.&lt;/p&gt;

&lt;h4 id=&quot;compaction-adapted-training&quot;&gt;Compaction-Adapted Training&lt;/h4&gt;

&lt;p&gt;Rather than training the compression itself, these methods keep the compression mechanism fixed and train the model to &lt;strong&gt;reason effectively from compressed state&lt;/strong&gt;. This directly addresses pressure 1 — deep reasoning exceeding context length.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reasoning Cache (RC)&lt;/strong&gt; [7]: An iterative decoding algorithm that alternates between reasoning and summarization. At each iteration: (1) generate a reasoning trace (budget $H_R$, e.g. 16k tokens) conditioned on the previous summary, (2) summarize into a compact summary (~2k tokens), (3) discard the raw trace. The effective reasoning horizon scales as $T \times H_R$ — at $T=12$ iterations with $H_R=16\text{k}$, the model reasons over 192k tokens while each individual generation stays within 16k.&lt;/p&gt;

&lt;p&gt;The critical design: &lt;strong&gt;only the reasoning step is trained with RL (GRPO), not the summarization&lt;/strong&gt;. The RL objective trains &lt;em&gt;summary-conditioned generation&lt;/em&gt;: given a problem and a summary of prior reasoning, produce reasoning that leads to a correct answer. This exploits a &lt;strong&gt;summarization-generation asymmetry&lt;/strong&gt; — models are already better at reasoning from summaries than generating full solutions from scratch.&lt;/p&gt;
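&lt;p&gt;The RC decoding loop itself is simple enough to sketch. Here &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reason&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;summarize&lt;/code&gt; stand in for budget-limited model calls, and the early stop on a non-empty answer is an assumption for illustration:&lt;/p&gt;

```python
def reasoning_cache_decode(reason, summarize, problem, iterations=12):
    """Reasoning Cache (RC) decoding sketch: alternate bounded reasoning
    with summarization. reason(problem, summary) generates a trace within
    a fixed budget and may emit an answer; summarize(trace) compresses the
    trace into a compact summary; the raw trace is then discarded. The
    effective horizon grows as iterations x budget while each individual
    generation stays bounded."""
    summary, answer = "", None
    for _ in range(iterations):
        trace, answer = reason(problem, summary)  # bounded generation
        summary = summarize(trace)                # compact carry-over
        if answer is not None:                    # stop once an answer emerges
            break
    return answer, summary
```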

&lt;p&gt;Results (RCT-4B, trained at 16k, evaluated at 192k):&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;HMMT 2025&lt;/th&gt;
      &lt;th&gt;AIME 2025&lt;/th&gt;
      &lt;th&gt;IMO-AnswerBench&lt;/th&gt;
      &lt;th&gt;FrontierScience&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Base (16k, standard)&lt;/td&gt;
      &lt;td&gt;39.8&lt;/td&gt;
      &lt;td&gt;46.0&lt;/td&gt;
      &lt;td&gt;40.9&lt;/td&gt;
      &lt;td&gt;23.3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Base + RC (192k, no training)&lt;/td&gt;
      &lt;td&gt;56.7&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
      &lt;td&gt;46.3&lt;/td&gt;
      &lt;td&gt;29.5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;RCT-4B + RC (192k, trained)&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;66.3&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;74.9&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;58.0&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;34.1&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;A 4B model extrapolates from a 16k training budget to 192k at test time — a 12x extrapolation — outperforming specialized thinking models and competing with much larger models. The FrontierScience gains transfer despite training exclusively on math.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free()LM&lt;/strong&gt; [3]: Where RC summarizes reasoning into a new compact representation, Free()LM takes the complementary approach — &lt;strong&gt;surgically pruning redundant spans&lt;/strong&gt; while preserving the remaining tokens exactly. The core metaphor: standard LLMs are “malloc-only” engines that continuously append tokens without any mechanism to discard obsolete information. Free()LM adds the missing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;free()&lt;/code&gt; operation.&lt;/p&gt;

&lt;p&gt;A plug-and-play &lt;strong&gt;Free-Module&lt;/strong&gt; (LoRA adapter) switches the model between two modes:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Reasoning mode&lt;/strong&gt; (LoRA unmerged): Normal reasoning&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Cleaning mode&lt;/strong&gt; (LoRA merged): Scans context, identifies redundant chunks, outputs pruning commands as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[{&quot;prefix&quot;: &quot;...&quot;, &quot;suffix&quot;: &quot;...&quot;}]&lt;/code&gt; anchors defining spans to delete&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cycle triggers every ~5000 tokens. Training uses SFT on filtered synthetic data — rejection sampling retains only pruning operations where accuracy on the pruned context $\geq$ accuracy on the original.&lt;/p&gt;
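&lt;p&gt;Applying the pruning commands amounts to resolving each prefix/suffix anchor pair against the context and deleting the enclosed span. The sketch below assumes the anchors are deleted along with the span and that unresolvable anchors are skipped; the paper may handle these cases differently:&lt;/p&gt;

```python
def apply_free_ops(context, ops):
    """Free()LM-style pruning sketch: each op carries a prefix and suffix
    anchoring a span to delete. Text between and including the anchors is
    removed; everything else is preserved verbatim. Ops whose anchors
    cannot be found are skipped rather than guessed."""
    for op in ops:
        start = context.find(op["prefix"])
        if start == -1:
            continue
        end = context.find(op["suffix"], start)
        if end == -1:
            continue
        context = context[:start] + context[end + len(op["suffix"]):]
    return context
```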

&lt;p&gt;The most striking finding: on HLE tasks requiring &amp;gt;80k thinking tokens, Qwen3-235B drops to &lt;strong&gt;0% accuracy&lt;/strong&gt; (total reasoning collapse), but Free()LM &lt;strong&gt;restores performance to ~50%&lt;/strong&gt; by compressing 100k+ trajectories back to the 40-70k range where the model reasons comfortably. The Free-Module also exhibits cross-model generalization: the 8B module transfers to Qwen3-235B and even DeepSeek-V3.2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CoMem&lt;/strong&gt; [8] (ICLR 2026 Workshop): Decouples memory management from reasoning into &lt;strong&gt;two separate models&lt;/strong&gt; — a small memory model (e.g. Qwen3-4B) that compresses long-term history into summaries, and a larger agent model (e.g. DeepSWE 32B) that reasons over the compressed state + recent raw turns. The memory model runs &lt;strong&gt;asynchronously&lt;/strong&gt;, overlapping compression with the agent’s inference.&lt;/p&gt;

&lt;p&gt;The primary motivation is &lt;strong&gt;latency&lt;/strong&gt;: context length directly determines decoding latency (KV cache loading from HBM). The memory model is trained with GRPO on a &lt;strong&gt;functional equivalence reward&lt;/strong&gt; — the summary should make the frozen agent produce the same actions it would have with full context. This means the memory model may learn to preserve information a generic summarizer would discard (exact file paths, specific error messages) because the agent’s policy depends on those details.&lt;/p&gt;
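&lt;p&gt;The functional equivalence reward reduces to a paired comparison: does the frozen agent act the same from the compressed state as from the full history? A minimal sketch, with a binary match standing in for whatever reward shaping the paper uses:&lt;/p&gt;

```python
def functional_equivalence_reward(agent, task, full_history, summary, recent):
    """CoMem-style functional equivalence reward (sketch): score a summary
    by whether the frozen agent takes the same next action from the
    compressed state (summary plus recent raw turns) as from the full
    history."""
    reference = agent(task, full_history)
    candidate = agent(task, summary + recent)
    return 1.0 if candidate == reference else 0.0
```

&lt;p&gt;Because the reference action depends on details like file paths and error strings, a summary that drops them earns zero reward, which is exactly the pressure that makes the memory model preserve them.&lt;/p&gt;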

&lt;p&gt;Results on SWE-Bench: 1.4-2.1x latency reduction while maintaining competitive performance. On DeepSWE, it slightly &lt;em&gt;exceeds&lt;/em&gt; full-context performance (41.0 vs 40.4) — suggesting compression filters noise that hurts reasoning.&lt;/p&gt;

&lt;h4 id=&quot;rc-vs-comem-training-opposite-sides&quot;&gt;RC vs CoMem: Training Opposite Sides&lt;/h4&gt;

&lt;p&gt;RC and CoMem both use summarization for context management, but they train different sides of the same interface:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;compress: history -&amp;gt; summary    CoMem: trained (GRPO, functional equivalence)
                                RC:    not trained (base model ICL)

reason:   summary -&amp;gt; answer     RC:    trained (GRPO, summary-conditioned generation)
                                CoMem: not trained (frozen agent model)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Both demonstrate that training &lt;strong&gt;one side&lt;/strong&gt; suffices for significant gains. The natural next step — training both sides jointly — remains unexplored.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;RC&lt;/th&gt;
      &lt;th&gt;Free()LM&lt;/th&gt;
      &lt;th&gt;CoMem&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Primary goal&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Reasoning extrapolation&lt;/td&gt;
      &lt;td&gt;Reasoning quality&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Inference latency&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Mechanism&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Summarize, reason from summary&lt;/td&gt;
      &lt;td&gt;Prune redundant spans&lt;/td&gt;
      &lt;td&gt;Summarize (separate model), reason&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;What’s trained&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Reasoning from summaries&lt;/td&gt;
      &lt;td&gt;Redundancy identification&lt;/td&gt;
      &lt;td&gt;Summary production&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Training method&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;RL (GRPO)&lt;/td&gt;
      &lt;td&gt;SFT on filtered operations&lt;/td&gt;
      &lt;td&gt;RL (GRPO, functional equivalence)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Single model, two prompts&lt;/td&gt;
      &lt;td&gt;Single model + LoRA&lt;/td&gt;
      &lt;td&gt;Two separate models&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Extrapolation&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;12x beyond training horizon&lt;/td&gt;
      &lt;td&gt;Indirect&lt;/td&gt;
      &lt;td&gt;Not targeted&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;These are complementary: Free()LM prunes noise first, then RC or CoMem summarizes the cleaned result. CoMem provides the latency reduction that RC and Free()LM lack.&lt;/p&gt;

&lt;h4 id=&quot;context-management-and-test-time-scaling&quot;&gt;Context Management and Test-Time Scaling&lt;/h4&gt;

&lt;p&gt;Without explicit optimization for long-horizon reasoning, models exhibit an &lt;strong&gt;inverse-U scaling curve&lt;/strong&gt;: accuracy improves with more thinking tokens up to a peak, then degrades as the model falls into repetitive, unproductive reasoning. This is the fundamental ceiling on test-time scaling for standard autoregressive decoding.&lt;/p&gt;

&lt;p&gt;Training for context management — whether RC-style [7] compaction adaptation, MEM1-style [5] internal state, or ContextFold-style [6] branching — should push this ceiling higher or eliminate it entirely. The bottleneck is not compute itself but the model’s inability to maintain productive reasoning as context grows. RC provides direct evidence: the inverse-U curve becomes monotonically increasing under RC decoding, with no sign of degradation out to 192k tokens.&lt;/p&gt;

&lt;p&gt;This effect also operates across sessions. &lt;a href=&quot;https://www.databricks.com/blog/memory-scaling-ai-agents&quot;&gt;Databricks’ MemAlign&lt;/a&gt; demonstrates that agent performance scales monotonically with accumulated memory — accuracy rises from near-zero to ~70% as episodic and semantic memories grow, and average reasoning steps drop from ~20 to ~5. Memory converts past test-time compute into stored knowledge, making future sessions more efficient.&lt;/p&gt;

&lt;h3 id=&quot;2-memory-as-a-tool&quot;&gt;2. Memory as a Tool&lt;/h3&gt;

&lt;p&gt;The model gains autonomy by having explicit read/write tools for storing and retrieving information. Instead of the harness deciding when to compress, the model decides what to externalize and when to retrieve it — analogous to a person taking notes.&lt;/p&gt;

&lt;p&gt;Memory storage can take several forms: &lt;strong&gt;in-context state&lt;/strong&gt; (structured state in the model’s own output), &lt;strong&gt;persistent files&lt;/strong&gt; (surviving across sessions, as in Claude Code’s memory system), and &lt;strong&gt;databases/vector stores&lt;/strong&gt; (enabling semantic retrieval over large corpora).&lt;/p&gt;
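&lt;p&gt;At its simplest, the tool surface is a keyed read/write store. A dict-backed sketch of the interface the model would call, with file-system persistence elided:&lt;/p&gt;

```python
class FileMemory:
    """Keyed read/write memory tool (sketch). A dict stands in for a
    directory of files; a real tool would persist to disk so memories
    survive across sessions."""
    def __init__(self):
        self.store = {}

    def write(self, key, content):
        self.store[key] = content
        return f"saved {key}"

    def read(self, key):
        return self.store.get(key, f"no memory for {key}")

    def list_keys(self):
        return sorted(self.store)
```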

&lt;p&gt;Two production systems illustrate the design space:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mem0&lt;/strong&gt; [9] implements a two-phase pipeline for fact-oriented persistent memory. An extraction phase processes message pairs using conversation summary + recent messages. An update phase evaluates extracted facts against existing memories using semantic similarity search, determining operations: ADD, UPDATE, DELETE, or NOOP. The graph-enhanced variant (Mem0&lt;sup&gt;g&lt;/sup&gt;) adds directed labeled graphs for temporal reasoning. Results on LOCOMO: 26% improvement over OpenAI’s memory system, 91% lower p95 latency.&lt;/p&gt;
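&lt;p&gt;A toy version of the update phase makes the operation logic concrete. Jaccard word overlap stands in for embedding similarity, the threshold is arbitrary, and DELETE is omitted because it requires contradiction detection:&lt;/p&gt;

```python
def jaccard(a, b):
    """Word-overlap similarity: a crude stand-in for embedding similarity."""
    wa, wb = set(a.split()), set(b.split())
    union = len(wa.union(wb))
    return len(wa.intersection(wb)) / union if union else 0.0

def update_phase(fact, memories, sim_threshold=0.5):
    """Mem0-style update phase (sketch): compare an extracted fact to
    existing memories and choose an operation."""
    best, best_sim = None, 0.0
    for memory in memories:
        sim = jaccard(fact, memory)
        if sim > best_sim:
            best, best_sim = memory, sim
    if best is None:
        return ("ADD", fact)         # nothing related: store as new
    if fact == best:
        return ("NOOP", best)        # exact duplicate
    if best_sim >= sim_threshold:
        return ("UPDATE", fact)      # close match, new wording: revise
    return ("ADD", fact)             # related but distinct fact
```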

&lt;p&gt;&lt;strong&gt;Agent Workflow Memory (AWM)&lt;/strong&gt; [10] stores &lt;strong&gt;reusable task workflows&lt;/strong&gt; rather than raw facts — abstracted action sequences with variables. After each successful task, it induces a workflow and adds it to memory, creating a snowball effect where simple workflows become building blocks for complex ones. Results on WebArena/Mind2Web: 51.1% relative improvement over state-of-the-art, 40% fewer steps per task.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;Mem0&lt;/th&gt;
      &lt;th&gt;AWM&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;What’s stored&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Facts about the world/user&lt;/td&gt;
      &lt;td&gt;Abstracted action sequences&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;When to store&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Every conversation turn (pipeline)&lt;/td&gt;
      &lt;td&gt;After successful task completion&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Retrieval&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Semantic similarity search&lt;/td&gt;
      &lt;td&gt;Goal-matching against descriptions&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Personalization, factual recall&lt;/td&gt;
      &lt;td&gt;Procedural tasks, skill transfer&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h4 id=&quot;training-the-model-to-use-memory&quot;&gt;Training the Model to Use Memory&lt;/h4&gt;

&lt;p&gt;Mem0 and AWM are inference-only pipelines — memory operations happen around the model, not as model-initiated actions. The harder problem: &lt;strong&gt;training the model to proactively call memory read/write tools&lt;/strong&gt; during its own reasoning.&lt;/p&gt;

&lt;p&gt;The most effective lever is selecting tasks where memory use is &lt;em&gt;required by construction&lt;/em&gt;: multi-session tasks where information from session 1 is needed in session 3, post-compaction retrieval where facts have been evicted from context, or cross-task transfer where stored workflows accelerate new tasks. On these tasks, outcome reward alone naturally reinforces memory use because the model can’t succeed without it.&lt;/p&gt;

&lt;p&gt;The reward design challenge is that memory actions have &lt;strong&gt;delayed and indirect payoff&lt;/strong&gt; — writing costs tokens now but only pays off later. Possible approaches include counterfactual rewards for reads (did retrieval actually help?), future-use rewards for writes (was this memory ever retrieved?), and format/process rewards to bootstrap usage (must be annealed to avoid reward hacking).&lt;/p&gt;
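
&lt;p&gt;To make these shapes concrete, here is a hypothetical sketch of trajectory-level shaping on top of the outcome reward. The bonus weights and the logged fields (used_in_answer, later_retrieved) are invented for illustration:&lt;/p&gt;

```python
# Hypothetical reward-shaping sketch for memory tool use. The outcome
# reward stays dominant; small bonuses credit reads that contributed to
# the answer and writes that were later retrieved. The logged fields
# ("used_in_answer", "later_retrieved") are invented for illustration.

def shaped_reward(traj, outcome, read_bonus=0.1, write_bonus=0.05):
    r = outcome
    for op in traj["memory_ops"]:
        if op["kind"] == "read" and op["used_in_answer"]:
            r += read_bonus    # counterfactual credit: retrieval helped
        if op["kind"] == "write" and op["later_retrieved"]:
            r += write_bonus   # future-use credit: memory was read back
    return r
```

&lt;p&gt;Such process bonuses invite reward hacking (e.g. writing everything to memory), which is why they would need to be annealed toward the pure outcome reward.&lt;/p&gt;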

&lt;p&gt;No published work has systematically studied reward design for training memory tool use. The combination of memory-dependent data with outcome-only reward is the most practical starting point — it sidesteps the credit assignment problem by making the task structure do the work.&lt;/p&gt;

&lt;h3 id=&quot;3-sub-agents&quot;&gt;3. Sub-Agents&lt;/h3&gt;

&lt;p&gt;The most autonomous strategy: the model delegates entire subtasks to sub-agents that run in their own fresh context windows. Each sub-agent explores extensively but returns only a condensed summary. The main agent’s context stays clean and focused on coordination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ContextFold&lt;/strong&gt; [6] demonstrates that sub-agent delegation can be trained end-to-end with RL. It introduces two special actions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;branch(description, prompt)&lt;/strong&gt;: Creates a sub-agent with a scoped task and its own context&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;return(message)&lt;/strong&gt;: Summarizes outcomes and folds intermediate steps out of the main context&lt;/li&gt;
&lt;/ul&gt;
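
&lt;p&gt;A minimal sketch of this control flow, assuming a generic llm callable and tool executor (both stand-ins): the sub-agent’s exploration stays in its own history, and only the return message is folded back:&lt;/p&gt;

```python
# Illustrative branch/return control flow: the sub-agent explores in its
# own fresh history, and only its return(message) summary is folded back
# into the main agent's context. `llm` and `run_tool` are stand-ins.

def run_branch(llm, run_tool, prompt, max_steps=20):
    history = [{"role": "user", "content": prompt}]       # fresh sub-context
    for _ in range(max_steps):
        action = llm(history)
        if action["name"] == "return":
            return action["args"]["message"]              # condensed summary
        history.append({"role": "tool", "content": run_tool(action)})
    return "branch exceeded its step budget"

def fold_branch(main_history, llm, run_tool, description, prompt):
    # branch(description, prompt): delegate, then fold the summary back
    summary = run_branch(llm, run_tool, prompt)
    main_history.append({"role": "tool", "content": description + ": " + summary})
    return main_history
```

&lt;p&gt;However many steps the sub-agent takes, the main context grows by a single message, which is what keeps the coordinator’s context bounded.&lt;/p&gt;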

&lt;p&gt;ContextFold trains with &lt;strong&gt;FoldGRPO&lt;/strong&gt; — GRPO adapted to train on folded (compressed) contexts rather than full histories. This train-test consistency proved critical: FoldGRPO outperformed standard GRPO by +7.7% on BrowseComp-Plus. Training on unfolded contexts while deploying with folded ones is itself a form of distribution mismatch.&lt;/p&gt;

&lt;p&gt;Dense process rewards shape the branching behavior: an &lt;em&gt;Unfolded Token Penalty&lt;/em&gt; discourages bloating the main context, an &lt;em&gt;Out-of-Scope Penalty&lt;/em&gt; (judged by an LLM) keeps branches focused (improving scope adherence from 47.3% to 75.4%), and a &lt;em&gt;Failure Penalty&lt;/em&gt; punishes broken tool calls.&lt;/p&gt;

&lt;p&gt;Results: &lt;strong&gt;62% pass@1 on BrowseComp-Plus&lt;/strong&gt; (+14.2% over baseline) and &lt;strong&gt;58% on SWE-Bench&lt;/strong&gt; using only 32K active context with max 10 branches — matching models using 10x larger contexts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recursive Language Models (RLM)&lt;/strong&gt; [11]: The most expressive form of sub-agent context management. Instead of fixed branch/return operations, the model writes &lt;strong&gt;arbitrary code&lt;/strong&gt; in a Python REPL to decompose, examine, and recursively process its input — including calling &lt;em&gt;itself&lt;/em&gt; on sub-prompts.&lt;/p&gt;

&lt;p&gt;The key design: the input prompt is stored as a &lt;strong&gt;variable in the REPL&lt;/strong&gt;, never loaded into the model’s context window. The model sees only constant-size metadata and writes code to manipulate the prompt programmatically — &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;print(prompt[:100])&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;prompt.split(&quot;Chapter&quot;)&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sub_RLM(&quot;In Chapter 1, find all items...&quot;)&lt;/code&gt;. Each recursive sub-call gets its own fresh context.&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;state &amp;lt;- InitREPL(prompt=P)              # P stored as variable, not in context
state &amp;lt;- AddFunction(state, sub_RLM)     # model can call itself
hist  &amp;lt;- [Metadata(state)]               # only metadata, not P itself
while True:
    code &amp;lt;- LLM(hist)                    # model writes code
    state, stdout &amp;lt;- REPL(state, code)   # execute in REPL
    hist &amp;lt;- hist || code || Metadata(stdout)
    if state[Final] is set: return state[Final]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Three design choices distinguish RLMs:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Symbolic handle&lt;/strong&gt;: The prompt is manipulated by reference (code), not by value (reading it into context). Context stays constant-size regardless of input length.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Code output&lt;/strong&gt;: The model generates programs that &lt;em&gt;produce&lt;/em&gt; answers, not the answers themselves.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Symbolic recursion&lt;/strong&gt;: Code can invoke &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sub_RLM()&lt;/code&gt; inside loops — enabling $\Omega(|P|)$ or $\Omega(|P|^2)$ processing over the input.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Results: RLMs scale to &lt;strong&gt;10M+ tokens&lt;/strong&gt; and outperform base models by up to 2x. On OOLONG-Pairs, a task whose processing cost grows quadratically with input length, GPT-5 and Qwen3-Coder both score ~0% while RLM achieves 58% — the task is structurally impossible without recursive decomposition.&lt;/p&gt;

&lt;p&gt;In our memory hierarchy, RLM operates at the &lt;strong&gt;file level&lt;/strong&gt;: the model reads from and writes to REPL variables (analogous to disk), never loading the full data into context (RAM). It effectively writes its own page fault handler.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;ContextFold&lt;/th&gt;
      &lt;th&gt;RLM&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Operations&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Fixed: branch/return&lt;/td&gt;
      &lt;td&gt;Unbounded: arbitrary code&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Scaling&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Tree-structured (max 10 branches)&lt;/td&gt;
      &lt;td&gt;Arbitrary recursion depth&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Training&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;RL with process rewards (FoldGRPO)&lt;/td&gt;
      &lt;td&gt;SFT on 1K trajectories&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Multi-part tasks within context window&lt;/td&gt;
      &lt;td&gt;Inputs far exceeding context window&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;These are complementary: ContextFold manages the model’s &lt;em&gt;own reasoning context&lt;/em&gt; during multi-part tasks. RLM manages &lt;em&gt;external input&lt;/em&gt; that doesn’t fit in context at all. A system could use RLM to process a 10M-token corpus and ContextFold to manage the reasoning context while doing so.&lt;/p&gt;

&lt;h2 id=&quot;comparing-across-the-spectrum&quot;&gt;Comparing Across the Spectrum&lt;/h2&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Aspect&lt;/th&gt;
      &lt;th&gt;Harness Compaction&lt;/th&gt;
      &lt;th&gt;Memory as Tool&lt;/th&gt;
      &lt;th&gt;Sub-Agents&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Model autonomy&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;Decides what/when to store&lt;/td&gt;
      &lt;td&gt;Delegates entire subtasks&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Context growth&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Bounded by compression ratio&lt;/td&gt;
      &lt;td&gt;Bounded by retrieval budget&lt;/td&gt;
      &lt;td&gt;Bounded by summary size (ContextFold) or constant (RLM)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Information loss risk&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Harness may discard critical info&lt;/td&gt;
      &lt;td&gt;Model may forget to save&lt;/td&gt;
      &lt;td&gt;Summary may omit key details&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Training approach&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;SFT (AgentFold), RL (MEM1), or compaction-adapted RL (RC)&lt;/td&gt;
      &lt;td&gt;Mostly inference-only pipelines&lt;/td&gt;
      &lt;td&gt;RL with process rewards (ContextFold) or SFT (RLM)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Deep reasoning (RC, Free()LM), simple multi-turn&lt;/td&gt;
      &lt;td&gt;Personalization, skill accumulation&lt;/td&gt;
      &lt;td&gt;Complex multi-part tasks (ContextFold), inputs exceeding context (RLM)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., &amp;amp; Gonzalez, J. E. (2023). &lt;em&gt;MemGPT: Towards LLMs as Operating Systems.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2310.08560&quot;&gt;arXiv:2310.08560&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[2] Mei, K., Li, Z., Xu, S., Ye, R., Ge, Y., &amp;amp; Zhang, Y. (2024). &lt;em&gt;AIOS: LLM Agent Operating System.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2403.16971&quot;&gt;arXiv:2403.16971&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[3] Zheng, Y., Ma, D., Liang, T., Xu, J., Huang, X., Chen, L., Mi, H., &amp;amp; Wang, Y. (2026). &lt;em&gt;Free(): Learning to Forget in Malloc-Only Reasoning Models.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2602.08030&quot;&gt;arXiv:2602.08030&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[4] Ye, R., et al. (2025). &lt;em&gt;AgentFold: Long-Horizon Web Agents with Proactive Context Management.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2510.24699&quot;&gt;arXiv:2510.24699&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[5] Zhou, Z., et al. (2025). &lt;em&gt;MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2506.15841&quot;&gt;arXiv:2506.15841&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[6] Sun, W., Lu, M., Ling, Z., Liu, K., Yao, X., Yang, Y., &amp;amp; Chen, J. (2025). &lt;em&gt;Scaling Long-Horizon LLM Agent via Context-Folding.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2510.11967&quot;&gt;arXiv:2510.11967&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[7] Wu, I., Qu, Y., Setlur, A., &amp;amp; Kumar, A. (2026). &lt;em&gt;Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2602.03773&quot;&gt;arXiv:2602.03773&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[8] Zhang, Y., et al. (2026). &lt;em&gt;CoMem: Context Management with A Decoupled Long-Context Model.&lt;/em&gt; ICLR 2026 Workshop. &lt;a href=&quot;https://openreview.net/forum?id=tc9GAKlxQC&quot;&gt;OpenReview&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[9] Chhikara, P., et al. (2025). &lt;em&gt;Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2504.19413&quot;&gt;arXiv:2504.19413&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[10] Wang, Z. Z., Mao, J., Fried, D., &amp;amp; Neubig, G. (2025). &lt;em&gt;Agent Workflow Memory.&lt;/em&gt; ICML 2025. &lt;a href=&quot;https://github.com/zorazrw/agent-workflow-memory&quot;&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[11] Zhang, A. L., Kraska, T., &amp;amp; Khattab, O. (2026). &lt;em&gt;Recursive Language Models.&lt;/em&gt; &lt;a href=&quot;https://arxiv.org/abs/2512.24601&quot;&gt;arXiv:2512.24601&lt;/a&gt;.&lt;/p&gt;

</description>
        <pubDate>Sat, 18 Apr 2026 00:00:00 -0500</pubDate>
        <link>https://zoeyli.com/reinforcement%20learning/Learning-Context-Management-RL/</link>
        <guid isPermaLink="true">https://zoeyli.com/reinforcement%20learning/Learning-Context-Management-RL/</guid>
      </item>
    
      <item>
        <title>Off-Policy Corrections in LLM RL Training</title>
        <description>&lt;blockquote&gt;
  &lt;p&gt;A unified view of the five sources of distribution mismatch in LLM reinforcement learning and their corrections.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;the-unified-problem&quot;&gt;The Unified Problem&lt;/h2&gt;

&lt;p&gt;Every off-policy issue in LLM RL reduces to the same fundamental problem:&lt;/p&gt;

&lt;div class=&quot;notice--info&quot;&gt;
  &lt;p&gt;\(\text{We sample from } \pi_{\text{actual}} \text{ but optimize as if samples came from } \pi_{\text{assumed}}\)&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;When $\pi_{\text{actual}} \neq \pi_{\text{assumed}}$, the policy gradient becomes biased. The general correction is importance sampling:&lt;/p&gt;

\[\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_{\text{actual}}} \left[ \frac{\pi_{\text{assumed}}(y)}{\pi_{\text{actual}}(y)} \cdot f(y) \cdot \nabla_\theta \log \pi_\theta(y) \right]\]

&lt;p&gt;&lt;strong&gt;Theoretical grounding&lt;/strong&gt; [1]: The token-level optimization objective used by REINFORCE, GRPO, and related algorithms is a &lt;strong&gt;first-order approximation&lt;/strong&gt; to the true sequence-level reward objective. This approximation is valid only when each token’s IS ratio $\delta_t = \frac{\pi_\theta(y_t)}{\mu_{\theta_{old}}(y_t)} - 1$ is small, so that second-order terms ($\delta_i \delta_j$) can be neglected. The token-level IS weight is therefore &lt;strong&gt;inherent&lt;/strong&gt; to the approximation — not an optional correction bolted on. Removing it invalidates the surrogate objective entirely.&lt;/p&gt;

&lt;p&gt;In LLM RL, five distinct sources create this mismatch, each with different causes, magnitudes, and corrections:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;#&lt;/th&gt;
      &lt;th&gt;Source&lt;/th&gt;
      &lt;th&gt;$\pi_{\text{actual}}$&lt;/th&gt;
      &lt;th&gt;$\pi_{\text{assumed}}$&lt;/th&gt;
      &lt;th&gt;Magnitude&lt;/th&gt;
      &lt;th&gt;When it arises&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;1&lt;/td&gt;
      &lt;td&gt;Multi-epoch policy drift&lt;/td&gt;
      &lt;td&gt;$\pi_{\theta_{old}}$ (start of epoch)&lt;/td&gt;
      &lt;td&gt;$\pi_\theta$ (current params)&lt;/td&gt;
      &lt;td&gt;Small per epoch&lt;/td&gt;
      &lt;td&gt;PPO/GRPO multi-epoch training&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;2&lt;/td&gt;
      &lt;td&gt;Backend mismatch&lt;/td&gt;
      &lt;td&gt;$\pi_{\text{sampler}}(\theta)$ (vLLM/SGLang)&lt;/td&gt;
      &lt;td&gt;$\pi_{\text{learner}}(\theta)$ (FSDP/Megatron)&lt;/td&gt;
      &lt;td&gt;Small but systematic under fp16&lt;/td&gt;
      &lt;td&gt;Different engines for rollout vs training&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;3&lt;/td&gt;
      &lt;td&gt;Async staleness&lt;/td&gt;
      &lt;td&gt;$\pi_{\theta_k}$ ($k$ steps old)&lt;/td&gt;
      &lt;td&gt;$\pi_\theta$ (current)&lt;/td&gt;
      &lt;td&gt;Can be large&lt;/td&gt;
      &lt;td&gt;Async RL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;4&lt;/td&gt;
      &lt;td&gt;MoE routing&lt;/td&gt;
      &lt;td&gt;$\pi_\theta$ with route_old&lt;/td&gt;
      &lt;td&gt;$\pi_\theta$ with route_new&lt;/td&gt;
      &lt;td&gt;Variable, depending on model routing stability&lt;/td&gt;
      &lt;td&gt;MoE architectures after gradient updates&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;Tool-call trajectories&lt;/td&gt;
      &lt;td&gt;$\pi_\theta \times P_{\text{env}}$ (joint)&lt;/td&gt;
      &lt;td&gt;$\pi_\theta$ (LM only)&lt;/td&gt;
      &lt;td&gt;Variable, can be large&lt;/td&gt;
      &lt;td&gt;Agentic RL with tool use&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Sources 1–3 are independent and their corrections compose cleanly. Source 4 (MoE routing) is qualitatively different — it’s a discrete structural change that &lt;strong&gt;amplifies&lt;/strong&gt; Sources 2 and 3 rather than being fully independent. Source 5 (tool-call trajectories) is also qualitatively different — the mismatch comes from non-policy tokens in the conditioning context rather than from policy weight differences.&lt;/p&gt;

&lt;h2 id=&quot;the-is-aggregation-problem-why-no-practical-method-is-exact&quot;&gt;The IS Aggregation Problem: Why No Practical Method Is Exact&lt;/h2&gt;

&lt;p&gt;Before diving into individual sources, it’s worth understanding a fundamental limitation that underlies all of them.&lt;/p&gt;

&lt;p&gt;For a sequence-level reward $R(x,y)$, the &lt;strong&gt;exact&lt;/strong&gt; IS gradient requires the full product of per-token ratios:&lt;/p&gt;

\[\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \mu_{\theta_{old}}}\left[\prod_{t=1}^{\lvert y \rvert} \frac{\pi_\theta(y_t \mid y_{&amp;lt;t})}{\mu_{\theta_{old}}(y_t \mid y_{&amp;lt;t})} \cdot R(x,y) \cdot \sum_{t} \nabla_\theta \log \pi_\theta(y_t \mid y_{&amp;lt;t})\right]\]

&lt;p&gt;The per-token ratio $\frac{\pi_\theta(y_t \mid y_{&amp;lt;t})}{\mu_{\theta_{old}}(y_t \mid y_{&amp;lt;t})}$ is the correct building block. But the &lt;strong&gt;product&lt;/strong&gt; of these ratios over a full sequence is intractable — for $\lvert y \rvert = 1000$ tokens, even with each ratio in $[0.99, 1.01]$, the product can swing from $0.99^{1000} \approx 4 \times 10^{-5}$ to $1.01^{1000} \approx 2 \times 10^{4}$. Variance grows exponentially with sequence length, making gradient estimates useless.&lt;/p&gt;
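
&lt;p&gt;This is easy to check numerically. The sketch below draws 1000 per-token ratios from $[0.99, 1.01]$ and compares the exact product against the first-order (GRPO-style) and geometric-mean (GSPO-style) aggregations:&lt;/p&gt;

```python
import math, random

# Numerical check: the product of 1000 per-token IS ratios, each drawn
# from [0.99, 1.01], swings across orders of magnitude, while the
# first-order sum and the geometric mean stay pinned near 1.

random.seed(0)
T = 1000
ratios = [random.uniform(0.99, 1.01) for _ in range(T)]

full_product = math.prod(ratios)                  # exact sequence-level IS
first_order = 1 + sum(r - 1 for r in ratios)      # GRPO-style token sum
geo_mean = full_product ** (1 / T)                # GSPO-style weight

print(full_product, first_order, geo_mean)
print(0.99 ** T, 1.01 ** T)   # boundary cases: roughly 4e-5 and 2e4
```

&lt;p&gt;With all ratios at the boundaries the product spans roughly nine orders of magnitude, while the geometric mean stays inside $[0.99, 1.01]$ by construction.&lt;/p&gt;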

&lt;p&gt;Every practical algorithm approximates this product differently:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Approach&lt;/th&gt;
      &lt;th&gt;IS weight used&lt;/th&gt;
      &lt;th&gt;Relation to exact $\prod_t r_t$&lt;/th&gt;
      &lt;th&gt;Tradeoff&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Exact&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;$\prod_t r_t$ (full product)&lt;/td&gt;
      &lt;td&gt;Correct&lt;/td&gt;
      &lt;td&gt;Intractable variance&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;GRPO&lt;/strong&gt; (token-level)&lt;/td&gt;
      &lt;td&gt;Per-token $r_t$ in PPO surrogate (gradients flow through)&lt;/td&gt;
      &lt;td&gt;Same first-order token-level approx: $\prod(1+\delta_t) \approx 1 + \sum \delta_t$&lt;/td&gt;
      &lt;td&gt;Low bias when ratios $\approx$ 1; breaks down far from on-policy&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;CISPO&lt;/strong&gt; (token-level, detached)&lt;/td&gt;
      &lt;td&gt;Per-token $\text{clip}(r_t)$ as detached weight on log-prob&lt;/td&gt;
      &lt;td&gt;Same first-order token-level approx, but ratio is clipped and stop-gradient&lt;/td&gt;
      &lt;td&gt;Preserves all tokens; slight bias from weight clipping&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;GSPO&lt;/strong&gt; (sequence geometric mean)&lt;/td&gt;
      &lt;td&gt;$(\prod_t r_t)^{1/\vert y\vert}$ applied uniformly&lt;/td&gt;
      &lt;td&gt;$\vert y\vert$-th root of exact product — a heuristic, not a principled approximation&lt;/td&gt;
      &lt;td&gt;Low variance; unclear what bias it introduces&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;The fundamental tension&lt;/strong&gt;: sequence-level rewards demand sequence-level IS, but sequence-level IS has exponential variance in autoregressive models. There is no free lunch — every method trades bias for variance differently. The practical question is which approximation degrades most gracefully as the five mismatch sources push ratios away from 1.&lt;/p&gt;
&lt;/div&gt;

&lt;h2 id=&quot;source-1-multi-epoch-policy-drift&quot;&gt;Source 1: Multi-Epoch Policy Drift&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: PPO and GRPO reuse the same batch of rollouts for multiple gradient steps (epochs). After the first update, the policy $\pi_\theta$ has drifted from the sampling policy $\pi_{\theta_{old}}$.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correction&lt;/strong&gt;: The standard IS ratio with clipping, built into the algorithm:&lt;/p&gt;

\[r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}, \quad L = \min\left(r_t A_t, \text{clip}(r_t, 1-\epsilon, 1+\epsilon) A_t\right)\]

&lt;p&gt;This is the most well-studied mismatch and is already handled by the algorithm’s own clipping mechanism.&lt;/p&gt;
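
&lt;p&gt;Per token, the correction is a few lines. A generic scalar sketch in plain Python (real implementations operate on batched log-prob tensors):&lt;/p&gt;

```python
import math

# Per-token PPO clipped surrogate computed from cached old log-probs.
# A scalar sketch; real implementations operate on tensors. The min()
# makes the objective a pessimistic bound once the ratio leaves the
# trust region.

def ppo_term(logp_new, logp_old, advantage, eps=0.2):
    r = math.exp(logp_new - logp_old)           # importance ratio r_t
    clipped = max(1 - eps, min(r, 1 + eps))     # clip(r_t, 1-eps, 1+eps)
    return min(r * advantage, clipped * advantage)
```

&lt;p&gt;At the first epoch the ratio is exactly 1 and the clip is inactive; drift across later epochs is what the clipping bounds.&lt;/p&gt;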

&lt;p&gt;&lt;strong&gt;Interaction with KL&lt;/strong&gt;: When using “KL in reward” (combined form), PPO’s clipping automatically provides IS correction for the KL term. When using “KL as loss” (decoupled form), the KL term needs its own explicit IS correction across epochs — applying the same importance ratio $\rho_k(\theta) = \pi_\theta / \pi_{\theta_k}$ to the KL loss, with clipping: $\min(\rho_k \cdot k_n',\; \text{clip}(\rho_k, 1-\epsilon, 1+\epsilon) \cdot k_n')$. See [2] for the full analysis.&lt;/p&gt;

&lt;h2 id=&quot;source-2-backend-mismatch-tis&quot;&gt;Source 2: Backend Mismatch (TIS)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: In disaggregated RL systems, the rollout engine (vLLM/SGLang) and training engine (FSDP/Megatron) use different backends. Even with &lt;strong&gt;identical weights&lt;/strong&gt;, these backends produce different log-probabilities due to:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Different attention kernels (e.g., FlashAttention variants)&lt;/li&gt;
  &lt;li&gt;Different operator fusion patterns&lt;/li&gt;
  &lt;li&gt;Quantization differences (FP8/INT8 rollout vs BF16 training)&lt;/li&gt;
  &lt;li&gt;Numerical precision handling (accumulated floating-point differences)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;MoE amplifies these differences.&lt;/strong&gt; In dense models, kernel-level numerical differences accumulate approximately linearly across layers. In MoE models, they cascade through &lt;strong&gt;routing flips&lt;/strong&gt;: a tiny hidden-state difference from a fused kernel can change which experts are selected at the next MoE layer, producing a discretely different computation that further diverges in subsequent layers.&lt;/p&gt;

&lt;p&gt;A &lt;a href=&quot;https://fireworks.ai/blog/when-faster-not-identical-moe-numerics&quot;&gt;detailed analysis by Fireworks AI&lt;/a&gt; measured this effect on Kimi K2.5 (58 MoE layers): disabling all kernel fusions reduced their divergence metric to the noise floor (k3 = 0.00007), while enabling MoE-specific fusions (expert finalize + all-reduce + RMSNorm) increased it by 2.7×. Even the casting order in the expert combination step matters — using bf16 for both scores and accumulation (as in DeepEP) vs float32 scores with a single final cast produces measurably different log-probabilities. These are not bugs; they are inherent consequences of non-associative floating-point arithmetic under different kernel implementations, made qualitatively worse by MoE’s discrete routing decisions acting as amplifiers.&lt;/p&gt;

&lt;p&gt;This creates an unintentional off-policy gap: $\pi_{\text{sampler}}(\theta) \neq \pi_{\text{learner}}(\theta)$.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it arises&lt;/strong&gt;: Whenever rollout and training use different software backends — which is the common case in both colocated and disaggregated setups. Note that “colocated” means the same GPUs are time-shared between rollout and training phases, but most colocated frameworks still use different engines (e.g., vLLM for inference, FSDP for training) on those GPUs. Backend mismatch is eliminated only when both phases use the &lt;em&gt;same engine&lt;/em&gt; (e.g., the training framework’s own &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;generate()&lt;/code&gt;), which sacrifices inference throughput significantly and is rare in practice.&lt;/p&gt;

&lt;h3 id=&quot;three-log-probabilities&quot;&gt;Three Log-Probabilities&lt;/h3&gt;

&lt;p&gt;When backend mismatch exists, we need &lt;strong&gt;three&lt;/strong&gt; distinct log-probabilities:&lt;/p&gt;

&lt;pre class=&quot;mermaid&quot;&gt;
flowchart LR
    subgraph rollout[&quot;Rollout Phase&quot;]
        direction TB
        P[&quot;Prompt&quot;] --&amp;gt; SAM[&quot;Sampler (vLLM)&quot;]
        SAM --&amp;gt; LSAM[&quot;① log π_sam(θ_old)&quot;]
    end

    subgraph training[&quot;Training Phase — Learner (FSDP)&quot;]
        direction TB
        FP1[&quot;③ FP#1 (θ, grads) → log π_learn(θ)&quot;]
        FP2[&quot;② FP#2 (θ_old, no grad) → log π_learn(θ_old)&quot;]
    end

    LSAM -- &quot;cached&quot; --&amp;gt; training
&lt;/pre&gt;

&lt;ol&gt;
  &lt;li&gt;$\log \pi_{\text{sampler}}(a, \theta_{\mathrm{old}})$ — from rollout backend at time of sampling (&lt;strong&gt;cached&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;$\log \pi_{\text{learner}}(a, \theta_{\mathrm{old}})$ — from training backend with $\theta_{\mathrm{old}}$ weights (&lt;strong&gt;FP#2, detached&lt;/strong&gt;)&lt;/li&gt;
  &lt;li&gt;$\log \pi_{\text{learner}}(a, \theta)$ — from training backend with current weights (&lt;strong&gt;FP#1, with gradients&lt;/strong&gt;)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;correction-approach-1-resampling&quot;&gt;Correction Approach 1: Resampling&lt;/h3&gt;

&lt;p&gt;Recompute log-probs using the training backend, discarding the sampler’s log-probs entirely:&lt;/p&gt;

\[\frac{\pi_{\text{learner}}(a, \theta)}{\pi_{\text{learner}}(a, \theta_{\mathrm{old}})} \quad \text{(both from same backend)}\]

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Pro&lt;/strong&gt;: Same-backend ratio; no mismatch in the PPO ratio itself&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Con&lt;/strong&gt;: Requires extra forward pass (FP#2); the expectation is still taken over $\pi_{\text{sampler}}$, so the sampling distribution itself remains mismatched&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Used by&lt;/strong&gt;: VeRL (built-in implementation)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;correction-approach-2-truncated-importance-sampling-tis&quot;&gt;Correction Approach 2: Truncated Importance Sampling (TIS)&lt;/h3&gt;

&lt;p&gt;Explicitly correct for the backend gap with a truncated importance ratio:&lt;/p&gt;

\[r_{\text{TIS}} = \min\left(\frac{\pi_{\text{learner}}(a, \theta_{\mathrm{old}})}{\pi_{\text{sampler}}(a, \theta_{\mathrm{old}})}, C\right)\]

&lt;p&gt;where $C$ is a loose cap (typically 10–100, much larger than PPO’s $1+\epsilon$). The full loss becomes:&lt;/p&gt;

\[\mathcal{L} = -\frac{1}{B \times G} \sum_{i} r_{\text{TIS}} \cdot \min\left(r_{\text{PPO}} \cdot A_i, \text{clip}(r_{\text{PPO}}, 1-\epsilon, 1+\epsilon) \cdot A_i\right)\]

&lt;p&gt;with two separate ratios serving distinct roles:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;$r_{\text{TIS}}$ corrects for &lt;strong&gt;backend mismatch&lt;/strong&gt; (should be $\approx 1.0$ if backends are close)&lt;/li&gt;
  &lt;li&gt;$r_{\text{PPO}} = \frac{\pi_{\text{learner}}(a, \theta)}{\pi_{\text{learner}}(a, \theta_{\mathrm{old}})}$ handles &lt;strong&gt;policy drift&lt;/strong&gt; (same backend, no mismatch)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gradient flows through $r_{\text{PPO}}$ but $r_{\text{TIS}}$ is detached (no gradient through old policy).&lt;/p&gt;
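
&lt;p&gt;Combining the two ratios per token looks like the sketch below; the detach is implicit here because the weights are plain scalars, whereas tensor code would apply an explicit stop-gradient to $r_{\text{TIS}}$:&lt;/p&gt;

```python
import math

# Sketch of the TIS-corrected per-token term: r_tis (a detached scalar
# here; tensor code would stop-gradient it) corrects the backend gap,
# while r_ppo carries the gradient and handles policy drift within the
# learner backend.

def tis_ppo_term(logp_sampler_old, logp_learner_old, logp_learner_new,
                 advantage, eps=0.2, cap=10.0):
    r_tis = min(math.exp(logp_learner_old - logp_sampler_old), cap)
    r_ppo = math.exp(logp_learner_new - logp_learner_old)
    clipped = max(1 - eps, min(r_ppo, 1 + eps))
    return r_tis * min(r_ppo * advantage, clipped * advantage)
```

&lt;p&gt;When the backends agree, $r_{\text{TIS}} = 1$ and the term reduces to the standard PPO surrogate.&lt;/p&gt;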

&lt;h3 id=&quot;why-not-combine-into-a-single-ratio&quot;&gt;Why Not Combine Into a Single Ratio?&lt;/h3&gt;

&lt;p&gt;A natural idea: skip the separation and use $\frac{\pi_{\text{learner}}(a, \theta)}{\pi_{\text{sampler}}(a, \theta_{\mathrm{old}})}$ directly as the PPO ratio. This doesn’t work well in practice.&lt;/p&gt;

&lt;p&gt;From the Flash-RL team [3]:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Even when $\theta = \theta_{\mathrm{old}}$, the probability ratio $\frac{\pi_{\mathrm{learner}}(a, \theta)}{\pi_{\mathrm{sampler}}(a, \theta_{\mathrm{old}})}$ is already not equal to 1 due to the mismatch — this makes the clipping happen with high possibility and the training much less informative. Furthermore, in our TIS method, we separately clip $\frac{\pi_{\mathrm{learner}}(a, \theta_{\mathrm{old}})}{\pi_{\mathrm{sampler}}(a, \theta_{\mathrm{old}})}$ and $\frac{\pi_{\mathrm{learner}}(a, \theta)}{\pi_{\mathrm{learner}}(a, \theta_{\mathrm{old}})}$ and thus much more mild; notice $\frac{\pi_{\mathrm{learner}}(a, \theta)}{\pi_{\mathrm{learner}}(a, \theta_{\mathrm{old}})}$ equals to 1 when $\theta = \theta_{\mathrm{old}}$ which is suitable for the trust region constraint.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;The key insight&lt;/strong&gt;: PPO’s trust region is designed around the assumption that the ratio starts at 1.0. Backend mismatch violates this assumption, causing excessive clipping and uninformative gradients. Separating the two ratios preserves the trust region semantics.&lt;/p&gt;
&lt;/div&gt;

&lt;h3 id=&quot;aside-sequence-level-tolerance&quot;&gt;Aside: Sequence-Level Tolerance&lt;/h3&gt;

&lt;p&gt;The GSPO paper [4] claims that sequence-level likelihoods are more tolerant of backend precision differences, since small per-token numerical errors may cancel out when aggregated via the geometric mean. If true, this would let GSPO skip TIS entirely — reducing the training node to a single forward pass. However, the geometric mean is not a principled approximation to the exact sequence-level IS product (it’s the $\vert y\vert $-th root, a fundamentally different quantity). We mention it for completeness but would not rely on it as a general-purpose correction strategy.&lt;/p&gt;

&lt;h2 id=&quot;source-3-async-staleness&quot;&gt;Source 3: Async Staleness&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: In asynchronous RL systems, the rollout engine generates data using policy weights from $k$ training steps ago ($\pi_{\theta_k}$), while the training engine optimizes the current policy ($\pi_\theta$). The staleness $k$ depends on system throughput and scheduling.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2026-03-01-Off-Policy-Corrections-LLM-RL/async-staleness-timeline.drawio.png&quot; alt=&quot;Async staleness timeline showing rollout and training engines operating concurrently, with Batch B generated under θ₀ but trained after the θ₀→θ₁ update (staleness k=1).&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it arises&lt;/strong&gt;: Any system where rollout and training overlap in time:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Fully async systems (e.g., AReaL [5], Slime async mode)&lt;/li&gt;
  &lt;li&gt;Double-buffered / pipelined systems (generating batch N+1 while training on batch N)&lt;/li&gt;
  &lt;li&gt;Systems with slow weight synchronization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why it’s worse than multi-epoch drift&lt;/strong&gt;: Multi-epoch drift is bounded (typically 1–4 epochs, each a small gradient step). Async staleness is unbounded without explicit control — the policy can drift arbitrarily far between when data was generated and when it’s consumed for training.&lt;/p&gt;

&lt;h3 id=&quot;prevention-staleness-rate-limiting&quot;&gt;Prevention: Staleness Rate-Limiting&lt;/h3&gt;

&lt;p&gt;Rather than correcting stale data after the fact, &lt;strong&gt;bound how stale data can get&lt;/strong&gt;:&lt;/p&gt;

\[\lfloor (N_r - 1) / B \rfloor \leq i + \eta\]

&lt;p&gt;Where:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;$N_r$ = total generated trajectories so far&lt;/li&gt;
  &lt;li&gt;$B$ = training batch size&lt;/li&gt;
  &lt;li&gt;$i$ = current policy version (increments each training step)&lt;/li&gt;
  &lt;li&gt;$\eta$ = maximum staleness budget&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rollout controller blocks new generation requests when this bound is violated, unblocking when training completes a step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended values&lt;/strong&gt; (from AReaL): $\eta = 4$ for coding tasks, $\eta = 8$ for math tasks. Tighter bounds mean less staleness but more pipeline bubbles; looser bounds improve throughput but degrade convergence.&lt;/p&gt;
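&lt;p&gt;A minimal sketch of the admission check, assuming a controller that tracks the total trajectory count and the policy version (the function name and signature are illustrative, not AReaL&#8217;s API):&lt;/p&gt;

```python
def generation_allowed(num_generated: int, batch_size: int,
                       policy_version: int, max_staleness: int) -> bool:
    """AReaL-style staleness gate: block new rollout requests once the
    data generated so far would exceed the staleness budget eta.
    Implements floor((N_r - 1) / B) <= i + eta from the text."""
    return (num_generated - 1) // batch_size <= policy_version + max_staleness
```

&lt;p&gt;With $B = 256$, $\eta = 4$, and policy version $i = 0$, generation stays unblocked up to roughly $(\eta + 1) \cdot B = 1280$ trajectories, after which the controller waits for training to advance the version.&lt;/p&gt;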

&lt;h3 id=&quot;correction-decoupled-ppo&quot;&gt;Correction: Decoupled PPO&lt;/h3&gt;

&lt;p&gt;When data is stale, standard PPO’s trust region is misplaced — it clips around $\pi_{\theta_k}$ (the stale behavior policy) rather than the current policy. AReaL’s &lt;strong&gt;decoupled PPO&lt;/strong&gt; separates two roles:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Behavior policy&lt;/strong&gt; $\pi_{\text{behav}}$: the (possibly stale) policy that generated the data&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Proximal policy&lt;/strong&gt; $\pi_{\text{prox}}$: current policy snapshot, used as the trust region center&lt;/li&gt;
&lt;/ul&gt;

\[J(\theta) = \mathbb{E}\left[ \frac{\pi_{\text{prox}}}{\pi_{\text{behav}}} \cdot \min\left(u_t^{\text{prox}}(\theta) \cdot \hat{A}_t, \text{clip}(u_t^{\text{prox}}(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t\right)\right]\]

&lt;p&gt;where $u_t^{\text{prox}}(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{prox}}(a_t \mid s_t)}$.&lt;/p&gt;

&lt;p&gt;The importance weight $\pi_{\text{prox}} / \pi_{\text{behav}}$ corrects for distribution shift. The clip operates around $\pi_{\text{prox}}$ (a recent, high-quality policy) rather than $\pi_{\text{behav}}$ (potentially stale).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation&lt;/strong&gt;: Store &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_prob_behav&lt;/code&gt; during generation, tagged with the policy &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;weight_version&lt;/code&gt;. Before training, compute &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log_prob_prox&lt;/code&gt; via a forward pass with the current policy snapshot.&lt;/p&gt;
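&lt;p&gt;The objective above can be sketched in NumPy (illustrative, not AReaL&#8217;s implementation; in an autograd framework the importance weight would be detached so no gradient flows through it):&lt;/p&gt;

```python
import numpy as np

def decoupled_ppo_loss(log_prob_cur, log_prob_prox, log_prob_behav,
                       advantages, eps=0.2):
    """Per-token decoupled PPO objective, negated for minimization.
    log_prob_behav: stored at generation time (possibly stale).
    log_prob_prox:  recomputed with the current policy snapshot.
    log_prob_cur:   the policy being optimized."""
    # Staleness correction pi_prox / pi_behav (a fixed weight).
    is_weight = np.exp(log_prob_prox - log_prob_behav)
    # Trust-region ratio centered on the proximal policy, not the behavior policy.
    u_prox = np.exp(log_prob_cur - log_prob_prox)
    clipped = np.clip(u_prox, 1.0 - eps, 1.0 + eps)
    objective = is_weight * np.minimum(u_prox * advantages,
                                       clipped * advantages)
    return -objective.mean()
```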

&lt;h3 id=&quot;heuristic-corrections&quot;&gt;Heuristic Corrections&lt;/h3&gt;

&lt;p&gt;Two simpler (but less principled) alternatives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Off-Policy Sequence Masking (OPSM)&lt;/strong&gt;: Discard entire sequences whose importance ratio $\pi_\theta / \pi_{\theta_k}$ exceeds a threshold. Simple to implement, but it discards all gradient signal from the masked sequences.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TIS for staleness&lt;/strong&gt;: Apply truncated importance sampling (same mechanism as Source 2) to the staleness ratio. Clips extreme ratios but doesn’t relocate the trust region.&lt;/p&gt;
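&lt;p&gt;OPSM fits in a few lines (the threshold value and the one-sided test are illustrative choices):&lt;/p&gt;

```python
import numpy as np

def opsm_keep_mask(log_prob_cur, log_prob_behav, threshold=2.0):
    """Off-Policy Sequence Masking: keep only sequences whose
    sequence-level importance ratio is below a threshold.
    log-prob arrays have shape (num_seqs, seq_len); returns a
    boolean keep-mask of shape (num_seqs,)."""
    seq_ratio = np.exp((log_prob_cur - log_prob_behav).sum(axis=1))
    return seq_ratio <= threshold
```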

&lt;h3 id=&quot;why-both-prevention-and-correction-are-needed&quot;&gt;Why Both Prevention and Correction Are Needed&lt;/h3&gt;

&lt;p&gt;AReaL’s ablations demonstrate that &lt;strong&gt;neither rate-limiting nor decoupled PPO alone is sufficient&lt;/strong&gt;:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Setup&lt;/th&gt;
      &lt;th&gt;Result&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Naive PPO, $\eta$=1&lt;/td&gt;
      &lt;td&gt;Degraded vs synchronous&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Naive PPO, $\eta$=4&lt;/td&gt;
      &lt;td&gt;Collapsed&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Decoupled PPO, $\eta$=$\infty$ (unbounded)&lt;/td&gt;
      &lt;td&gt;Degraded&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Decoupled PPO, $\eta \leq 8$&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;Matches synchronous oracle&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Prevention ($\eta$) bounds worst-case staleness; correction (decoupled PPO) handles the residual drift within that bound. The combination is what makes fully async training viable.&lt;/p&gt;

&lt;h2 id=&quot;source-4-moe-routing-mismatch&quot;&gt;Source 4: MoE Routing Mismatch&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: In Mixture-of-Experts models, the router selects which experts process each token. After a gradient update, the router’s decisions can change even for the same input. The effect is more pronounced in deeper MoE architectures, where small routing shifts compound across layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it’s listed separately&lt;/strong&gt;: Unlike Sources 1–3, this mismatch involves a &lt;strong&gt;discrete structural change&lt;/strong&gt; (different experts activated) rather than continuous numerical drift. However, mechanistically, MoE routing is &lt;strong&gt;entangled with Sources 2 and 3&lt;/strong&gt; — it amplifies both backend mismatch and policy staleness.&lt;/p&gt;

&lt;h3 id=&quot;how-routing-amplifies-sources-2-and-3&quot;&gt;How Routing Amplifies Sources 2 and 3&lt;/h3&gt;

&lt;p&gt;The Qwen team’s analysis [1] shows that for MoE models, the token-level IS weight decomposes as:&lt;/p&gt;

\[\frac{\pi_\theta(y_t \mid x, y_{&amp;lt;t})}{\mu_{\theta_{old}}(y_t \mid x, y_{&amp;lt;t})} = \underbrace{\frac{\pi_{\theta_{old}}(y_t \mid x, y_{&amp;lt;t}, e^{\pi}_{old,t})}{\mu_{\theta_{old}}(y_t \mid x, y_{&amp;lt;t}, e^{\mu}_{old,t})}}_{\text{training-inference discrepancy}} \times \underbrace{\frac{\pi_\theta(y_t \mid x, y_{&amp;lt;t}, e^{\pi}_t)}{\pi_{\theta_{old}}(y_t \mid x, y_{&amp;lt;t}, e^{\pi}_{old,t})}}_{\text{policy staleness}}\]

&lt;p&gt;where $e^{\pi}$ and $e^{\mu}$ denote the routed experts in the training and inference engines respectively. Expert routing now appears &lt;strong&gt;inside both factors&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Training-inference discrepancy&lt;/strong&gt;: Even with identical weights, the training engine and inference engine may route to different experts ($e^{\pi}_{old,t} \neq e^{\mu}_{old,t}$), amplifying the numerical differences that already exist from different kernels/precision.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Policy staleness&lt;/strong&gt;: After gradient updates, not only do the model parameters change, but the routed experts also shift $e^{\pi}_t \neq e^{\pi}_{old,t}$, compounding the distribution shift.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This entanglement is why MoE RL training is fundamentally harder to stabilize than dense model training — routing noise makes the first-order approximation break down faster.&lt;/p&gt;

&lt;h3 id=&quot;correction-routing-replay&quot;&gt;Correction: Routing Replay&lt;/h3&gt;

&lt;p&gt;The core idea: fix the routed experts during policy optimization so the model behaves like a dense one for IS computation purposes. Two variants exist [1]:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2026-03-01-Off-Policy-Corrections-LLM-RL/routing-replay-r2-vs-r3.drawio.png&quot; alt=&quot;R2 vs R3 Routing Replay comparison: R2 replays the training engine&apos;s own routes (eπ_old), while R3 passes the inference engine&apos;s routing decisions (eμ_old) to training.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R2 (Vanilla Routing Replay)&lt;/strong&gt;: Replay the &lt;strong&gt;training engine’s&lt;/strong&gt; routing ($e^{\pi}_{old,t}$). Reduces the policy staleness component. For the first mini-batch of a global step, the target policy is unaltered; for subsequent mini-batches, the forced routing deviates from the model’s natural routing, biasing the optimization target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R3 (Rollout Routing Replay)&lt;/strong&gt;: Replay the &lt;strong&gt;inference engine’s&lt;/strong&gt; routing ($e^{\mu}_{old,t}$). Reduces the training-inference discrepancy by forcing the training engine to use the same experts the inference engine chose. Always alters the target policy (even in the first mini-batch), since the training engine is forced to use the inference engine’s routing decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical finding — R2 vs R3 depends on off-policiness&lt;/strong&gt;:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Off-policiness $N$ (global batch / micro-batch)&lt;/th&gt;
      &lt;th&gt;Better variant&lt;/th&gt;
      &lt;th&gt;Why&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Small ($N$=2)&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;R2&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;R2 preserves target policy in first mini-batch; R3’s bias outweighs its discrepancy reduction&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Large ($N \geq 4$)&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;R3&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Training-inference discrepancy dominates; R3’s reduction of this factor outweighs its target-policy bias&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Very large ($N$=8)&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;R3 essential&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;R2 fails to sustain stable training; only R3 + clipping remains viable&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;On-policy training ($N$=1)&lt;/strong&gt;: Neither R2 nor R3 provides benefit — Routing Replay introduces bias without compensating gain. The basic algorithm with IS correction (no Routing Replay) achieves the best performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared drawbacks of both variants&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Additional memory overhead (must store routing decisions per token)&lt;/li&gt;
  &lt;li&gt;Communication overhead in distributed settings&lt;/li&gt;
  &lt;li&gt;Introduces optimization bias by forcing non-natural expert assignments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;alternative-sequence-level-is-gspo&quot;&gt;Alternative: Sequence-Level IS (GSPO)&lt;/h3&gt;

&lt;p&gt;GSPO sidesteps routing replay entirely by operating at the &lt;strong&gt;sequence level&lt;/strong&gt; using a geometric mean of per-token ratios: $s_i(\theta) = (\pi_\theta(y_i \mid x) / \pi_{\theta_{old}}(y_i \mid x))^{1/\lvert y_i \rvert}$. The intuition is that aggregating across the full sequence dilutes individual routing fluctuations, avoiding the bias that Routing Replay introduces by forcing expert assignments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caveats&lt;/strong&gt;: The geometric mean is not a principled approximation to the exact sequence-level IS product — it’s a heuristic that happens to have low variance. The GSPO paper [4] demonstrates stability on Qwen3’s MoE architecture, but Qwen3 appears to have relatively stable routing behavior. Whether this generalizes to architectures with more volatile routing (deeper models, different load-balancing) is unclear. The practical benefits (no Routing Replay memory/communication overhead, no target-policy bias) are real, but they come from a theoretically unmotivated transformation of the IS ratio.&lt;/p&gt;
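&lt;p&gt;The sequence-level ratio itself is easy to compute (the shapes and masking convention below are assumptions, not the paper&#8217;s reference code):&lt;/p&gt;

```python
import numpy as np

def gspo_seq_ratio(log_prob_cur, log_prob_old, response_mask):
    """GSPO sequence-level ratio: the geometric mean of per-token
    ratios, i.e. exp of the mean per-token log-ratio over response
    tokens. Arrays have shape (num_seqs, seq_len); response_mask is
    1 on response tokens and 0 on padding/prompt tokens."""
    log_ratio = (log_prob_cur - log_prob_old) * response_mask
    lengths = response_mask.sum(axis=1)
    return np.exp(log_ratio.sum(axis=1) / lengths)
```

&lt;p&gt;Note that computing in log space before exponentiating is what keeps the ratio stable for long sequences; the naive product of per-token ratios would overflow or underflow.&lt;/p&gt;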

&lt;h2 id=&quot;source-5-tool-call-trajectory-mismatch-agentic-rl--emerging&quot;&gt;Source 5: Tool-Call Trajectory Mismatch (Agentic RL) — Emerging&lt;/h2&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This source is &lt;strong&gt;relatively under-studied&lt;/strong&gt; compared to Sources 1–4. Most of the analysis below is based on theoretical reasoning and early observations rather than large-scale empirical validation. We include it because it will become increasingly important as agentic training scales, but readers should calibrate their confidence accordingly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What happens&lt;/strong&gt;: In agentic RL training, trajectories are multi-turn with interleaved LM output and tool/environment output:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Turn 1:  LM generates action a₁ ~ π_θ(·|x)
         Tool returns observation o₁  ← NOT from π_θ
Turn 2:  LM generates action a₂ ~ π_θ(·|x, a₁, o₁)
         Tool returns observation o₂  ← NOT from π_θ
...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Tool output tokens (code execution results, search snippets, API responses) are &lt;strong&gt;not drawn from the LM policy&lt;/strong&gt; — they come from the environment. While loss is not computed over tool output tokens (they are masked), they appear in the conditioning context for subsequent LM generations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The off-policy effect&lt;/strong&gt;: Unlike Sources 1–4, where the mismatch is between different versions or implementations of the &lt;em&gt;same policy&lt;/em&gt;, Source 5 involves tokens from a fundamentally &lt;strong&gt;different generative process&lt;/strong&gt; (the environment) appearing in the LM’s context. This has three consequences:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Distribution shift in conditioning&lt;/strong&gt;: Tool outputs push the LM into distribution regions it would rarely visit through autoregressive generation alone. This is by design — tool use is valuable precisely because it gives the LM access to information it couldn’t generate — but it creates a challenging optimization landscape.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;IS ratio instability&lt;/strong&gt;: When the policy updates (Sources 1/3), the LM’s response to tool outputs can shift disproportionately. Tool-conditioned continuations sit in a high-sensitivity region of the distribution: small weight changes $\theta \to \theta’$ can cause large changes in $\pi_{\theta’}(a_t \mid \text{context with tool output})$. This amplifies IS ratios for post-tool tokens, leading to more clipping and less gradient signal from the most informationally rich parts of the trajectory.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Aggravated backend mismatch (Source 2)&lt;/strong&gt;: Low-probability tokens are where floating-point precision differences between backends matter most — relative numerical error is proportionally larger in the tail of the distribution. Since tool-conditioned context pushes the LM to generate tokens it otherwise wouldn’t, the log-prob discrepancy between sampler and learner backends is systematically worse on post-tool tokens.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why standard IS doesn’t help&lt;/strong&gt;: For Sources 1–4, the correction is conceptually clear — ratio the generating policy against the assumed policy. For Source 5, there is no “tool output policy” to ratio against. The tool outputs are fixed observations from the environment; the issue is that conditioning on them creates a more volatile optimization target, not that they were sampled from the wrong distribution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What practitioners currently do&lt;/strong&gt;: Masking tool output tokens from loss is standard practice. Beyond that, some apply TIS-style ratio clipping to post-tool tokens, and some frameworks compute advantages at the turn level rather than the full trajectory level.&lt;/p&gt;
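&lt;p&gt;The masking convention can be sketched as follows (the segment representation is hypothetical; real frameworks track this via per-token metadata):&lt;/p&gt;

```python
def build_loss_mask(segments):
    """Per-token loss mask for an agentic trajectory. `segments` is a
    list of (token_count, source) pairs where source is 'lm' or 'tool'.
    Tool/environment tokens stay in the conditioning context but are
    excluded from the loss (mask 0), matching standard practice."""
    mask = []
    for count, source in segments:
        mask.extend([1 if source == "lm" else 0] * count)
    return mask

# Example: action (3 tokens), tool observation (2), action (2)
# -> [1, 1, 1, 0, 0, 1, 1]
```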

&lt;h2 id=&quot;how-corrections-compose&quot;&gt;How Corrections Compose&lt;/h2&gt;

&lt;p&gt;In production systems, multiple sources of mismatch coexist. Understanding how their corrections interact is critical.&lt;/p&gt;

&lt;h3 id=&quot;ratio-decomposition&quot;&gt;Ratio Decomposition&lt;/h3&gt;

&lt;p&gt;For a fully async, disaggregated MoE system, the complete correction at the token level decomposes into four independent factors:&lt;/p&gt;

\[\underbrace{\frac{\pi_{\text{learner}}(\theta_k)}{\pi_{\text{sampler}}(\theta_k)}}_{\substack{\textbf{backend (Source 2)} \\ \text{Detached, cap } C \in [2,10] \\ \text{Eliminated if same engine}}} \;\times\; \underbrace{\frac{\pi_{\text{prox}}}{\pi_{\theta_k}}}_{\substack{\textbf{staleness (Source 3)} \\ \text{Detached (weight)} \\ \text{Eliminated if sync}}} \;\times\; \underbrace{\frac{\pi_\theta}{\pi_{\text{prox}}}}_{\substack{\textbf{trust region (Source 1)} \\ \text{Clipped } [1\!-\!\epsilon,\; 1\!+\!\epsilon] \\ \text{Always present}}} \;\times\; \underbrace{[\text{routing fix}]}_{\substack{\textbf{MoE (Source 4)} \\ \text{R2/R3 replay} \\ \text{Eliminated if dense}}}\]

&lt;p&gt;This is why &lt;strong&gt;simplifying assumptions matter&lt;/strong&gt;: using the same engine for rollout and training eliminates the backend factor (rare in practice), synchronous training eliminates the staleness factor, and tight staleness bounds ($\eta$) keep it small.&lt;/p&gt;

&lt;p&gt;GSPO claims to dilute factor 4 via sequence-level aggregation, though the theoretical basis for this is weak (see Source 4 caveats).&lt;/p&gt;

&lt;h3 id=&quot;tis--multi-epoch-disaggregated-ppogrpo&quot;&gt;TIS × Multi-Epoch (Disaggregated PPO/GRPO)&lt;/h3&gt;

&lt;p&gt;The most common combination. The full loss:&lt;/p&gt;

\[\mathcal{L} = r_{\text{TIS}} \cdot \min\left(r_{\text{PPO}} \cdot A, \text{clip}(r_{\text{PPO}}, 1-\epsilon, 1+\epsilon) \cdot A\right)\]

&lt;p&gt;These are &lt;strong&gt;orthogonal&lt;/strong&gt;: TIS corrects for numerical differences at fixed weights; the PPO ratio corrects for weight changes in the same backend. They can be applied independently.&lt;/p&gt;
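&lt;p&gt;The composition can be sketched in NumPy (variable names and the cap value are illustrative; in an autograd framework the TIS ratio would be detached):&lt;/p&gt;

```python
import numpy as np

def tis_ppo_loss(lp_cur, lp_old_learner, lp_old_sampler, advantages,
                 eps=0.2, cap=5.0):
    """Combined loss: truncated importance sampling corrects the backend
    mismatch (same frozen weights, different engines), multiplied into
    the standard PPO clipped objective (weight changes, same backend)."""
    r_tis = np.minimum(np.exp(lp_old_learner - lp_old_sampler), cap)
    r_ppo = np.exp(lp_cur - lp_old_learner)
    clipped = np.clip(r_ppo, 1.0 - eps, 1.0 + eps)
    return -(r_tis * np.minimum(r_ppo * advantages,
                                clipped * advantages)).mean()
```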

&lt;h3 id=&quot;staleness--tis-fully-async-disaggregated&quot;&gt;Staleness × TIS (Fully Async Disaggregated)&lt;/h3&gt;

&lt;p&gt;When both async staleness and backend mismatch exist, three ratios are in play:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Backend correction&lt;/strong&gt;: $\frac{\pi_{\text{learner}}(\theta_k)}{\pi_{\text{sampler}}(\theta_k)}$ — same stale weights, different backends&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Staleness correction&lt;/strong&gt;: $\frac{\pi_{\text{prox}}}{\pi_{\text{behav}}}$ — different policy versions (decoupled PPO)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Trust region ratio&lt;/strong&gt;: $\frac{\pi_\theta}{\pi_{\text{prox}}}$ — current vs proximal policy (clipped)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In practice, corrections #1 and #2 can be folded together. The key principle: &lt;strong&gt;apply TIS ratio first&lt;/strong&gt; (to bring sampler log-probs to learner space), &lt;strong&gt;then apply staleness/PPO corrections&lt;/strong&gt; on the learner-space log-probs.&lt;/p&gt;

&lt;h3 id=&quot;moe--everything-else&quot;&gt;MoE × Everything Else&lt;/h3&gt;

&lt;p&gt;MoE routing is &lt;strong&gt;entangled with&lt;/strong&gt; Sources 2 and 3 — it amplifies both backend mismatch and policy staleness:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Higher off-policiness (more mini-batches per global step) makes routing instability worse, which is why R3 becomes necessary at $N \geq 4$&lt;/li&gt;
  &lt;li&gt;Backend mismatch can cause inconsistent routing decisions even before any policy drift occurs&lt;/li&gt;
  &lt;li&gt;Both Routing Replay and TIS/clipping are needed for stable off-policy MoE training. GSPO avoids Routing Replay but substitutes a theoretically unmotivated sequence-level aggregation (see caveats in Source 4)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;tool-call-trajectories--everything-else&quot;&gt;Tool-Call Trajectories × Everything Else&lt;/h3&gt;

&lt;p&gt;Source 5 does not introduce its own IS ratio — there is no “tool output policy” to correct against. Instead, it &lt;strong&gt;amplifies the volatility&lt;/strong&gt; of all other IS ratios. Post-tool tokens sit in high-sensitivity distribution regions, so Sources 1–4 all produce larger and more variable IS ratios on these tokens. The practical implication: systems with agentic trajectories should expect more aggressive clipping and may benefit from turn-level ratio isolation to prevent cross-turn IS instability.&lt;/p&gt;

&lt;h2 id=&quot;practical-decision-tree&quot;&gt;Practical Decision Tree&lt;/h2&gt;

&lt;pre class=&quot;mermaid&quot;&gt;
flowchart TD
    START{&quot;Which corrections do you
    need for your RL system?&quot;}

    START -- &quot;Same engine
    (rare in practice)&quot; --&amp;gt; SAME[&quot;No TIS needed
    1 FP on train node&quot;]
    START -- &quot;Different engines
    (most setups, incl. colocated)&quot; --&amp;gt; DIFF[&quot;Need TIS or resampling
    2 FPs on train node&quot;]

    SAME --&amp;gt; Q2{&quot;Training mode?&quot;}
    DIFF --&amp;gt; Q2

    Q2 -- &quot;Synchronous&quot; --&amp;gt; SYNC[&quot;No staleness
    correction&quot;]
    Q2 -- &quot;Asynchronous
    (rollout overlaps train)&quot; --&amp;gt; ASYNC[&quot;Need:
    • Rate-limiting (η)
    • Decoupled PPO&quot;]

    SYNC --&amp;gt; Q3{&quot;Model type?&quot;}
    ASYNC --&amp;gt; Q3

    Q3 -- &quot;Dense&quot; --&amp;gt; DENSE[&quot;Standard
    token-level IS&quot;]
    Q3 -- &quot;MoE&quot; --&amp;gt; MOE[&quot;Routing entangled w/ Sources 2 &amp;amp; 3:
    N=1: IS only, no replay
    N=2: R2 + clipping
    N≥4: R3 + clipping
    Alt: GSPO (heuristic)&quot;]

    DENSE --&amp;gt; Q4{&quot;Trajectory type?&quot;}
    MOE --&amp;gt; Q4

    Q4 -- &quot;Single-turn&quot; --&amp;gt; SINGLE[&quot;Standard handling&quot;]
    Q4 -- &quot;Agentic / tool-use&quot; --&amp;gt; AGENT[&quot;• Mask tool tokens
    • Consider turn-level credit assignment
    • Source 5 is emerging; expect more clipping&quot;]
&lt;/pre&gt;

&lt;h3 id=&quot;common-configurations&quot;&gt;Common Configurations&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Setup&lt;/th&gt;
      &lt;th&gt;Corrections needed&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Same engine, sync, dense, 1 epoch&lt;/td&gt;
      &lt;td&gt;None (simplest possible, but rare — sacrifices inference throughput)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Colocated, sync, dense, multi-epoch&lt;/td&gt;
      &lt;td&gt;TIS + PPO clipping (colocated still typically uses different engines)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Disaggregated, sync, dense&lt;/td&gt;
      &lt;td&gt;TIS + PPO clipping&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Disaggregated, async, dense&lt;/td&gt;
      &lt;td&gt;TIS + rate-limiting ($\eta$) + decoupled PPO&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Disaggregated, sync, MoE&lt;/td&gt;
      &lt;td&gt;TIS + Routing Replay (R2 or R3 depending on $N$)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Disaggregated, async, MoE&lt;/td&gt;
      &lt;td&gt;TIS + rate-limiting + decoupled PPO + Routing Replay&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] Zheng, C., Dang, K., Yu, B., et al. “Stabilizing Reinforcement Learning with LLMs: Formulation and Practices.” arXiv preprint arXiv:2512.01374 (2025).&lt;/p&gt;

&lt;p&gt;[2] Liu, K., Liu, J. K., Chen, M., &amp;amp; Liu, Y. “Rethinking KL Regularization in RLHF.” arXiv preprint arXiv:2503.01491 (2025).&lt;/p&gt;

&lt;p&gt;[3] Yao, F., Liu, L., Zhang, D., et al. “Your Efficient RL Framework Secretly Brings You Off-Policy RL Training.” &lt;a href=&quot;https://fengyao.notion.site/off-policy-rl&quot;&gt;Blog post&lt;/a&gt;. See also &lt;a href=&quot;https://github.com/yaof20/Flash-RL&quot;&gt;Flash-RL&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[4] Zheng, C., Liu, S., et al. “Group Sequence Policy Optimization.” arXiv preprint arXiv:2507.18071 (2025).&lt;/p&gt;

&lt;p&gt;[5] Mei, J. et al. “AReaL: An End-to-End Reinforcement Learning Framework for LLM Reasoning.” arXiv preprint arXiv:2505.24298 (2025).&lt;/p&gt;

&lt;hr /&gt;
</description>
        <pubDate>Sun, 01 Mar 2026 00:00:00 -0600</pubDate>
        <link>https://zoeyli.com/reinforcement%20learning/Off-Policy-Corrections-LLM-RL/</link>
        <guid isPermaLink="true">https://zoeyli.com/reinforcement%20learning/Off-Policy-Corrections-LLM-RL/</guid>
      </item>
    
      <item>
        <title>What's in Pass@K?</title>
        <description>&lt;div class=&quot;notice--info&quot;&gt;
  &lt;h2 id=&quot;tldr&quot;&gt;TL;DR&lt;/h2&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Computing pass@k&lt;/strong&gt; requires sampling N &amp;gt;= k responses and applying an unbiased combinatorial estimator — not simply sampling k times and counting.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Pass@k vs. pass@1:&lt;/strong&gt; pass@k is a nonlinear function of the pass rate p, which saturates on easy problems. Ranking models by pass@k effectively upweights hard problems.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Evaluation vs. checkpoint selection:&lt;/strong&gt; For evaluation, use large N and moderate k for stability. For checkpoint selection, you only need rankings — estimate $\hat{p} = c/N$ from moderate N and extrapolate to large k via the Bernoulli formula.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Efficient estimation for large k:&lt;/strong&gt; Use dynamic N (sample more on hard problems) and fit a Beta distribution to the difficulty distribution for stable extrapolation from limited samples.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Pass@k as RL reward:&lt;/strong&gt; It works, but is compute-inefficient — you spend equal compute on all problems then downweight easy ones. Better to upsample hard questions or allocate larger group sizes to them directly.&lt;/li&gt;
  &lt;/ul&gt;
&lt;/div&gt;

&lt;h2 id=&quot;computing-passk&quot;&gt;Computing Pass@k&lt;/h2&gt;

&lt;p&gt;Pass@k measures the probability that at least one of k sampled responses is correct. The standard way to estimate it [1] is:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Sample N responses from the model (e.g. at temperature 0.6).&lt;/li&gt;
  &lt;li&gt;Count the number of correct responses c.&lt;/li&gt;
  &lt;li&gt;Compute pass@k using the unbiased estimator:&lt;/li&gt;
&lt;/ol&gt;

\[\text{pass@k} = 1 - \frac{\binom{N-c}{k}}{\binom{N}{k}}\]

&lt;p&gt;This estimator works for any N &amp;gt;= k. It is equivalent to asking: out of all ways to choose k responses from the N total, what fraction includes at least one correct response?&lt;/p&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;Common mistake:&lt;/strong&gt; Sample only k responses and report c/k as pass@k. This gives a biased, high-variance estimate. The unbiased estimator requires N &amp;gt;= k, and using larger N reduces variance, giving a more stable estimate of pass@k.&lt;/p&gt;
&lt;/div&gt;
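&lt;p&gt;A minimal implementation of the estimator (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;math.comb&lt;/code&gt; returns 0 whenever the top argument is smaller than the bottom one, which correctly yields pass@k = 1 when fewer than k responses are incorrect):&lt;/p&gt;

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct:
    1 - C(n - c, k) / C(n, k)."""
    return 1.0 - comb(n - c, k) / comb(n, k)
```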

&lt;h2 id=&quot;pass1-vs-passk&quot;&gt;Pass@1 vs. Pass@k&lt;/h2&gt;

&lt;p&gt;At first glance, the two metrics seem to carry the same information. For a single problem, both pass@1 and pass@k are monotonically increasing in c (the number of correct responses out of N). And for reasonably large N, both give stable estimates of the model’s performance under random sampling.&lt;/p&gt;

&lt;p&gt;The key difference is in how they aggregate across problems. Let p be the per-problem pass rate (i.e. the probability of a single sample being correct). Then:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Pass@1&lt;/strong&gt; is linear in p: $\text{pass@1} = p$&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Pass@k&lt;/strong&gt; is nonlinear in p: $\text{pass@k} = 1 - (1-p)^k$&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As k grows, problems with high p see their pass@k saturate toward 1. A problem with p = 0.8 and a problem with p = 0.95 both have pass@k ≈ 1 for large k — the difference between them is effectively erased.&lt;/p&gt;

&lt;p&gt;Meanwhile, problems with low p remain far from saturation. A problem with p = 0.05 has pass@10 ≈ 0.40, while p = 0.15 gives pass@10 ≈ 0.80. The gap is amplified.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2026-01-30-Whats-in-Pass-at-K/pass1-vs-passk.svg&quot; alt=&quot;pass@1 vs pass@10&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure: Pass@1 (blue) and pass@10 (orange) for 10 problems sorted by difficulty. On easy problems (left), pass@10 saturates near 1 while pass@1 still varies. On hard problems (right), the gap between the two metrics widens. The dashed lines show the benchmark-level averages — pass@10 is dominated by hard problems where the model still has room to improve.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The consequence: &lt;strong&gt;ranking models by pass@k effectively upweights hard problems.&lt;/strong&gt; Improvements on easy problems (where both models already have high p) barely move the pass@k number, while improvements on hard problems (where p is low) show up clearly. This makes pass@k a useful complement to pass@1 when you care about a model’s ability to solve difficult tasks given multiple attempts.&lt;/p&gt;

&lt;h2 id=&quot;passk-for-model-evaluation-vs-checkpoint-selection&quot;&gt;Pass@k for Model Evaluation vs. Checkpoint Selection&lt;/h2&gt;

&lt;p&gt;The choice of N and k depends on what you’re using pass@k for. Two common use cases have quite different requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model evaluation.&lt;/strong&gt; The goal is to report a stable number that reflects how the model performs in practice. Users typically sample once (or a handful of times), so what matters is pass@1 or pass@k for small k. The main concern is &lt;strong&gt;stability&lt;/strong&gt;: you want large N so that the estimate of the pass rate p is precise, but k itself can stay moderate. Asymptotic correctness — what the model &lt;em&gt;could&lt;/em&gt; do given many attempts — is less important, because it doesn’t match how the model is actually used.&lt;/p&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;A common misconception:&lt;/strong&gt; Some papers justify using pass@k (with large k) as an evaluation metric because it “reflects generation diversity.” This conflates two things. Generation diversity is a property of the model’s output distribution, which is captured by the pass rate p estimated from large N. Large k doesn’t help you measure diversity — it just applies a nonlinear transform that compresses differences at the top of the distribution. To evaluate distributional properties, you need large N, not large k.&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Checkpoint selection.&lt;/strong&gt; The goal is different: you want to pick the best pretrain checkpoint for SFT, or the best SFT checkpoint for RL. Here you care about the &lt;strong&gt;potential&lt;/strong&gt; of a checkpoint — not its single-shot performance, but whether it &lt;em&gt;can&lt;/em&gt; solve the problem at all. This calls for large k, because pass@k with large k measures “does the model have this capability somewhere in its distribution?”&lt;/p&gt;

&lt;p&gt;The cost structure also differs. For checkpoint selection, you only need the &lt;em&gt;ranking&lt;/em&gt; between checkpoints, not the absolute score. This means we can go further than just using moderate N — we can drop the unbiased combinatorial estimator entirely and use the Bernoulli formula directly:&lt;/p&gt;

\[\text{pass@k} = 1 - (1 - \hat{p})^k, \quad \hat{p} = c / N\]

&lt;p&gt;This sidesteps the N &amp;gt;= k requirement altogether. We estimate the pass rate $\hat{p}$ from N samples, and then extrapolate to any k we want. The estimate of $\hat{p}$ from moderate N (e.g. N = 32) is noisy, so the absolute pass@k values won’t be precise — but that’s fine, because we only need the ranking between checkpoints.&lt;/p&gt;
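&lt;p&gt;The extrapolation is one line; a sketch with a hypothetical two-checkpoint comparison:&lt;/p&gt;

```python
def extrapolated_pass_at_k(c: int, n: int, k: int) -> float:
    """Bernoulli extrapolation for checkpoint ranking: estimate the
    pass rate p_hat = c/n, then compute 1 - (1 - p_hat)^k for any k,
    including k much larger than n. Absolute values are noisy at
    moderate n, but rankings between checkpoints are the goal here."""
    p_hat = c / n
    return 1.0 - (1.0 - p_hat) ** k

# Hypothetical: from N=32 samples, checkpoint A solved a problem twice
# and checkpoint B once; at k=256 A still ranks above B.
```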

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;Model Evaluation&lt;/th&gt;
      &lt;th&gt;Checkpoint Selection&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;What matters&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Stability of the score&lt;/td&gt;
      &lt;td&gt;Ranking between checkpoints&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;k&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Small to moderate&lt;/td&gt;
      &lt;td&gt;Large&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;N&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Large (for precise p)&lt;/td&gt;
      &lt;td&gt;Moderate (e.g. 32) is often sufficient&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Why&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Reflects real usage (few samples)&lt;/td&gt;
      &lt;td&gt;Measures potential / capability&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;efficient-estimation-of-passk-for-large-k&quot;&gt;Efficient estimation of pass@k for large k&lt;/h3&gt;

&lt;p&gt;When k is large, estimating pass@k with a uniform sampling budget per problem is wasteful. Two ideas can help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic N per problem.&lt;/strong&gt; Not all problems need the same sampling budget. For easy problems, a small N already gives a confident estimate of $\hat{p}$. For hard problems — where $\hat{p}$ is close to 0 — the estimate is dominated by whether you observe &lt;em&gt;any&lt;/em&gt; correct response at all. A practical strategy is to keep expanding N for hard problems until you observe 1–2 correct generations, then stop [6]. This concentrates compute where it matters most: on the hard tail of the difficulty distribution, which is exactly the region that governs pass@k for large k.&lt;/p&gt;
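&lt;p&gt;A toy simulation of this strategy, where a Bernoulli oracle stands in for generating and verifying one rollout (everything here is illustrative, not an implementation from [6]):&lt;/p&gt;

```python
import random

def adaptive_pass_estimate(p_true, target_correct=2, n_max=1024, seed=0):
    """Keep sampling one problem until target_correct successes are observed
    (or the budget n_max runs out), then return (pass-rate estimate, N used).
    rng.random() < p_true simulates 'this rollout was verified correct'."""
    rng = random.Random(seed)
    correct = total = 0
    while correct < target_correct and total < n_max:
        correct += rng.random() < p_true
        total += 1
    return correct / total, total

# Easy problem: stops after a handful of samples.
print(adaptive_pass_estimate(0.5))
# Hard problem: keeps expanding N, concentrating compute on the hard tail.
print(adaptive_pass_estimate(0.01))
```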

&lt;p&gt;&lt;strong&gt;Fitting a difficulty distribution.&lt;/strong&gt; Rather than estimating pass@k per problem independently, we can model the &lt;em&gt;distribution&lt;/em&gt; of pass rates across problems. Kazdan et al. [5] propose fitting a Beta distribution to the problem-level pass rates, then computing the expected pass@k under this distribution analytically. The Beta-Binomial model lets you estimate pass@k scaling from limited samples — you fit the Beta parameters $(\alpha, \beta)$ to the observed (successes, trials) counts across problems, and extrapolate to large k without ever sampling that many times. Combined with dynamic sampling (allocating more budget to hard problems), this gives reliable pass@k estimates at a fraction of the uniform-sampling cost.&lt;/p&gt;
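&lt;p&gt;The analytic step follows from Beta-Binomial conjugacy: if $p \sim \text{Beta}(\alpha, \beta)$, then $\mathbb{E}[(1-p)^k] = B(\alpha, \beta + k) / B(\alpha, \beta)$. A minimal sketch of that computation (fitting $(\alpha, \beta)$ to the observed counts, e.g. by moment matching or MLE, is omitted):&lt;/p&gt;

```python
from math import exp, lgamma

def expected_passk_beta(alpha, beta, k):
    """E[pass@k] = 1 - E[(1-p)^k] for p ~ Beta(alpha, beta).
    E[(1-p)^k] = B(alpha, beta + k) / B(alpha, beta), evaluated with
    log-gamma functions for numerical stability at large k."""
    log_ratio = (lgamma(beta + k) + lgamma(alpha + beta)
                 - lgamma(beta) - lgamma(alpha + beta + k))
    return 1.0 - exp(log_ratio)

# Sanity check: under a uniform prior (alpha = beta = 1),
# E[pass@k] = k / (k + 1), so k = 9 gives approximately 0.9.
print(expected_passk_beta(1.0, 1.0, 9))
```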

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2026-01-30-Whats-in-Pass-at-K/beta-passk-estimation.svg&quot; alt=&quot;Beta distribution and pass@k estimation&quot; /&gt;
&lt;em&gt;Left: Beta distributions fitted to true pass rates (blue) and noisy N=16 estimates (orange dashed). Right: pass@k scaling estimated four ways — ground truth (black), Beta fit from true p (blue dashed), Beta fit from N=16 (orange dashed), and per-problem Bernoulli from N=16 (grey dotted).&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;analysis-of-using-passk-as-an-rl-reward&quot;&gt;Analysis of Using Pass@k as an RL Reward&lt;/h2&gt;

&lt;p&gt;Even if the end goal is pass@1 performance, there’s a reason to care about hard problems during RL training. Training on easy (nearly solved) problems sharpens an already confident distribution further, reducing the model’s entropy. This impairs the model’s ability to explore new solutions and hurts its performance on hard problems. Downweighting easy examples helps preserve exploration capacity, especially in the early stages of RL.&lt;/p&gt;

&lt;p&gt;Given that pass@k upweights hard problems, a natural idea is to use pass@k as an RL reward. The mechanism [2] works as follows: divide the N rollouts for a problem into groups of k, and assign each group a reward equal to the maximum reward within that group (i.e. 1 if any response in the group is correct, 0 otherwise). All responses in the same group receive the same reward. This is effectively a Monte Carlo estimate of pass@k — and it inherits the same nonlinearity that upweights hard problems.&lt;/p&gt;
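&lt;p&gt;A minimal list-based sketch of the grouping mechanism described above:&lt;/p&gt;

```python
def passk_group_rewards(correct, k):
    """Monte Carlo pass@k reward: split the N rollouts for one problem into
    groups of k; every rollout in a group receives the group's max (0/1)
    reward, i.e. 1 if any rollout in the group is correct."""
    assert len(correct) % k == 0, "N must be divisible by k"
    rewards = []
    for start in range(0, len(correct), k):
        group = correct[start:start + k]
        rewards.extend([float(max(group))] * k)
    return rewards

# 8 rollouts, k = 4: one group contains a correct rollout, the other does not.
print(passk_group_rewards([0, 0, 1, 0, 0, 0, 0, 0], 4))
# [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```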

&lt;p&gt;This can help in the early stages of training by encouraging exploration on hard problems. But there are two issues worth noting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It’s an inefficient way to encourage exploration.&lt;/strong&gt; You still spend roughly the same compute on every problem — generating the same number of rollouts regardless of difficulty. The pass@k reward then downweights easy problems after the fact: on easy problems, most groups contain at least one correct response, so the reward is 1 for nearly all groups and the advantages after baseline normalization are small. You’ve paid for those rollouts but get little learning signal from them. A more direct approach is to allocate resources differently upfront:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Upsample hard questions.&lt;/strong&gt; If you know which problems are hard (from pass rate estimates), sample them more frequently in training batches.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Allocate more compute to hard problems.&lt;/strong&gt; For example, use a larger group size for hard questions — generating more rollouts per problem gives the model more chances to find a correct solution and produces a richer advantage signal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both achieve the same goal of focusing learning on hard problems, but by directing compute where it matters rather than spending it uniformly and discounting it later. For recent work along these lines, see Knapsack RL [3] and AR3PO [4].&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You eventually need to optimize pass@1.&lt;/strong&gt; The end goal is single-shot performance — the model should reliably produce correct answers on the first try. Optimizing pass@k encourages the model to place &lt;em&gt;some&lt;/em&gt; probability mass on correct solutions, but it doesn’t pressure the model to make the correct solution the &lt;em&gt;most likely&lt;/em&gt; one. At some point, training must shift back to a pass@1-aligned objective, or the model may plateau with a spread-out distribution that solves problems occasionally but not consistently.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] Chen, Mark, et al. “Evaluating Large Language Models Trained on Code.” arXiv preprint arXiv:2107.03374 (2021).&lt;/p&gt;

&lt;p&gt;[2] Chen, Zhipeng, et al. “Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models.” arXiv preprint arXiv:2508.10751 (2025).&lt;/p&gt;

&lt;p&gt;[3] Li, Ziniu, et al. “Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation.” arXiv preprint arXiv:2509.25849 (2025).&lt;/p&gt;

&lt;p&gt;[4] Zhang, Yuheng, et al. “Improving Sampling Efficiency in RLVR through Adaptive Rollout and Response Reuse.” arXiv preprint arXiv:2509.25808 (2025).&lt;/p&gt;

&lt;p&gt;[5] Kazdan, Joshua, et al. “Efficient Prediction of Pass@k Scaling in Large Language Models.” arXiv preprint arXiv:2510.05197 (2025).&lt;/p&gt;

&lt;p&gt;[6] Hu, Shengding, et al. “Predicting Emergent Abilities with Infinite Resolution Evaluation.” arXiv preprint arXiv:2310.03262 (2023).&lt;/p&gt;

</description>
        <pubDate>Fri, 30 Jan 2026 00:00:00 -0600</pubDate>
        <link>https://zoeyli.com/reinforcement%20learning/Whats-in-Pass-at-K/</link>
        <guid isPermaLink="true">https://zoeyli.com/reinforcement%20learning/Whats-in-Pass-at-K/</guid>
      </item>
    
      <item>
        <title>Implementing Training-Free Process Rewards in VeRL</title>
        <description>&lt;h2 id=&quot;motivation&quot;&gt;Motivation&lt;/h2&gt;

&lt;p&gt;Current RLVR (Reinforcement Learning with Verifiable Rewards) frameworks typically assign a single reward at the end of a response—correct or incorrect. But not all reasoning steps contribute equally. Some are critical insights, others are routine algebra, and some may be wasteful exploration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Process rewards&lt;/strong&gt; assign credit to intermediate steps, enabling step-level credit assignment. This unlocks several applications:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Better training efficiency&lt;/strong&gt;: Denser reward signal provides more gradient information per sample&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Breaking out of zero pass rate&lt;/strong&gt;: When a problem is too hard for any complete solution, partial progress can still be rewarded&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Less verbosity&lt;/strong&gt;: Penalize unproductive reasoning loops and overthinking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;the-challenge&quot;&gt;The Challenge&lt;/h3&gt;

&lt;p&gt;How do you obtain step-level reward signals? Two main approaches:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Train a Process Reward Model (PRM)&lt;/strong&gt;: Requires labeled data for intermediate steps [1] or a strong model judge&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Training-free signals&lt;/strong&gt;: Derive rewards from the policy itself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note that if you have a well-trained value network, that naturally provides token-level process rewards via TD-error. Value network training is out of scope for this post—we focus on training-free approaches that work with critic-free algorithms like RLOO/GRPO.&lt;/p&gt;

&lt;h2 id=&quot;training-free-process-rewards&quot;&gt;Training-Free Process Rewards&lt;/h2&gt;

&lt;h3 id=&quot;monte-carlo-estimation&quot;&gt;Monte Carlo Estimation&lt;/h3&gt;
&lt;p&gt;A key insight from VinePPO [2]: language environments are naturally “resettable”—you can return to any intermediate state simply by prompting with that prefix. This enables estimating V(prefix) at any point in a reasoning trace.&lt;/p&gt;

&lt;p&gt;VinePPO uses Monte Carlo estimation: sample K complete rollouts from each prefix and average their outcomes. This is expensive—for N steps and K samples, you need N×K rollouts per training example.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2026-01-10-process-rewards-in-verl/vinePPO.png&quot; alt=&quot;VinePPO illustration&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;log-probability-approximation&quot;&gt;Log-Probability Approximation&lt;/h3&gt;

&lt;p&gt;We approximate V(prefix) more efficiently using a single forward pass:&lt;/p&gt;

\[V(\text{prefix}) \approx \frac{1}{n}\sum_{i=1}^{n} \log P(a_i \mid \text{prefix} + \text{force\_prompt} + a_{&amp;lt;i})\]

&lt;p&gt;Instead of sampling K complete rollouts, we:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Truncate the response at an episode boundary&lt;/li&gt;
  &lt;li&gt;Append a “force answer” prompt (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;/think&amp;gt;\n\nThe answer is &lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;Compute the mean log-probability of the ground-truth answer tokens&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We use log-probability directly (not converted to probability) for numerical stability. This estimates “if the model were forced to answer now, how likely would it produce the correct answer?”&lt;/p&gt;
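&lt;p&gt;A framework-agnostic sketch of the three steps, where &lt;code&gt;logprob_fn&lt;/code&gt; is a hypothetical hook into your inference engine (not a VeRL API) that scores one token given a context:&lt;/p&gt;

```python
def estimate_prefix_value(prefix_ids, force_ids, answer_ids, logprob_fn):
    """Force-answer estimate of V(prefix): append the force-answer prompt to
    the truncated prefix, then teacher-force the ground-truth answer tokens
    and return their mean log-probability. logprob_fn(context, token) is a
    hypothetical scoring hook returning log P(token | context)."""
    context = list(prefix_ids) + list(force_ids)
    total = 0.0
    for tok in answer_ids:
        total += logprob_fn(context, tok)
        context.append(tok)  # condition on the previous ground-truth token
    return total / len(answer_ids)  # mean log-prob, kept in log space
```

In practice the loop is one batched forward pass over the concatenated sequence; the explicit loop here just makes the teacher-forcing structure visible.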

&lt;div class=&quot;notice--info&quot;&gt;
  &lt;p&gt;&lt;strong&gt;Assumptions and limitations&lt;/strong&gt;:&lt;/p&gt;
  &lt;ol&gt;
    &lt;li&gt;&lt;strong&gt;A small set of correct answers&lt;/strong&gt; (e.g., math problems with a single numerical answer). If the space of correct answers is large (e.g., open-ended instruction following), this approach won’t work.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;A compatible force-answer prompt&lt;/strong&gt;. The prompt (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;/think&amp;gt;\n\nThe answer is &lt;/code&gt;) must be consistent with the model’s chat template and training format.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Biased estimate&lt;/strong&gt;. Unlike Monte Carlo, this does not provide an unbiased estimate of V(prefix). We are effectively using the base model with a force-answer prompt as a &lt;em&gt;prover policy&lt;/em&gt; [4][5]—a separate policy that completes the solution from an intermediate state. The quality of V(prefix) depends on how well this prover correlates with the true probability of success.&lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;

&lt;h3 id=&quot;episode-segmentation&quot;&gt;Episode Segmentation&lt;/h3&gt;

&lt;p&gt;To identify intermediate states, we segment reasoning traces into “episodes” using discourse markers:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;EPISODE_MARKERS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;Wait,&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Alternatively,&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Actually,&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Hmm,&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;Let me &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;I need to &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;So &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;But &quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# ... more markers
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each marker indicates a potential state boundary where we can evaluate V(prefix).&lt;/p&gt;

&lt;p&gt;We also use a &lt;strong&gt;token length fallback&lt;/strong&gt;: if no markers are found within a maximum token limit (e.g., 256 tokens), we split at sentence boundaries. This prevents issues where the model produces no markers at all, which would result in a single giant episode.&lt;/p&gt;
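&lt;p&gt;A simplified sketch of the segmentation logic (character-based for brevity; the real implementation operates on tokens and falls back to sentence boundaries):&lt;/p&gt;

```python
def segment_episodes(text, markers, max_chars=1000):
    """Greedily cut the trace at the next discourse marker, with a length
    cap standing in for the token-limit fallback. Episodes concatenate back
    to the original text, so every token belongs to exactly one episode."""
    episodes, start = [], 0
    while start < len(text):
        # earliest marker occurrence strictly after the current episode start
        cuts = [text.find(m, start + 1) for m in markers]
        cuts = [c for c in cuts if c != -1]
        cut = min(cuts) if cuts else len(text)
        cut = min(cut, start + max_chars)  # length fallback: cap episode size
        episodes.append(text[start:cut])
        start = cut
    return episodes

print(segment_episodes("Let me try x=2. Wait, that fails. So x=3 works.",
                       ["Wait,", "So "]))
```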

&lt;pre class=&quot;mermaid&quot;&gt;
flowchart LR
    A[Full Response] --&amp;gt; B[Segment by markers]
    B --&amp;gt; C[Episode 1: Problem setup]
    B --&amp;gt; D[Episode 2: Initial approach]
    B --&amp;gt; E[Episode 3: Wait, let me reconsider...]
    B --&amp;gt; F[Episode N: Final answer]

    C --&amp;gt; G[&quot;V(prefix₀)&quot;]
    D --&amp;gt; H[&quot;V(prefix₁)&quot;]
    E --&amp;gt; I[&quot;V(prefix₂)&quot;]
    F --&amp;gt; J[Final reward]
&lt;/pre&gt;

&lt;h3 id=&quot;marginal-utility&quot;&gt;Marginal Utility&lt;/h3&gt;

&lt;p&gt;With V(prefix) at each episode boundary, we compute &lt;strong&gt;marginal utility&lt;/strong&gt;:&lt;/p&gt;

\[U_i = V(\text{prefix}_i) - V(\text{prefix}_{i-1})\]

&lt;p&gt;Since V(prefix) is in log-probability space, the difference is a log-likelihood ratio: it measures how much more (or less) likely the correct answer became after episode i.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U_i &amp;gt; 0&lt;/code&gt;: Episode i made progress&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U_i &amp;lt; 0&lt;/code&gt;: Episode i was counterproductive&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;U_i ≈ 0&lt;/code&gt;: Episode i didn’t change much (possibly wasteful)&lt;/li&gt;
&lt;/ul&gt;
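&lt;p&gt;The computation itself is a one-liner; here is a tiny worked example with hypothetical V(prefix) values:&lt;/p&gt;

```python
def marginal_utilities(values):
    """U_i = V(prefix_i) - V(prefix_{i-1}) for successive episode boundaries."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical mean log-probs of the correct answer after each episode:
# episode 2 made clear progress, episode 3 slightly hurt, episode 4 recovered.
print(marginal_utilities([-2.3, -1.1, -1.2, -0.1]))
```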

&lt;h2 id=&quot;implementing-in-verl&quot;&gt;Implementing in VeRL&lt;/h2&gt;

&lt;p&gt;The natural approach is:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RewardManager&lt;/code&gt; that returns &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reward_tensor: [batch_size, seq_len]&lt;/code&gt; with non-zero values throughout the sequence&lt;/li&gt;
  &lt;li&gt;Let VeRL’s training loop handle the rest&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But there are subtle issues at every layer of the stack.&lt;/p&gt;

&lt;h2 id=&quot;reward-manager-architecture&quot;&gt;Reward Manager Architecture&lt;/h2&gt;

&lt;p&gt;VeRL provides two architectures for reward computation:&lt;/p&gt;

&lt;h3 id=&quot;legacy-rewardmanager-synchronous&quot;&gt;Legacy RewardManager (Synchronous)&lt;/h3&gt;

&lt;p&gt;The original approach: implement a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RewardManager&lt;/code&gt; class that’s called synchronously during training.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;StepProgressRewardManager&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__call__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DataProto&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Compute rewards synchronously
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reward_tensor&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# [batch_size, seq_len]
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Characteristics:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Has access to the actor model&lt;/li&gt;
  &lt;li&gt;Blocks the training loop during computation&lt;/li&gt;
  &lt;li&gt;Simple to implement and debug&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: To use a custom &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RewardManager&lt;/code&gt;, you must set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;use_reward_loop=False&lt;/code&gt; in your config. Otherwise, VeRL defaults to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RewardLoopManager&lt;/code&gt; and silently bypasses your custom reward manager—a subtle source of bugs.&lt;/p&gt;
&lt;/div&gt;

&lt;h3 id=&quot;rewardloopmanager-async--now-default&quot;&gt;RewardLoopManager (Async) — Now Default&lt;/h3&gt;

&lt;p&gt;The default code path in VeRL when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reward_model.enable=True&lt;/code&gt;. Runs reward computation asynchronously via a separate model server.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CustomRewardLoopManager&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reward_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...):&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reward_model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reward_model&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Separate model instance (e.g., vLLM)
&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;async&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;compute_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Async reward computation
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reward_tensor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Characteristics:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Async processing—doesn’t block training&lt;/li&gt;
  &lt;li&gt;Well-suited for &lt;strong&gt;external reward models&lt;/strong&gt;: trained RMs, LLM-as-judge, rule-based verifiers&lt;/li&gt;
  &lt;li&gt;Does not have access to the actor model’s current weights&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;when-to-use-which&quot;&gt;When to Use Which&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Use Case&lt;/th&gt;
      &lt;th&gt;Recommended&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;External RM (trained reward model)&lt;/td&gt;
      &lt;td&gt;RewardLoopManager ✅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;LLM-as-judge&lt;/td&gt;
      &lt;td&gt;RewardLoopManager ✅&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Rule-based verification&lt;/td&gt;
      &lt;td&gt;Either works&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Rewards derived from actor model (e.g., V(prefix))&lt;/td&gt;
      &lt;td&gt;Legacy RewardManager ✅&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;div class=&quot;notice--info&quot;&gt;
  &lt;p&gt;&lt;strong&gt;Key consideration&lt;/strong&gt;: If your reward computation requires the &lt;em&gt;current policy&lt;/em&gt; (e.g., estimating V(prefix) for process rewards), the RewardLoopManager creates synchronization issues—the reward model copy can diverge from the actor during training. In this case, the legacy RewardManager is more appropriate.&lt;/p&gt;
&lt;/div&gt;

&lt;h3 id=&quot;optimization-pre-compute-during-generation&quot;&gt;Optimization: Pre-compute During Generation&lt;/h3&gt;

&lt;p&gt;Computing V(prefix) requires forward passes through the actor model. If done during reward computation, this blocks the training loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Pre-compute during generation, when the actor is already loaded.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# In generation phase:
# 1. Generate response
# 2. Segment into episodes
# 3. For each episode boundary:
#    - Construct prefix + force_answer_prompt + ground_truth
#    - Compute log_probs using actor (already loaded!)
#    - Store V(prefix) in batch.meta_info[&quot;prefix_value_cache&quot;]
&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# In training phase:
# 1. Retrieve pre-computed V(prefix) from cache
# 2. Compute marginal utilities (cheap math)
# 3. Compute process rewards (cheap math)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This gave us a &lt;strong&gt;60–70% speedup&lt;/strong&gt; compared to computing everything during training.&lt;/p&gt;

&lt;h2 id=&quot;advantage-estimation&quot;&gt;Advantage Estimation&lt;/h2&gt;

&lt;p&gt;Even with correct process rewards computed, most advantage estimators will destroy the fine-grained structure.&lt;/p&gt;

&lt;h3 id=&quot;the-data-flow&quot;&gt;The Data Flow&lt;/h3&gt;

&lt;p&gt;Here’s how rewards flow through VeRL’s training pipeline:&lt;/p&gt;

&lt;pre class=&quot;mermaid&quot;&gt;
flowchart TD
    A[RewardManager returns reward_tensor]
    B[compute_reward in trainer/ppo/reward.py]
    C[Store as token_level_scores]
    D[apply_kl_penalty optional]
    E[Store as token_level_rewards]
    F[compute_advantage in trainer/ppo/core_algos.py]
    G[GAE: Uses per-token rewards]
    H[GRPO: Sums to scalar]
    I[Token-level advantages]
    J[Scalar advantages]

    A --&amp;gt; B
    B --&amp;gt; C
    C --&amp;gt; D
    D --&amp;gt; E
    E --&amp;gt; F
    F --&amp;gt; G
    F --&amp;gt; H
    G --&amp;gt; I
    H --&amp;gt; J

    classDef good fill:#d4edda,stroke:#28a745,stroke-width:2px
    classDef bad fill:#f8d7da,stroke:#dc3545,stroke-width:2px

    class G good
    class H bad
&lt;/pre&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;The subtle issue&lt;/strong&gt;: Most advantage estimators immediately collapse token-level rewards to scalars:&lt;/p&gt;

  &lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Standard GRPO in VeRL (core_algos.py:301)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;token_level_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# [batch, seq_len] → [batch]
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;advantages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scores&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;group_mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;group_std&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# scalar per sequence
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;This &lt;strong&gt;destroys the fine-grained credit assignment&lt;/strong&gt; you carefully designed! Your process rewards are summed into a single number before computing advantages.&lt;/p&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The critical question&lt;/strong&gt;: Does your advantage estimator use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;token_level_rewards[:, t]&lt;/code&gt; at each timestep, or does it sum first?&lt;/p&gt;

&lt;h3 id=&quot;token-level-preserving-estimators&quot;&gt;Token-Level Preserving Estimators&lt;/h3&gt;

&lt;p&gt;You need an advantage estimator that operates on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;token_level_rewards[:, t]&lt;/code&gt; at each timestep. For example, the GAE estimator function preserves this structure:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;reversed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;delta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;token_level_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gamma&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nextvalues&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[:,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;lastgaelam&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gamma&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lam&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lastgaelam&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you want to keep using GRPO, you will need to implement your own estimator:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Collect all rewards from group
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;group_rewards&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;group&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Normalize each token by group statistics
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean_R&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;group_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;std_R&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;group_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rewards_normalized&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;rewards&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mean_R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std_R&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Compute advantages as cumulative sum
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;advantages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rewards_normalized&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cumsum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;normalization-pitfall-mixed-reward-scales&quot;&gt;Normalization Pitfall: Mixed Reward Scales&lt;/h3&gt;

&lt;div class=&quot;notice--warning&quot;&gt;
  &lt;p&gt;&lt;strong&gt;When combining outcome rewards with process rewards, normalize them separately.&lt;/strong&gt;&lt;/p&gt;

  &lt;p&gt;A subtle bug: normalizing outcome rewards (scale: 0-1) together with process rewards (scale: ~±0.03) makes the process reward signal negligible.&lt;/p&gt;

  &lt;p&gt;&lt;strong&gt;Fix&lt;/strong&gt;: Identify and normalize each reward type separately:&lt;/p&gt;

  &lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Outcome rewards at last token (like standard GRPO)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outcome_mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# last valid token of each response
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;id2outcome&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;group_idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;token_level_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outcome_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Process rewards at episode boundaries (separate normalization)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;process_mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nonzero_mask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;outcome_mask&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;id2process&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;group_idx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;token_level_rewards&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;process_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;
&lt;/div&gt;
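&lt;p&gt;As a minimal sketch of this fix (the function name and flat tensors are illustrative, not VeRL identifiers), each reward population is normalized over its own tokens only:&lt;/p&gt;

```python
import torch

def normalize_separately(token_rewards, outcome_mask, process_mask, eps=1e-6):
    """Normalize outcome and process rewards as two independent populations."""
    out = token_rewards.clone()
    for mask in (outcome_mask, process_mask):
        vals = token_rewards[mask]
        if vals.numel() > 1:  # std() needs at least two values
            out[mask] = (vals - vals.mean()) / (vals.std() + eps)
    return out
```

&lt;p&gt;After this, the tiny process rewards are on the same unit scale as the outcome rewards instead of being crushed by them.&lt;/p&gt;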

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;In this post, we discussed possible ways of obtaining a training-free process reward (one that does not rely on an external reward model) and walked through the implementation details in VeRL.&lt;/p&gt;

&lt;p&gt;A few takeaways:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Log-probability approximation is efficient&lt;/strong&gt;: Estimating V(prefix) via $\log P(\text{answer} \mid \text{prefix})$ requires one forward pass, vs K×N rollouts for MC estimation&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Marginal utility captures step-level progress&lt;/strong&gt;: $U_i = V(\text{prefix}_i) - V(\text{prefix}_{i-1})$ measures how much each episode helps or hurts&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Choose the right reward manager architecture&lt;/strong&gt;: Legacy &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RewardManager&lt;/code&gt; works better when rewards depend on the current policy; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RewardLoopManager&lt;/code&gt; (now default) is designed for external reward models&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Pre-compute during generation&lt;/strong&gt;: Move V(prefix) computation out of the training critical path for 60-70% speedup&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Use token-level advantage estimators&lt;/strong&gt;: Standard GRPO collapses to scalars—use GAE or token-preserving GRPO&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Normalize reward types separately&lt;/strong&gt;: Mixed-scale rewards (outcome + process) need separate normalization to preserve signal strength&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
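&lt;p&gt;The first two takeaways can be condensed into a short sketch; here &lt;code&gt;answer_logprob&lt;/code&gt; is a hypothetical stand-in for a single model forward pass that scores $\log P(\text{answer} \mid \text{prefix})$:&lt;/p&gt;

```python
def marginal_utilities(prefixes, answer_logprob):
    """U_i = V(prefix_i) - V(prefix_{i-1}), with V estimated via one
    forward pass per prefix instead of K x N Monte Carlo rollouts."""
    values = [answer_logprob(p) for p in prefixes]  # one forward pass each
    return [values[i] - values[i - 1] for i in range(1, len(values))]
```

&lt;p&gt;Each $U_i$ is positive when episode $i$ moves the model closer to the answer and negative when it hurts.&lt;/p&gt;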

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] Lightman et al. (2023). &lt;em&gt;Let’s Verify Step by Step&lt;/em&gt;. The Twelfth International Conference on Learning Representations. https://arxiv.org/abs/2305.20050&lt;/p&gt;

&lt;p&gt;[2] Kazemnejad et al. (2024). &lt;em&gt;VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment&lt;/em&gt;. https://arxiv.org/abs/2410.01679&lt;/p&gt;

&lt;p&gt;[3] VeRL: Volcano Engine Reinforcement Learning for LLMs. https://github.com/volcengine/verl&lt;/p&gt;

&lt;p&gt;[4] Setlur et al. (2024). &lt;em&gt;Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning&lt;/em&gt;. https://arxiv.org/abs/2410.08146&lt;/p&gt;

&lt;p&gt;[5] Qu et al. (2025). &lt;em&gt;Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning&lt;/em&gt;. https://arxiv.org/abs/2503.07572&lt;/p&gt;

</description>
        <pubDate>Sat, 10 Jan 2026 00:00:00 -0600</pubDate>
        <link>https://zoeyli.com/reinforcement%20learning/process-rewards-in-verl/</link>
        <guid isPermaLink="true">https://zoeyli.com/reinforcement%20learning/process-rewards-in-verl/</guid>
      </item>
    
      <item>
        <title>Implementing On-Policy Distillation: Lessons from Building OPD in VeRL</title>
        <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;The standard recipe for distilling a strong teacher into a smaller student is straightforward: generate reasoning traces from the teacher, fine-tune the student on those traces (SFT), then run RL.
Because the teacher traces are off-policy, the student may forget its own knowledge while fitting the teacher’s outputs, and a distribution shift appears at inference time: the student encounters states the teacher has never seen and cannot recover from them.
A further side effect is that the student inherits the teacher’s verbosity patterns from static data, then must unlearn them during RL.&lt;/p&gt;

&lt;p&gt;On-Policy Distillation (OPD) [1] seems promising as it operates on student outputs, thus removing the train-test mismatch issue. A natural idea is to integrate OPD with regular RL training since it can reuse most of the RL infrastructure. Conceptually, OPD can be directly implemented by replacing the reference policy used to compute the KL divergence term with the teacher policy.
The teacher then evaluates the student’s own rollouts and provides a KL divergence signal that pushes the student toward better reasoning—selectively, on prompts where the student’s pass rate is low. This idea has been explored concurrently in KDRL [2], which proposes a unified framework combining knowledge distillation with reinforcement learning but does not release an implementation. The idea is simple. The implementation is not.&lt;/p&gt;

&lt;p&gt;This post is the engineering story. We built OPD for the &lt;a href=&quot;https://github.com/volcengine/verl&quot;&gt;VeRL&lt;/a&gt; training framework, and document the architecture, the key implementation pitfalls, the design choice between implementing KL as a loss term vs. as advantage replacement, and the constraints we discovered along the way.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;architecture-overview&quot;&gt;Architecture Overview&lt;/h2&gt;

&lt;h3 id=&quot;why-not-just-swap-the-reference-policy&quot;&gt;Why Not Just Swap the Reference Policy?&lt;/h3&gt;

&lt;p&gt;As mentioned in the introduction, the conceptual pitch for OPD is simple: standard RL training already computes KL divergence against a reference policy (a frozen snapshot of the initial model) to prevent the student from drifting too far. Just point that KL term at the teacher instead, and the student gets pulled toward the teacher’s distribution rather than its own starting point.&lt;/p&gt;

&lt;p&gt;The primary obstacle is hardware. The reference policy is the same architecture as the student, so it can share the same GPUs (or be offloaded cheaply). A 72B teacher serving a 4B student cannot. The teacher must run on separate hardware, which means network communication, serialization, and a client-server protocol that the reference policy path was never designed for. This is the main architectural requirement that drives everything else.&lt;/p&gt;

&lt;p&gt;The teacher can also be a closed-source model queried through an API, or a shared service used by multiple RL workloads.&lt;/p&gt;

&lt;p&gt;Given these requirements, OPD needs to be a separate subsystem. The teacher runs on dedicated hardware behind a vLLM server, communicating via ZeroMQ:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Student Training                        Teacher Server 
┌──────────────────────────┐            ┌──────────────────────┐
│ 1. Generate rollouts     │            │  vLLM Engine         │
│ 2. Compute rewards       │            │  - Teacher model     │
│ 3. Compute pass rates    │            │  - Independent TP    │
│ 4. Create eligibility    │  ZeroMQ    │  - Logprob compute   │
│    masks                 │───────────►│                      │
│                          │◄───────────│                      │
│ 5. Fetch teacher logprobs│            └──────────────────────┘
│ 6. RL loss + KD loss     │
└──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;eligibility mask&lt;/strong&gt; determines which rollouts receive teacher guidance. For each prompt, we compute the pass rate—the fraction of rollouts that receive a positive reward. Prompts with low pass rates are “hard”: the student struggles with them. Only failed rollouts for hard prompts are eligible for OPD, focusing the teacher signal where it is most needed.&lt;/p&gt;
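&lt;p&gt;A minimal sketch of this masking logic (the function name and the 0.3 threshold are illustrative, not VeRL identifiers):&lt;/p&gt;

```python
import torch

def opd_eligibility(rewards, group_index, pass_threshold=0.3):
    """Mark failed rollouts of hard prompts as eligible for teacher guidance.

    rewards:     (N,) outcome reward per rollout (positive = pass)
    group_index: (N,) prompt id each rollout belongs to
    A prompt is "hard" if its pass rate is below pass_threshold;
    only its *failed* rollouts receive OPD.
    """
    eligible = torch.zeros_like(rewards, dtype=torch.bool)
    for gid in group_index.unique():
        sel = group_index == gid
        pass_rate = (rewards[sel] > 0).float().mean()
        if pass_rate < pass_threshold:
            eligible[sel] = rewards[sel] <= 0  # failed rollouts only
    return eligible
```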

&lt;hr /&gt;

&lt;h2 id=&quot;pitfall-1-legacy-vs-new-workerknow-your-code-path&quot;&gt;Pitfall 1: Legacy vs. New Worker—Know Your Code Path&lt;/h2&gt;

&lt;h3 id=&quot;the-problem&quot;&gt;The Problem&lt;/h3&gt;

&lt;p&gt;We implemented OPD, launched training, and saw encouraging metrics in W&amp;amp;B: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/frac_opd_samples&lt;/code&gt; at ~5%, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/frac_underperforming_prompts&lt;/code&gt; at ~5-10%. These are computed in the trainer before the update step. They confirmed the masking logic worked.&lt;/p&gt;

&lt;p&gt;But four worker-level metrics were missing: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/kl_loss&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/kd_coef&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/num_eligible_samples&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/frac_tokens_with_kd&lt;/code&gt;. No errors. No crashes. Training continued normally as pure RL—no teacher guidance was applied at all.&lt;/p&gt;

&lt;p&gt;The root cause: VeRL has &lt;strong&gt;two worker implementations&lt;/strong&gt; with different loss computation paths.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Implementation&lt;/th&gt;
      &lt;th&gt;Config Setting&lt;/th&gt;
      &lt;th&gt;Loss Location&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Legacy worker&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;use_legacy_worker_impl = &quot;auto&quot;&lt;/code&gt; (default)&lt;/td&gt;
      &lt;td&gt;Hardcoded inline in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp_actor.py:update_policy()&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;New worker&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;use_legacy_worker_impl = &quot;disable&quot;&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Pluggable via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;losses.py:ppo_loss()&lt;/code&gt; with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set_loss_fn()&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;We implemented OPD loss in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;losses.py&lt;/code&gt;—the new worker’s pluggable loss path. But the default configuration uses the legacy worker, where loss computation is hardcoded in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;update_policy()&lt;/code&gt;: policy gradient + entropy + KL penalty, with no OPD term. Our code was never called.&lt;/p&gt;

&lt;h3 id=&quot;why-it-was-hard-to-detect&quot;&gt;Why It Was Hard to Detect&lt;/h3&gt;

&lt;p&gt;Several factors conspired to make this silent:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Partial metrics created false confidence.&lt;/strong&gt; The trainer-level metrics (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opd/frac_opd_samples&lt;/code&gt;) are computed &lt;em&gt;before&lt;/em&gt; the worker step and appeared correctly, suggesting the pipeline was active.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;No crashes or errors.&lt;/strong&gt; The legacy worker’s key selection (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;select_keys&lt;/code&gt;) didn’t include OPD keys, so teacher logprobs were silently dropped from the batch. The loss function in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;losses.py&lt;/code&gt; was never called, so its conditions were never evaluated.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Training progressed normally.&lt;/strong&gt; Without OPD loss, training just ran pure RL. The model still learned, accuracy improved, everything looked reasonable.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;the-fix&quot;&gt;The Fix&lt;/h3&gt;

&lt;p&gt;Implement OPD loss directly in the legacy worker’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;update_policy()&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp_actor.py&lt;/code&gt;, after the existing KL penalty section.&lt;/p&gt;

&lt;p&gt;Also add OPD keys to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;select_keys&lt;/code&gt; so the data survives batch serialization:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# In dp_actor.py: Include OPD keys in worker data
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;teacher_log_probs&quot;&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;batch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;select_keys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extend&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;teacher_log_probs&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;opd_eligibility_mask&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;opd_horizon_mask&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;prompts&quot;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;pitfall-2-scalar-logprobsfull-distributions-are-not-needed&quot;&gt;Pitfall 2: Scalar Logprobs—Full Distributions Are Not Needed&lt;/h2&gt;

&lt;h3 id=&quot;the-math-dictates-the-implementation&quot;&gt;The Math Dictates the Implementation&lt;/h3&gt;

&lt;p&gt;OPD adds a KL divergence term between student and teacher distributions. Conceptually, computing full KL divergence requires the complete vocabulary distribution from both models:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;KL(p || q) = sum_x p(x) * log(p(x) / q(x))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For Qwen models with a vocabulary of ~152k tokens, this means transferring a 152k-dimensional vector per token position from the teacher server. At thousands of positions per sequence and hundreds of sequences per batch, this is impractical.&lt;/p&gt;

&lt;p&gt;Both KL estimators we use sidestep this entirely. They approximate KL divergence using only the scalar logprobs of the &lt;em&gt;actual generated tokens&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x_t&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;K1:  KL ≈ log p_student(x_t) - log p_teacher(x_t)
K2:  KL ≈ 0.5 * (log p_student(x_t) - log p_teacher(x_t))^2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Whether you choose K1 or K2 (discussed later in the loss vs. advantage section), both require exactly one scalar per position from the teacher—not a 152k-dimensional distribution.&lt;/p&gt;
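&lt;p&gt;Both estimators reduce to one line each over the per-token scalar logprobs; a sketch:&lt;/p&gt;

```python
import torch

def k1_kl(student_logp, teacher_logp):
    # Signed per-token estimate: positive where the student overcommits
    # relative to the teacher, negative where the teacher is more confident.
    return student_logp - teacher_logp

def k2_kl(student_logp, teacher_logp):
    # Non-negative per-token estimate, suitable as a loss to minimize.
    return 0.5 * (student_logp - teacher_logp) ** 2
```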

&lt;h3 id=&quot;the-mistake&quot;&gt;The Mistake&lt;/h3&gt;

&lt;p&gt;Our initial teacher server code missed this and requested full vocabulary logprobs from vLLM:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# BROKEN: Requesting full vocab
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sampling_params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SamplingParams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;prompt_logprobs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vocab_size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# 151,669 tokens!
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This hit vLLM’s limit: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;Requested prompt logprobs of 151669, which is greater than max allowed: 20&quot;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix requests only the top-1 logprob per position, then extracts the actual generated token’s probability:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# FIXED: Request only scalar logprobs
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sampling_params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SamplingParams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;prompt_logprobs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Just the actual next token
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# returned prompt_logprobs is a dictionary 
# Extract scalar logprob for each position
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;token_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;actual_token_id&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;token_ids&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;logprob&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;prompt_logprobs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pos&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;actual_token_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;].&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;logprob&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;token_logprobs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;logprob&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;A note on shape alignment.&lt;/strong&gt; Once teacher logprobs reached the worker, we hit a shape mismatch: the student worker returns response-only logprobs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[batch, 8192]&lt;/code&gt;, but the teacher server returns full-sequence logprobs such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[batch, 9216]&lt;/code&gt; (prompt + response). Teacher logprobs need to be sliced to the response portion (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;teacher_log_probs[:, prompt_len:]&lt;/code&gt;) before computing KL.&lt;/p&gt;

&lt;h3 id=&quot;takeaway&quot;&gt;Takeaway&lt;/h3&gt;

&lt;p&gt;Always review the mathematical requirements of your objective before implementing the data pipeline. Both K1 and K2 only need scalar logprobs—but the server code was written as if full KL were needed. Understanding the estimator formulas before writing the server would have avoided both the efficiency problem and the vLLM limit entirely.&lt;/p&gt;

&lt;h2 id=&quot;design-choice-kl-as-loss-vs-kl-as-advantage&quot;&gt;Design Choice: KL as Loss vs. KL as Advantage&lt;/h2&gt;

&lt;p&gt;Once the infrastructure works—teacher server responds, logprobs reach the worker, shapes align—you still face a design question: how should the teacher’s KL signal enter the training objective?&lt;/p&gt;

&lt;p&gt;We implemented two approaches. Both are viable, but they use different KL estimators and have different balancing challenges.&lt;/p&gt;

&lt;h3 id=&quot;approach-1-kl-as-a-separate-loss-term&quot;&gt;Approach 1: KL as a Separate Loss Term&lt;/h3&gt;

&lt;p&gt;Add a KD loss term alongside the policy gradient loss:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;L = L_PG + beta * L_KD
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;L_KD = mean(KL(student || teacher) * mask)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This approach should use the &lt;strong&gt;K2 estimator&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;KL_K2 = 0.5 * (log p_student - log p_teacher)^2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;K2 is appropriate here because it functions as a loss: always non-negative, with gradients that vanish when the student matches the teacher. The squared term gives it the right properties for minimization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The balancing problem.&lt;/strong&gt; K2 values are large when the student-teacher gap is significant. With a logprob difference of 2–3 nats (common early in training), K2 produces values of 2–4.5 per token. Meanwhile, the policy gradient loss is typically around 0.01. A coefficient of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;beta = 0.1&lt;/code&gt; means the KD term contributes ~0.2–0.45, dominating the PG loss by an order of magnitude.&lt;/p&gt;

&lt;p&gt;This makes coefficient tuning fragile. Too high and teacher guidance overwhelms the RL signal. Too low and it has no effect. The right value depends on training stage, problem difficulty, and student-teacher gap—all of which change over time.&lt;/p&gt;

&lt;p&gt;We also found that ratio clipping (clamping the teacher/student probability ratio, similar to PPO’s clip) helps stabilize this approach, but adds another hyperparameter.&lt;/p&gt;
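&lt;p&gt;Putting the pieces above together, a sketch of the masked K2 loss term (the helper name and the &lt;code&gt;beta&lt;/code&gt; default are illustrative; the mask restricts the loss to OPD-eligible tokens):&lt;/p&gt;

```python
import torch

def kd_loss_term(student_logp, teacher_logp, mask, beta=0.1):
    """K2-based distillation loss, averaged over eligible tokens only.
    `beta` is the fragile coefficient discussed above."""
    k2 = 0.5 * (student_logp - teacher_logp) ** 2
    denom = mask.sum().clamp(min=1)  # avoid division by zero when no token is eligible
    return beta * (k2 * mask).sum() / denom
```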

&lt;h3 id=&quot;approach-2-kl-as-advantage-replacement&quot;&gt;Approach 2: KL as Advantage Replacement&lt;/h3&gt;

&lt;p&gt;Instead of adding a separate loss, replace the RL advantages for hard prompts with teacher-derived advantages:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# For hard prompts: replace RL advantages with teacher signal
# Negate K1 so that tokens the teacher favors get positive advantage
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;opd_advantages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;log&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p_student&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;log&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p_teacher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;horizon_mask&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;advantages&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;where&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;eligible&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;          &lt;span class=&quot;c1&quot;&gt;# hard prompt, failed rollout
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;opd_advantages&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;# teacher guidance
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;rl_advantages&lt;/span&gt;      &lt;span class=&quot;c1&quot;&gt;# standard RL
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;horizon_mask&lt;/code&gt; limits OPD to the first K tokens of each response, since early tokens have more influence on the reasoning trajectory.&lt;/p&gt;
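&lt;p&gt;Constructing such a mask is straightforward; a sketch (the cutoff &lt;code&gt;K&lt;/code&gt; is a tunable assumption, not a VeRL default):&lt;/p&gt;

```python
import torch

def horizon_mask(batch_size, response_len, K=256):
    """1.0 for the first K response tokens of each rollout, 0.0 afterwards."""
    pos = torch.arange(response_len).unsqueeze(0).expand(batch_size, -1)
    return (pos < K).float()
```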

&lt;p&gt;This approach should use the &lt;strong&gt;K1 estimator&lt;/strong&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;KL_K1 = log p_student - log p_teacher
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;K1 is the right choice here because the KL signal is being used as a &lt;em&gt;reward&lt;/em&gt;, not a loss. The advantage is the negation of K1: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-(log p_student - log p_teacher) = log p_teacher - log p_student&lt;/code&gt;. This is directly interpretable as a per-token reward: positive when the teacher assigns higher probability to the generated token (the teacher “approves”), negative when it assigns lower probability. Squaring this (K2) would destroy the sign, losing the directional information that makes advantages useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No coefficient to balance.&lt;/strong&gt; Since teacher advantages replace RL advantages (rather than being added to a different loss), the policy gradient loss operates on a single unified set of advantages. There’s no beta to tune between competing loss terms.&lt;/p&gt;

&lt;h3 id=&quot;why-both-require-careful-normalization&quot;&gt;Why Both Require Careful Normalization&lt;/h3&gt;

&lt;p&gt;In standard GRPO, advantages are normalized within a batch (zero mean, unit variance). This normalization is critical for stable policy gradient updates.&lt;/p&gt;

&lt;p&gt;When OPD advantages enter the picture, &lt;strong&gt;normalization must be handled carefully regardless of approach&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the loss approach, the KD loss has a fundamentally different scale from the PG loss. The coefficient beta attempts to bridge this, but it’s a static scalar applied to a dynamic quantity.&lt;/p&gt;

&lt;p&gt;In the advantage approach, the issue is subtler. OPD advantages (raw KL differences) and RL advantages (normalized outcome rewards) have different distributions. If GRPO normalizes the combined set of advantages—some from RL, some from teacher KL—the two populations contaminate each other’s statistics. A batch dominated by hard prompts (many OPD advantages) shifts the normalization in ways that distort the RL signal for easy prompts, and vice versa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normalize OPD advantages separately.&lt;/strong&gt; Compute mean and variance over OPD-eligible tokens only, normalize those, then combine with separately-normalized RL advantages. This ensures each population is properly scaled before they enter the same policy gradient computation.&lt;/p&gt;
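&lt;p&gt;A sketch of this per-population normalization over flat token tensors (the function name is illustrative):&lt;/p&gt;

```python
import torch

def combine_advantages(rl_adv, opd_adv, eligible, eps=1e-6):
    """Normalize each advantage population over its own tokens, then merge."""
    out = rl_adv.clone()
    for mask, source in ((eligible, opd_adv), (~eligible, rl_adv)):
        vals = source[mask]
        if vals.numel() > 1:  # std() needs at least two values
            out[mask] = (vals - vals.mean()) / (vals.std() + eps)
    return out
```

&lt;p&gt;Each population enters the policy gradient at zero mean and unit variance, so a batch skewed toward hard prompts no longer distorts the statistics of the easy ones.&lt;/p&gt;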

&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;KL as Loss&lt;/th&gt;
      &lt;th&gt;KL as Advantage&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;KL estimator&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;K2: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0.5 * (log p_s - log p_t)^2&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;K1: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;log p_s - log p_t&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Why this estimator&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Non-negative, proper loss for minimization&lt;/td&gt;
      &lt;td&gt;Preserves sign, interpretable as per-token reward&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Balancing&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Coefficient beta between PG and KD losses&lt;/td&gt;
      &lt;td&gt;No coefficient; separate normalization instead&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Integration point&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Worker loss function (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp_actor.py&lt;/code&gt;)&lt;/td&gt;
      &lt;td&gt;Trainer, after advantage computation (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ray_trainer.py&lt;/code&gt;)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Hard/easy separation&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Masking within loss&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torch.where&lt;/code&gt; on advantages&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
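&lt;p&gt;Both estimators need only the scalar logprob of the sampled token under student and teacher. A minimal sketch of the two formulas from the table (function names are illustrative):&lt;/p&gt;

```python
def k2_kd_loss(logp_student, logp_teacher):
    # K2 estimator: non-negative, suitable as a loss to minimize
    return 0.5 * (logp_student - logp_teacher) ** 2

def k1_kl(logp_student, logp_teacher):
    # K1 estimator: signed, usable as a per-token reward/advantage
    return logp_student - logp_teacher
```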

&lt;hr /&gt;

&lt;h2 id=&quot;constraint-shared-tokenizer-and-chat-template&quot;&gt;Constraint: Shared Tokenizer and Chat Template&lt;/h2&gt;

&lt;h3 id=&quot;the-requirement&quot;&gt;The Requirement&lt;/h3&gt;

&lt;p&gt;Our current OPD implementation requires that the student and teacher share the same tokenizer. Both KL estimators compute a difference of logprobs at each position &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t&lt;/code&gt; for the generated token &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x_t&lt;/code&gt;. This only makes sense if both models agree on what token ID &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x_t&lt;/code&gt; represents.&lt;/p&gt;

&lt;p&gt;Sharing a tokenizer, however, is necessary but not sufficient. The student and teacher must also share the &lt;strong&gt;same chat template&lt;/strong&gt;.&lt;/p&gt;

&lt;h3 id=&quot;why-chat-template-matters&quot;&gt;Why Chat Template Matters&lt;/h3&gt;

&lt;p&gt;Even when two models use the same tokenizer (same vocabulary, same BPE merges), they may use different chat templates. Consider Qwen3-4B-Base and Qwen3-32B-Instruct:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Same tokenizer&lt;/strong&gt;: Both use the Qwen3 tokenizer with identical vocabulary&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Different templates&lt;/strong&gt;: The instruct model wraps generations in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;|im_start|&amp;gt;&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;|im_end|&amp;gt;&lt;/code&gt;, and may use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags; the base model produces raw text&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When computing KL divergence, the teacher assigns high probability to template tokens like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;|im_start|&amp;gt;&lt;/code&gt; at positions where the student has never learned to produce them. The token IDs match (same tokenizer), but the probability distributions are fundamentally misaligned. KL divergence explodes, gradients become unstable, and training diverges.&lt;/p&gt;
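&lt;p&gt;A toy numerical illustration of this failure mode (the probabilities are invented for illustration): at a content token the two models roughly agree and the per-token KL term is small, but at a template token the student has never learned to emit, the logprob gap is enormous.&lt;/p&gt;

```python
import math

def k1(logp_student, logp_teacher):
    # scalar K1 estimator evaluated at one sampled token
    return logp_student - logp_teacher

# content token: student and teacher roughly agree
content_term = k1(math.log(0.30), math.log(0.25))

# template token (an im_start-style marker) the student never learned:
# teacher is near-certain, student assigns it near-zero probability
template_term = k1(math.log(1e-8), math.log(0.99))
```

&lt;p&gt;With these made-up numbers the template term is roughly two orders of magnitude larger than the content term, which is exactly the kind of outlier that destabilizes gradients.&lt;/p&gt;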

&lt;h3 id=&quot;practical-solutions&quot;&gt;Practical Solutions&lt;/h3&gt;

&lt;p&gt;We identified two approaches:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Use a format-compatible teacher.&lt;/strong&gt; Run RL directly from the same base model checkpoint (without chat template SFT), so both student and teacher produce raw text. This works but requires training an additional large model, defeating the purpose of efficient training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2: Pre-align the student via Rejection Fine-Tuning (RFT).&lt;/strong&gt; Before OPD training, generate rollouts from the base student, filter to correct solutions, and fine-tune the student on these correct outputs formatted with the teacher’s template. The student learns to produce &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags and chat formatting, making its output structure compatible with the teacher’s distribution. This is the approach we use—we discuss the RFT procedure and its interaction with RL training in an upcoming post.&lt;/p&gt;
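&lt;p&gt;Schematically, the RFT data preparation looks like the following sketch (all function names are placeholders, not our actual pipeline):&lt;/p&gt;

```python
def build_rft_dataset(rollouts, reward_fn, format_with_teacher_template):
    """Hypothetical sketch of rejection fine-tuning data prep:
    keep only correct rollouts, reformatted into the teacher's chat template."""
    dataset = []
    for prompt, response, gold in rollouts:
        if reward_fn(response, gold) == 1.0:  # keep correct solutions only
            dataset.append(format_with_teacher_template(prompt, response))
    return dataset
```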

&lt;h3 id=&quot;current-limitation&quot;&gt;Current Limitation&lt;/h3&gt;

&lt;p&gt;Supporting different tokenizers between student and teacher would require token-level alignment (mapping between vocabularies), which we have not implemented. For now, OPD is restricted to student-teacher pairs that share a tokenizer and—after any necessary pre-alignment—produce structurally compatible outputs.&lt;/p&gt;

&lt;h2 id=&quot;relation-to-verls-gkd-recipe&quot;&gt;Relation to VeRL’s GKD Recipe&lt;/h2&gt;

&lt;p&gt;VeRL ships a &lt;a href=&quot;https://github.com/volcengine/verl/tree/main/recipe/gkd&quot;&gt;GKD (Generalized Knowledge Distillation) recipe&lt;/a&gt; that performs on-policy distillation with a similar teacher server architecture (vLLM + ZeroMQ). Our implementation diverges in three ways.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synchronous vs. off-policy scheduling.&lt;/strong&gt; Our OPD runs synchronously: the teacher scores the current step’s rollouts before the actor update begins. GKD overlaps computation phases using async schedulers (one-step-off, two-step-off): the actor update for step N runs concurrently with rollout generation for step N+1, and teacher inference is pipelined across steps. This delivers significant throughput gains but means the rollout policy is one or two steps behind the current parameters—slightly off-policy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top-k distributions vs. scalar logprobs.&lt;/strong&gt; GKD fetches top-k token distributions from the teacher (typically k=10), storing both logprobs and token indices per position. This enables computing KL divergence over the teacher’s top probability mass instead of a single sampled token. However, the KDRL paper reported this setup to be unstable in training, so we did not implement it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KL computed in Megatron engine vs. in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;update_policy()&lt;/code&gt;.&lt;/strong&gt; GKD computes KL divergence inside Megatron’s forward pass via a custom &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;logits_processor&lt;/code&gt; and a hand-written TP-aware autograd function that handles all-reduce across tensor-parallel ranks. This is efficient but ties the implementation to the Megatron backend. Our OPD operates on logprobs after the forward pass, in the actor’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;update_policy()&lt;/code&gt; (or in the trainer for the advantage approach). This works with VeRL’s FSDP backend and doesn’t require custom autograd, at the cost of not leveraging Megatron’s pipeline overlap or native TP-aware KL computation.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The OPD implementation touched 10 files across a distributed system spanning trainer, workers, and an external teacher server. The dominant failure mode was &lt;strong&gt;silent degradation&lt;/strong&gt;: every issue resulted in a condition quietly evaluating to false, with no errors and partial metrics suggesting everything was working.&lt;/p&gt;

&lt;p&gt;Four lessons stand out:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Trace the code path from config to gradient.&lt;/strong&gt; In a framework with multiple worker implementations, the clean pluggable interface may not be the one that’s active.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Let the math dictate the data pipeline.&lt;/strong&gt; Both K1 and K2 estimators only need scalar logprobs, not full vocabulary distributions. Understanding this upfront avoids both efficiency problems and API limits.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Match the KL estimator to the integration point.&lt;/strong&gt; K2 (squared) is a proper loss for minimization. K1 (linear) preserves sign information needed for advantages. Using the wrong estimator for the wrong approach either destroys useful signal or creates unstable optimization.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Format compatibility is a hard constraint.&lt;/strong&gt; Shared tokenizer is necessary but not sufficient—student and teacher must produce structurally compatible outputs for KL divergence to be meaningful.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Two broader challenges remain. First, balancing the teacher signal with RL rewards is difficult regardless of design approach. Whether it’s a loss coefficient or advantage normalization, getting this wrong means either the teacher overwhelms RL or has no effect at all.&lt;/p&gt;

&lt;p&gt;Second, OPD fundamentally assumes that the teacher and student policies are similar enough for the KL signal to be useful. Format compatibility (shared tokenizer and chat template) is the most visible aspect of this, but the issue runs deeper. Since we evaluate the teacher’s logprob on the student’s sampled tokens, tokens that the student generates but the teacher would rarely produce yield extreme logprob differences and noisy KL estimates. The more the student’s distribution diverges from the teacher’s—whether due to different training stages, different capabilities, or different reasoning styles—the noisier the OPD signal becomes. This is a fundamental limitation of on-policy distillation with scalar KL estimators: the signal quality depends on the overlap between the two policies.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] Agarwal et al. (2024). &lt;em&gt;On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes&lt;/em&gt;. ICLR 2024.&lt;/p&gt;

&lt;p&gt;[2] Xu et al. (2025). &lt;em&gt;KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning&lt;/em&gt;. https://arxiv.org/abs/2506.08946&lt;/p&gt;

&lt;p&gt;[3] VeRL: Volcano Engine Reinforcement Learning for LLMs. https://github.com/volcengine/verl&lt;/p&gt;
</description>
        <pubDate>Tue, 06 Jan 2026 00:00:00 -0600</pubDate>
        <link>https://zoeyli.com/reinforcement%20learning/implementing-on-policy-distillation/</link>
        <guid isPermaLink="true">https://zoeyli.com/reinforcement%20learning/implementing-on-policy-distillation/</guid>
      </item>
    
      <item>
        <title>Understanding Length Dynamics in RL Training</title>
        <description>&lt;h2 id=&quot;mystery-of-length-growth-from-rl&quot;&gt;Mystery of Length Growth from RL&lt;/h2&gt;

&lt;p&gt;DeepSeek-R1 [1] reported that Reinforcement Learning (RL) can drive models to “think” for thousands of steps to solve complex problems. This “thinking” process—manifested as extended Chain-of-Thought (CoT)—is widely believed to be the key to breakthrough reasoning performance.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/deepseek-r1.png&quot; alt=&quot;Deepseek R1&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, replicating this length explosion has proven elusive. While some replication efforts report a U-shaped growth pattern, others observe lengths declining or stagnating over training.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Two early replication efforts of DeepSeek-R1, DeepScaleR [2] and Simple-RL-Zero [3], both reported output length following a U-shaped curve over training steps&lt;/em&gt;&lt;/p&gt;
&lt;div style=&quot;display: flex; gap: 10px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/deepscaleR.png&quot; alt=&quot;DeepScaleR&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/simple-rl-zero.png&quot; alt=&quot;Simple-RL-Zero&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;At the same time, the paper “When More is Less: Understanding Chain-of-Thought Length in LLMs” [4] and the Concise Reasoning paper [9] both report declining output length over training:&lt;/p&gt;
&lt;div style=&quot;display: flex; gap: 20px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/more-is-less-length.png&quot; alt=&quot;more-is-less&quot; style=&quot;width: 60%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/concise-reasoning.png&quot; alt=&quot;concise-reasoning&quot; style=&quot;width: 40%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;This brings us to a fundamental question: what actually drives a model to think longer? Is it an inherent property of RL, a bias in the loss function design, or something else entirely? In this post, we systematically isolate the drivers of length dynamics.&lt;/p&gt;

&lt;h2 id=&quot;experimental-setup&quot;&gt;Experimental Setup&lt;/h2&gt;

&lt;p&gt;To cleanly reproduce DeepSeek-R1-Zero behavior, all of the experiments mentioned in this post use the following configuration unless otherwise specified:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Model initialization&lt;/strong&gt;: We use base model checkpoints (not instruction-tuned), following the R1-Zero approach of starting from pretrained weights&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;RL algorithm&lt;/strong&gt;: We use the default GRPO implementation in VeRL with a clip ratio of 0.2.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Reward design&lt;/strong&gt;: We employ only a binary correctness reward based on answer parsing (with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;math_verify&lt;/code&gt; for math datasets), with no format reward and no KL regularization term&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Generation parameters&lt;/strong&gt;: The maximum output length is set to 16,384 tokens to minimize truncation effects. Sampling uses temperature 1.0 and top-p 1.0.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hyperparameters&lt;/strong&gt;: Initial learning rate of 1e-6, batch size of 32 with 8 rollouts per prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This minimal setup allows us to isolate the effects of pure correctness-based RL on length dynamics, without confounding factors from auxiliary rewards or regularization terms.&lt;/p&gt;
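&lt;p&gt;The reward in this setup reduces to parse-then-compare. A minimal sketch, with &lt;code&gt;extract_answer&lt;/code&gt; standing in for a real verifier such as &lt;code&gt;math_verify&lt;/code&gt;:&lt;/p&gt;

```python
def binary_correctness_reward(output_text, gold_answer, extract_answer):
    """Toy sketch: 1.0 iff the parsed final answer matches the gold answer.
    `extract_answer` stands in for a real parser such as math_verify."""
    pred = extract_answer(output_text)
    return 1.0 if pred is not None and pred == gold_answer else 0.0
```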

&lt;h2 id=&quot;initial-experiments-length-decrease-and-stabilization&quot;&gt;Initial Experiments: Length Decrease and Stabilization&lt;/h2&gt;

&lt;p&gt;Our pilot run with Qwen3-4B on MATH failed to replicate the length-growth effect: output length decreased and then stabilized. The same held for Qwen2.5-3B and Qwen2.5-1.5B.
Note that we set the maximum output length to 16k to minimize truncation. Although the Qwen2.5-1.5B model takes longer to learn the correct format, &lt;em&gt;all&lt;/em&gt; models converged to an output length of ~600 tokens by the end of training.&lt;/p&gt;

&lt;p&gt;Even after we switched to the Nemotron Science dataset [5], we were still consistently observing this pattern of decrease and stabilization (bottom right figure).&lt;/p&gt;

&lt;div style=&quot;display: grid; grid-template-columns: 1fr 1fr; gap: 10px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-math-length.svg&quot; alt=&quot;Qwen3-4B on Math&quot; style=&quot;width: 100%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen2.5-3b-math-length.svg&quot; alt=&quot;Qwen2.5-3B on Math&quot; style=&quot;width: 100%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen2.5-1.5b-math-length.svg&quot; alt=&quot;Qwen2.5-1.5B on Math&quot; style=&quot;width: 100%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-science-length.svg&quot; alt=&quot;Qwen3-4B on Science&quot; style=&quot;width: 100%;&quot; /&gt;
&lt;/div&gt;

&lt;h3 id=&quot;hypothesis-1-is-kl-divergence-the-key&quot;&gt;Hypothesis 1: Is KL Divergence the Key?&lt;/h3&gt;

&lt;p&gt;In the Demystifying CoT paper [4], the authors attempt to reproduce the curve from Simple-RL-Zero [3] but fail to observe substantial length growth. They attribute what mild growth they do see to KL regularization pulling the model back toward the long-output distribution it started from.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/demystifying-cot-full.png&quot; alt=&quot;demystifying-cot&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Following this hypothesis, we added KL divergence back to the objective with a weight of 0.01 (same as Simple-RL-Zero):&lt;/p&gt;

&lt;div style=&quot;display: flex; gap: 20px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-kl-math-length.svg&quot; alt=&quot;qwen3-math-with-kl&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-kl-math-loss.svg&quot; alt=&quot;qwen3-math-with-kl-loss&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;Adding this KL term does not recover length growth; in fact, the KL loss increases while lengths still shrink. This suggests that KL regularization alone does not explain the discrepancy with Simple-RL-Zero and Demystifying CoT under our experiment settings.&lt;/p&gt;

&lt;h3 id=&quot;hypothesis-2-is-grpo-length-bias-in-effect&quot;&gt;Hypothesis 2: Is GRPO Length Bias in Effect?&lt;/h3&gt;

&lt;p&gt;The Dr.GRPO paper [6] points out an inherent bias in GRPO’s loss function that creates asymmetric incentives based on sequence length. Since we were using the GRPO algorithm out of the box, this seemed a plausible culprit.&lt;/p&gt;

&lt;h4 id=&quot;length-bias-statement-from-drgrpo&quot;&gt;Length Bias Statement (from Dr.GRPO)&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Original GRPO&lt;/strong&gt; computes the loss per sequence by averaging the advantage over tokens, then averaging across sequences:&lt;/p&gt;

\[\mathcal{L}_{\text{GRPO}} = \frac{1}{G} \sum_{i=1}^{G} \left\{ \frac{1}{|\mathbf{o}_i|} \sum_{t=1}^{|\mathbf{o}_i|} \min \left[ \frac{\pi_\theta(y_{i,t}|\mathbf{q}, \mathbf{o}_{i,&amp;lt;t})}{\pi_{\theta_{\text{old}}}(y_{i,t}|\mathbf{q}, \mathbf{o}_{i,&amp;lt;t})} \hat{A}_{y_i}, \text{clip}\left(\frac{\pi_\theta(y_{i,t}|\mathbf{q}, \mathbf{o}_{i,&amp;lt;t})}{\pi_{\theta_{\text{old}}}(y_{i,t}|\mathbf{q}, \mathbf{o}_{i,&amp;lt;t})}, 1-\epsilon, 1+\epsilon\right) \hat{A}_{y_i} \right] \right\}\]

&lt;p&gt;where the advantage $\hat{A}_{y_i}$ is computed per sequence, and $G$ is the group size. The critical issue is that &lt;strong&gt;the per-sequence loss averages over token count&lt;/strong&gt;, while &lt;strong&gt;the advantage is the same for all tokens in a sequence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This creates asymmetric incentives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For correct answers&lt;/strong&gt; ($\hat{A}_{y_i} &amp;gt; 0$):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Short sequence (length $L_{\text{short}}$): gradient $\propto \frac{\hat{A}_{y_i}}{L_{\text{short}}}$ → &lt;strong&gt;larger per-token update&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Long sequence (length $L_{\text{long}}$): gradient $\propto \frac{\hat{A}_{y_i}}{L_{\text{long}}}$ → &lt;strong&gt;smaller per-token update&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Result:&lt;/strong&gt; Favors shorter responses among correct ones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For incorrect answers&lt;/strong&gt; ($\hat{A}_{y_i} &amp;lt; 0$):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Short sequence: penalty $\propto \frac{\lvert\hat{A}_{y_i}\rvert}{L_{\text{short}}}$ → &lt;strong&gt;larger per-token penalty&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;Long sequence: penalty $\propto \frac{\lvert\hat{A}_{y_i}\rvert}{L_{\text{long}}}$ → &lt;strong&gt;smaller per-token penalty&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Result:&lt;/strong&gt; Favors longer responses among incorrect ones&lt;/li&gt;
&lt;/ul&gt;
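&lt;p&gt;A toy calculation makes the asymmetry concrete (the advantages and lengths are illustrative numbers):&lt;/p&gt;

```python
def per_token_scale(advantage, length):
    # under sequence-mean normalization, each token's gradient
    # contribution scales with advantage / sequence_length
    return advantage / length

# correct answers (positive advantage): short responses get larger updates
short_correct = per_token_scale(+1.0, 100)   # large per-token update
long_correct = per_token_scale(+1.0, 1000)   # small per-token update

# incorrect answers (negative advantage): long responses get milder penalties
short_wrong = abs(per_token_scale(-1.0, 100))
long_wrong = abs(per_token_scale(-1.0, 1000))
```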

&lt;p&gt;Returning to our pilot run with Qwen3-4B on MATH: the training reward approaches 0.8 and validation accuracy climbs to pass@1 = 75%.
Following Dr. GRPO’s reasoning, the shortening effect among correct responses should then dominate and drive the average length down.&lt;/p&gt;

&lt;div style=&quot;display: flex; gap: 5px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-math-rewards.svg&quot; alt=&quot;qwen3-math-reward&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-math-accuracy.svg&quot; alt=&quot;qwen3-math-validation-acc&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;However, applying the Dr.GRPO fix to the loss calculation gives us the figure below, which is still very similar to what we observed before:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-grpo-fix-math-length.svg&quot; alt=&quot;Qwen3-4b on Math with GRPO fix&quot; /&gt;&lt;/p&gt;

&lt;h4 id=&quot;verls-loss-aggregation-three-aggregation-methods&quot;&gt;VeRL’s Loss Aggregation: Three Aggregation Methods&lt;/h4&gt;

&lt;p&gt;A closer look at VeRL’s default implementation reveals that &lt;strong&gt;the default aggregation is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;token-mean&lt;/code&gt;&lt;/strong&gt;, following DAPO:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss_agg_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;token-mean&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# DAPO (default)
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;verl_F&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;masked_mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loss_mat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss_agg_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;seq-mean-token-sum&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# Dr. GRPO
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;seq_losses&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loss_mat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# token-sum
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_losses&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# seq-mean
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss_agg_mode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;seq-mean-token-mean&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# original GRPO
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;seq_losses&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loss_mat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;loss_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;/&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sum&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loss_mask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dim&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# token-mean
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;loss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;seq_losses&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;# seq-mean
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;em&gt;Code snippet taken from VeRL codebase&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This means &lt;strong&gt;our runs already use token-mean aggregation&lt;/strong&gt;, which eliminates sequence normalization length bias, so this particular GRPO length bias is not present in our setup, and cannot explain our initial observations.&lt;/p&gt;

&lt;p&gt;Let $L_i$ denote the length of sequence $i$, and $L_{\max}$ denote the maximum length (16k in our experiments). The three methods differ in how they aggregate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Original GRPO&lt;/strong&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seq-mean-token-mean&lt;/code&gt;):
\(\mathcal{L} = \frac{1}{G} \sum_{i=1}^{G} \left( \frac{1}{L_i} \sum_{t=1}^{L_i} \mathcal{L}_{i,t} \right)\)&lt;/p&gt;

&lt;p&gt;This normalizes by actual sequence length $L_i$, creating the length bias.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Dr.GRPO&lt;/strong&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;seq-mean-token-sum&lt;/code&gt;):
\(\mathcal{L} = \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{L_i} \mathcal{L}_{i,t} = \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{L_{\max}} \sum_{t=1}^{L_i} \mathcal{L}_{i,t} \right] \cdot L_{\max}\)&lt;/p&gt;

&lt;p&gt;By summing (or, equivalently, dividing every sequence by the same constant $L_{\max}$), this removes the length-dependent normalization and hence the asymmetry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Token-mean (DAPO [7])&lt;/strong&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;token-mean&lt;/code&gt;):
\(\mathcal{L} = \frac{1}{\sum_{i=1}^{G} L_i} \sum_{i=1}^{G} \sum_{t=1}^{L_i} \mathcal{L}_{i,t}\)&lt;/p&gt;

&lt;p&gt;This treats all tokens equally regardless of which sequence they belong to, eliminating length-based bias from aggregation.&lt;/p&gt;
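&lt;p&gt;The three modes can be mirrored in a few lines of plain Python on a ragged toy batch (a simplified re-implementation for illustration, not VeRL’s masked-tensor code):&lt;/p&gt;

```python
def aggregate(losses_per_seq, mode):
    """Toy version of the three aggregation modes, on a batch given
    as a list of per-token loss lists (ragged, no padding)."""
    G = len(losses_per_seq)
    if mode == "token-mean":                  # DAPO / VeRL default
        all_tokens = [x for seq in losses_per_seq for x in seq]
        return sum(all_tokens) / len(all_tokens)
    if mode == "seq-mean-token-sum":          # Dr.GRPO
        return sum(sum(seq) for seq in losses_per_seq) / G
    if mode == "seq-mean-token-mean":         # original GRPO
        return sum(sum(seq) / len(seq) for seq in losses_per_seq) / G
    raise ValueError(mode)
```

&lt;p&gt;On a batch with one short high-loss sequence and one long low-loss sequence, only &lt;code&gt;seq-mean-token-mean&lt;/code&gt; gives each sequence equal weight regardless of length; &lt;code&gt;token-mean&lt;/code&gt; weights every token equally.&lt;/p&gt;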

&lt;h2 id=&quot;difficulty-imbalance-drives-length-growth&quot;&gt;Difficulty Imbalance Drives Length Growth&lt;/h2&gt;

&lt;p&gt;As we were unable to reproduce the length growth with the MATH dataset or Nemotron Science dataset, we turned to explore other datasets.&lt;/p&gt;

&lt;p&gt;We then experimented with a hard subset of BigMath [8], a cleaned collection of 250k math problems for RL training. &lt;strong&gt;With extended training up to 1000 steps, we observed a clear U-shape in the output length.&lt;/strong&gt;&lt;/p&gt;

&lt;details&gt;
&lt;summary&gt;&lt;b&gt;BigMath Dataset Composition Details&lt;/b&gt;&lt;/summary&gt;

To ensure the dataset contains sufficiently challenging problems, we applied the following filtering to the original BigMath dataset:

&lt;strong&gt;Source filtering:&lt;/strong&gt; We kept only challenging sources (olympiads, omnimath, harp, aops_forum, amc_aime, big_math, openmath), while removing sources that overlap with previous experiments or are too easy (math, gsm8k, cn_k12, orca_math).

&lt;strong&gt;Difficulty filtering:&lt;/strong&gt; BigMath provides per-problem pass rates using Llama3.1-8B as a reference model, so we kept only problems with a pass rate below 30%.

&lt;/details&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-bigmath-length.svg&quot; alt=&quot;Qwen3-4b on BigMath&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As a control, if we remove the hard questions from the training set (those with pass@1 &amp;lt; 0.3 measured on the checkpoint at step 100), the output length fluctuates but shows no consistent upward trend.&lt;/p&gt;

&lt;div style=&quot;display: flex; gap: 5px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step100-difficulty-distribution.png&quot; alt=&quot;early-ckpt-difficulty-distribution&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/qwen3-4b-bigmath-easy-comparison.svg&quot; alt=&quot;easy-comparison&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Here ‘hard’ is defined relative to the early checkpoint, so this is a model-dependent notion of difficulty, and removing those questions also changes the content distribution. Nonetheless, under this intervention we no longer observe consistent length growth, which is consistent with the idea that the hard tail is a key driver.&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;a-moving-reward-distribution&quot;&gt;A Moving Reward Distribution&lt;/h3&gt;
&lt;p&gt;To understand why a dataset of mixed difficulty causes length growth, we create easy and hard splits of the validation set based on pass rate on an early checkpoint (step 100).
The easy split contains questions with pass@1 &amp;gt; 90%; the hard split contains questions with pass@1 &amp;lt; 10%.
We then examine the reward and output length distributions for each category.&lt;/p&gt;
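&lt;p&gt;The split itself is a simple thresholding step; a sketch using the thresholds above (function name is illustrative):&lt;/p&gt;

```python
def split_by_pass_rate(pass_rates, easy_threshold=0.9, hard_threshold=0.1):
    """Partition question ids by pass@1 measured on an early checkpoint.
    Thresholds follow the post: easy above 0.9, hard below 0.1."""
    easy = [q for q, p in pass_rates.items() if p > easy_threshold]
    hard = [q for q, p in pass_rates.items() if hard_threshold > p]
    return easy, hard
```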

&lt;p&gt;What we expect: hard questions should require longer responses to solve than easy ones.&lt;/p&gt;

&lt;p&gt;On the hard subset, the observations confirm this intuition: the model earns higher reward when its output is longer (the flat tail appears because the model never generates outputs that long).&lt;/p&gt;

&lt;div style=&quot;display: flex; gap: 5px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step100-hard-reward.png&quot; alt=&quot;early-ckpt-hard-reward&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step1000-hard-reward.png&quot; alt=&quot;late-ckpt-hard-reward&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;div style=&quot;display: flex; gap: 5px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step100-hard-length.png&quot; alt=&quot;early-ckpt-hard-length&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step1000-hard-length.png&quot; alt=&quot;late-ckpt-hard-length&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;p&gt;The median output length shifts significantly over training, demonstrating mode-seeking behavior toward higher reward.&lt;/p&gt;

&lt;p&gt;For the easy questions, since the model already solves them reliably, we would expect the length to stay unchanged: there is little room for improvement.&lt;/p&gt;

&lt;p&gt;However, when we examine the empirical reward distribution, we observe a rightward shift of the reward peak, indicating that the model can now maintain correctness at longer response lengths. The overall inverted-U shape remains—extremely long responses still decrease accuracy—but the curve has become flatter and the optimal length has shifted from approximately 500 to 700 tokens. This means the model has learned to generate longer responses while preserving accuracy on easy problems.&lt;/p&gt;

&lt;div style=&quot;display: flex; gap: 5px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step100-easy-run-2.png&quot; alt=&quot;early-ckpt-easy-reward&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step1000-easy-run-2.png&quot; alt=&quot;late-ckpt-easy-reward&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;


&lt;p&gt;The output length distribution also shifts in accordance with the reward distribution.&lt;/p&gt;
&lt;div style=&quot;display: flex; gap: 5px;&quot;&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step100-easy-length.png&quot; alt=&quot;early-ckpt-easy-length&quot; style=&quot;width: 50%;&quot; /&gt;
  &lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/a8-step1000-easy-length.png&quot; alt=&quot;late-ckpt-easy-length&quot; style=&quot;width: 50%;&quot; /&gt;
&lt;/div&gt;

&lt;!-- This reward distribution shift is much more mild when we remove the hard questions, the distribution stays similar to what we observe at 100 steps. 
![late-ckpt-easy-only-easy-reward](/assets/images/posts/2025-12-21-understanding-length-dynamics/a10-v2-step1000-easy-v2.png)
 --&gt;

&lt;p&gt;Combining our observations on easy problems and hard problems, we can quantify the shift in both output length and performance:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Checkpoint&lt;/th&gt;
      &lt;th&gt;Easy Problems (Pass Rate)&lt;/th&gt;
      &lt;th&gt;Easy Problems (Avg Length)&lt;/th&gt;
      &lt;th&gt;Hard Problems (Pass Rate)&lt;/th&gt;
      &lt;th&gt;Hard Problems (Avg Length)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Step 100 (Early)&lt;/td&gt;
      &lt;td&gt;95.1%&lt;/td&gt;
      &lt;td&gt;530&lt;/td&gt;
      &lt;td&gt;2.3%&lt;/td&gt;
      &lt;td&gt;1039&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Step 1000 (Late)&lt;/td&gt;
      &lt;td&gt;97.0%&lt;/td&gt;
      &lt;td&gt;661&lt;/td&gt;
      &lt;td&gt;5.5%&lt;/td&gt;
      &lt;td&gt;1301&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The table shows that both easy and hard problems experience length increases during training, with the magnitude of the increase depending on problem difficulty.&lt;/p&gt;

&lt;p&gt;This reveals the mechanism behind length growth on mixed-difficulty datasets. The model learns to generate longer outputs to solve hard problems (step 100→1000: +262 tokens, pass rate 2.3%→5.5%). However, this policy shift affects all problems broadly—easy problems also become longer (+131 tokens). Crucially, the model learns to maintain correctness on easy problems even at these longer lengths, which is reflected in the flattening of their empirical reward distribution. Without hard problems present (easy-only training), the model has no incentive to grow longer and the distribution remains stable.&lt;/p&gt;

&lt;div style=&quot;background-color: #fffbeb; border-left: 4px solid #f59e0b; padding: 20px; margin: 15px 0; border-radius: 4px;&quot;&gt;

&lt;strong&gt;The Principle of Asymmetric Risk&lt;/strong&gt;
&lt;br /&gt;
💡 &lt;strong&gt;Key Insight:&lt;/strong&gt; We hypothesize this behavior is driven by &lt;strong&gt;asymmetric risk&lt;/strong&gt;. For a mixed-difficulty dataset, the model faces a trade-off:
&lt;br /&gt;
&lt;strong&gt;On Hard Problems:&lt;/strong&gt; &quot;Thinking short&quot; carries a high risk of failure (low reward). &quot;Thinking long&quot; is necessary to solve them.
&lt;br /&gt;
&lt;strong&gt;On Easy Problems:&lt;/strong&gt; &quot;Thinking long&quot; carries little risk (as long as the length stays within the flattened band). As the reward distribution flattens, the model can spend extra tokens without hurting its accuracy.
&lt;br /&gt;
&lt;strong&gt;Consequently,&lt;/strong&gt; the model adopts a &quot;better safe than sorry&quot; global policy. It learns to extend its reasoning to cover the hard tail of the distribution. Since this extended length doesn&apos;t penalize performance on easy problems, the length drift bleeds into the easy subset—resulting in the &quot;overthinking&quot; we observe.

&lt;/div&gt;

&lt;h3 id=&quot;case-studies&quot;&gt;Case Studies&lt;/h3&gt;

&lt;p&gt;To understand how length growth manifests in individual outputs, we examine two problems at different difficulty levels.&lt;/p&gt;

&lt;h4 id=&quot;easy-problem-coin-counting&quot;&gt;Easy Problem: Coin Counting&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Petya has 25 coins, each with a denomination of 1, 2, 5, or 10 rubles. Among these coins, 19 are not two-ruble coins, 20 are not ten-ruble coins, and 16 are not one-ruble coins. How many five-ruble coins does Petya have?&lt;/p&gt;

&lt;p&gt;For this problem, the output length grew from &lt;strong&gt;493 tokens&lt;/strong&gt; (step 100) to &lt;strong&gt;602 tokens&lt;/strong&gt; (step 1000) on average, a &lt;strong&gt;22% increase&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A short, direct solution produced by the model looks like the following, at 338 tokens:&lt;/p&gt;

&lt;details&gt;
&lt;summary&gt;&lt;b&gt;📝 Example: Concise solution (338 tokens)&lt;/b&gt;&lt;/summary&gt;

&lt;div style=&quot;background-color: #f8f9fa; border-left: 4px solid #007bff; padding: 15px; margin: 10px 0;&quot;&gt;

To determine the number of five-ruble coins Petya has, we will use the information given and set up a system of equations. Let&apos;s denote the number of one-ruble coins by $a$, the number of two-ruble coins by $b$, the number of five-ruble coins by $c$, and the number of ten-ruble coins by $d$.

**We know the following:**
1. The total number of coins is 25.
2. There are 19 coins that are not two-ruble coins, which means there are $25 - 19 = 6$ two-ruble coins.
3. There are 20 coins that are not ten-ruble coins, which means there are $25 - 20 = 5$ ten-ruble coins.
4. There are 16 coins that are not one-ruble coins, which means there are $25 - 16 = 9$ one-ruble coins.

**From the above information, we can write the following equations:**

$$a + b + c + d = 25$$

$$b = 6, \quad d = 5, \quad a = 9$$

**Now, we substitute** $b = 6$, $d = 5$, and $a = 9$ into the first equation:

$$9 + 6 + c + 5 = 25$$

$$20 + c = 25$$

$$c = 5$$

Thus, the number of five-ruble coins Petya has is $\boxed{5}$.

&lt;/div&gt;
&lt;/details&gt;

&lt;p&gt;&lt;strong&gt;After training (step 1000)&lt;/strong&gt;, outputs became 22% longer. Here’s a representative example (595 tokens):&lt;/p&gt;

&lt;details&gt;
&lt;summary&gt;&lt;b&gt;📝 Example: Step 1000 solution with verification (595 tokens)&lt;/b&gt;&lt;/summary&gt;

&lt;div style=&quot;background-color: #fff8f0; border-left: 4px solid #ff9800; padding: 15px; margin: 10px 0;&quot;&gt;

Let&apos;s denote the number of 1-ruble coins, 2-ruble coins, 5-ruble coins, and 10-ruble coins by $a$, $b$, $c$, and $d$ respectively. We know the following:

1. The total number of coins is 25: $a + b + c + d = 25$
2. There are 19 coins that are not 2-ruble coins: $a + c + d = 19$
3. There are 20 coins that are not 10-ruble coins: $a + b + c = 20$
4. There are 16 coins that are not 1-ruble coins: $b + c + d = 16$

**We need to find the value of** $c$, the number of 5-ruble coins.

**First,** we subtract the second equation from the first equation:

$$(a + b + c + d) - (a + c + d) = 25 - 19$$

$$b = 6$$

**Next,** we subtract the third equation from the first equation:

$$(a + b + c + d) - (a + b + c) = 25 - 20$$

$$d = 5$$

**Then,** we subtract the fourth equation from the first equation:

$$(a + b + c + d) - (b + c + d) = 25 - 16$$

$$a = 9$$

**Now we substitute** $a = 9$, $b = 6$, and $d = 5$ into the first equation:

$$9 + 6 + c + 5 = 25$$

$$20 + c = 25$$

$$c = 5$$

**To verify, we check the consistency of the values with the given conditions:**

1. The total number of coins is: $a + b + c + d = 9 + 6 + 5 + 5 = 25$ ✓
2. The number of coins that are not 2-ruble coins is: $a + c + d = 9 + 5 + 5 = 19$ ✓
3. The number of coins that are not 10-ruble coins is: $a + b + c = 9 + 6 + 5 = 20$ ✓
4. The number of coins that are not 1-ruble coins is: $b + c + d = 6 + 5 + 5 = 16$ ✓

All conditions are satisfied, so the number of 5-ruble coins is $\boxed{5}$.

&lt;/div&gt;
&lt;/details&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt; The solution remains correct but becomes more elaborate. The model now includes explicit verification steps, checking each condition after deriving the answer. This adds length without providing any accuracy advantage for such a straightforward problem.&lt;/p&gt;
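&lt;p&gt;The deduction both checkpoints rely on is easy to check mechanically (a standalone sanity check, not part of any training pipeline):&lt;/p&gt;

```python
# Deduce each coin count from its "not X-ruble" complement, then verify.
total = 25
b = total - 19            # two-ruble coins:  25 - 19 = 6
d = total - 20            # ten-ruble coins:  25 - 20 = 5
a = total - 16            # one-ruble coins:  25 - 16 = 9
c = total - (a + b + d)   # five-ruble coins

# Re-check every condition from the problem statement.
assert a + b + c + d == 25
assert a + c + d == 19    # "19 are not two-ruble coins"
assert a + b + c == 20    # "20 are not ten-ruble coins"
assert b + c + d == 16    # "16 are not one-ruble coins"
print(c)                  # -> 5
```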

&lt;h4 id=&quot;harder-problem-circle-and-right-triangle-geometry&quot;&gt;Harder Problem: Circle and Right Triangle Geometry&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; On the leg BC of right triangle ABC, a circle is constructed with BC as the diameter. The circle intersects the hypotenuse at point D such that AD:BD = 1:3. The altitude from vertex C (the right angle) to the hypotenuse is 3. Find the leg BC.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/geometry_problem_visualization.png&quot; alt=&quot;Geometry Problem Visualization&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concise solution approach&lt;/strong&gt; (using geometric mean theorem):&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Since BC is diameter, ∠BDC = 90° by Thales’ theorem → D is the altitude foot&lt;/li&gt;
  &lt;li&gt;Apply altitude geometric mean: h² = AD · BD
    &lt;ul&gt;
      &lt;li&gt;3² = x · 3x (where AD = x, BD = 3x)&lt;/li&gt;
      &lt;li&gt;x = √3, so AB = 4√3&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Apply leg-segment relationship: BC² = BD · AB = 3√3 · 4√3 = 36
    &lt;ul&gt;
      &lt;li&gt;&lt;strong&gt;BC = 6&lt;/strong&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;
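&lt;p&gt;The three steps above can be checked numerically (variable names are ours):&lt;/p&gt;

```python
import math

h = 3                      # altitude from C to the hypotenuse
# Geometric mean theorem: h^2 = AD * BD = x * 3x  =>  x = sqrt(3)
x = math.sqrt(h**2 / 3)
AD, BD = x, 3 * x
AB = AD + BD               # hypotenuse = 4 * sqrt(3)
# Leg-segment relation: BC^2 = BD * AB = 3*sqrt(3) * 4*sqrt(3) = 36
BC = math.sqrt(BD * AB)
assert math.isclose(BC, 6.0)
# Cross-check via the right triangle BDC: BC^2 = BD^2 + h^2
assert math.isclose(BC**2, BD**2 + h**2)
```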

&lt;p&gt;For this problem, outputs from step 1000 averaged &lt;strong&gt;2021 tokens&lt;/strong&gt; compared to &lt;strong&gt;1544 tokens&lt;/strong&gt; at step 100—a &lt;strong&gt;31% increase&lt;/strong&gt;. Solutions from later training tend to use algebraic approaches with systems of equations rather than leveraging geometric relationships directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative solution from step 100&lt;/strong&gt; (clean geometric approach):&lt;/p&gt;

&lt;details&gt;
&lt;summary&gt;&lt;b&gt;✅ Example: Clean geometric solution from step 100&lt;/b&gt;&lt;/summary&gt;

&lt;div style=&quot;background-color: #f0f9ff; border-left: 4px solid #0ea5e9; padding: 15px; margin: 10px 0;&quot;&gt;

To solve the problem, we need to use the properties of the right triangle and the circle constructed on the hypotenuse. Let&apos;s denote the right triangle $ABC$ with $\angle ACB = 90^\circ$, and let $BC = a$, $AC = b$, and $AB = c$. The altitude from $C$ to the hypotenuse $AB$ is given as 3. This altitude splits the hypotenuse into two segments, $AD$ and $BD$, such that $AD : BD = 1 : 3$.

**Step 1: Apply geometric mean theorem**

In a right triangle, the altitude to the hypotenuse creates two smaller right triangles that are similar to the original triangle and to each other. Therefore, the altitude $CD$ is the geometric mean of the segments of the hypotenuse it creates.

Let $AD = x$ and $BD = 3x$. Then, $AB = AD + BD = x + 3x = 4x$. Since the altitude $CD$ is 3, we have:

$$CD^2 = AD \cdot BD$$

$$3^2 = x \cdot 3x$$

$$9 = 3x^2$$

$$x = \sqrt{3}$$

Thus, $AD = \sqrt{3}$ and $BD = 3\sqrt{3}$, and the hypotenuse $AB = 4\sqrt{3}$.

**Step 2: Use Thales&apos; theorem**

Since the circle is constructed on $BC$ as the diameter and intersects the hypotenuse at point $D$, we know that $\angle BDC = 90^\circ$ (Thales&apos; theorem). This means that $D$ is the foot of the altitude from $C$ to $AB$.

**Step 3: Apply Pythagorean theorem**

Since $D$ is the foot of the altitude, we can find the length of $BC$ using the Pythagorean theorem in the smaller right triangle $BDC$:

$$BC^2 = BD^2 + CD^2$$

$$BC^2 = (3\sqrt{3})^2 + 3^2$$

$$BC^2 = 27 + 9 = 36$$

$$BC = 6$$

Thus, the length of $BC$ is $\boxed{6}$.

&lt;/div&gt;
&lt;/details&gt;

&lt;p&gt;This solution efficiently combines the geometric mean theorem, Thales’ theorem, and the Pythagorean theorem to reach the answer in a clear logical sequence.&lt;/p&gt;

&lt;p&gt;However, not all step 100 outputs are this clean. Some start correctly but make algebraic errors mid-solution, yet still arrive at the correct answer:&lt;/p&gt;

&lt;details&gt;
&lt;summary&gt;&lt;b&gt;⚠️ Example: Flawed solution from step 100 (correct answer despite errors)&lt;/b&gt;&lt;/summary&gt;

&lt;div style=&quot;background-color: #fff5f5; border-left: 4px solid #ef4444; padding: 15px; margin: 10px 0;&quot;&gt;

To solve for the length of the leg $BC$ in the right triangle $ABC$ with the given conditions, we will use geometric properties and relationships. Let&apos;s denote the lengths of the sides as follows: $AB = c$, $BC = a$, and $CA = b$. The altitude from $C$ to the hypotenuse $AB$ is given as 3. The circle with diameter $BC$ intersects the hypotenuse $AB$ at point $D$ such that $AD : BD = 1 : 3$.

First, we use the property of the right triangle and the altitude to the hypotenuse. The altitude $CH$ from $C$ to $AB$ can be expressed as:

$$CH = \frac{ab}{c} = 3$$

This gives us the relationship: $ab = 3c$.

Next, we use the fact that $AD : BD = 1 : 3$. Let $AD = x$ and $BD = 3x$. Therefore, the length of the hypotenuse $AB$ is:

$$AB = AD + BD = x + 3x = 4x$$

Since $AB = c$, we have $c = 4x$.

**❌ Error: Fabricated formula.** Now, we use the fact that $D$ lies on the circle with diameter $BC$. This means that $\angle BDC = 90^\circ$. By the power of a point theorem (also known as the intersecting chords theorem), we have:

$$AD \cdot BD = CD \cdot HD$$

&lt;span style=&quot;color: #ef4444; font-weight: bold;&quot;&gt;This formula is incorrect! The power of point theorem doesn&apos;t apply this way.&lt;/span&gt;

Since $D$ is on the circle with diameter $BC$, $CD$ is the altitude from $C$ to $AB$, and $HD$ is the segment of the altitude from $D$ to $H$. However, since $D$ is on the circle, we can use the fact that $CD = CH = 3$ and $HD = 0$ (since $D$ is on the hypotenuse and the altitude is perpendicular to the hypotenuse). Thus, we have:

$$AD \cdot BD = 3 \cdot (c - 3)$$

Substituting $AD = x$ and $BD = 3x$, we get:

$$x \cdot 3x = 3 \cdot (4x - 3)$$

$$3x^2 = 12x - 9$$

$$x^2 - 4x + 3 = 0$$

$$(x - 1)(x - 3) = 0$$

Thus, the solutions are $x = 1$ or $x = 3$. Since $x = 3$ would make $AB = 12$, which is not possible given the altitude constraint, we have $x = 1$. Therefore, $AB = 4x = 4$.

Now, we need to find the length of $BC$. Using the relationship $ab = 3c$ and knowing $c = 4$, we have:

$$ab = 3 \cdot 4 = 12$$

We also know that in a right triangle, the legs $a$ and $b$ satisfy the Pythagorean theorem:

$$a^2 + b^2 = c^2 = 16$$

We can solve this system by expressing $b$ in terms of $a$:

$$b = \frac{12}{a}$$

Substituting into the Pythagorean equation:

$$a^4 - 16a^2 + 144 = 0$$

Let $y = a^2$:

$$y^2 - 16y + 144 = 0$$

$$y = \frac{16 \pm \sqrt{256 - 576}}{2} = \frac{16 \pm \sqrt{-320}}{2}$$

**❌ Error: Negative discriminant.** &lt;span style=&quot;color: #ef4444; font-weight: bold;&quot;&gt;This gives a negative discriminant, indicating the approach has gone wrong!&lt;/span&gt;

Since the discriminant is negative, we need to re-evaluate the possible values of $a$ and $b$. Given the constraints, we can solve for $a$ and $b$ directly by trial and error or by solving the quadratic equation correctly. The correct values are $a = 3$ and $b = 4$ or vice versa. Therefore, the length of $BC$ is:

$$\boxed{6}$$

&lt;/div&gt;
&lt;/details&gt;

&lt;p&gt;This example shows a common pattern at step 100: the solution makes mathematical errors (fabricated power-of-point formula, negative discriminant) but somehow recovers to produce the correct final answer—suggesting the model may be leveraging memorized answers rather than purely following its derivation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Representative solution from step 1000&lt;/strong&gt; (algebraic brute-force approach):&lt;/p&gt;

&lt;details&gt;
&lt;summary&gt;&lt;b&gt;📐 Example: Long algebraic solution from step 1000 (~100 lines)&lt;/b&gt;&lt;/summary&gt;

&lt;div style=&quot;background-color: #faf5ff; border-left: 4px solid #9333ea; padding: 15px; margin: 10px 0;&quot;&gt;

To solve the problem, we start with the right triangle $ABC$ where $\angle C = 90^\circ$. Let $BC = a$ and $AC = b$. The hypotenuse $AB = c$ can be found using the Pythagorean theorem:

$$c = \sqrt{a^2 + b^2}$$

A circle is constructed with $BC$ as the diameter, so the center of the circle is the midpoint of $BC$, and the radius is $\frac{a}{2}$. Since $BC$ is the diameter, any angle subtended by $BC$ on the circle is a right angle. Therefore, $\angle BDC = 90^\circ$, making $D$ the foot of the perpendicular from $C$ to the hypotenuse $AB$.

Given that $AD : BD = 1 : 3$, we can let $AD = x$ and $BD = 3x$. Thus, the length of the hypotenuse $AB = c = AD + BD = x + 3x = 4x$.

The altitude from $C$ to the hypotenuse $AB$ is given as 3. Let the foot of the perpendicular from $C$ to $AB$ be $D$. The area of the triangle $ABC$ can be calculated in two ways. First, using the legs $a$ and $b$:

$$\text{Area} = \frac{1}{2} \times a \times b$$

Second, using the hypotenuse $c$ and the altitude from $C$ to $AB$:

$$\text{Area} = \frac{1}{2} \times AB \times CD = \frac{1}{2} \times 4x \times 3 = 6x$$

Equating the two expressions for the area, we get:

$$\frac{1}{2} \times a \times b = 6x \implies ab = 12x$$

Since $AB = c = 4x$, we use the Pythagorean theorem:

$$c = \sqrt{a^2 + b^2} = 4x \implies a^2 + b^2 = (4x)^2 = 16x^2$$

We now have two key equations:

1. $ab = 12x$
2. $a^2 + b^2 = 16x^2$

We can express $b$ in terms of $a$ and $x$ from the first equation:

$$b = \frac{12x}{a}$$

Substitute $b = \frac{12x}{a}$ into the second equation:

$$a^2 + \left(\frac{12x}{a}\right)^2 = 16x^2 \implies a^2 + \frac{144x^2}{a^2} = 16x^2$$

Multiply through by $a^2$ to clear the fraction:

$$a^4 + 144x^2 = 16x^2a^2 \implies a^4 - 16x^2a^2 + 144x^2 = 0$$

Let $y = a^2$. Then the equation becomes a quadratic in $y$:

$$y^2 - 16x^2y + 144x^2 = 0$$

We solve this quadratic equation using the quadratic formula $y = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$:

$$y = \frac{16x^2 \pm \sqrt{(16x^2)^2 - 4 \cdot 1 \cdot 144x^2}}{2 \cdot 1} = \frac{16x^2 \pm \sqrt{256x^4 - 576x^2}}{2} = \frac{16x^2 \pm \sqrt{64x^2(4x^2 - 9)}}{2} = \frac{16x^2 \pm 8x\sqrt{4x^2 - 9}}{2} = 8x^2 \pm 4x\sqrt{4x^2 - 9}$$

So, we have two solutions for $y$:

$$y = 8x^2 + 4x\sqrt{4x^2 - 9} \quad \text{or} \quad y = 8x^2 - 4x\sqrt{4x^2 - 9}$$

Since $y = a^2$, we need to determine which solution is valid. We know $a$ and $b$ are positive, so we need to check the feasibility of both solutions.

If $y = 8x^2 + 4x\sqrt{4x^2 - 9}$, then:

$$a^2 = 8x^2 + 4x\sqrt{4x^2 - 9}$$

If $y = 8x^2 - 4x\sqrt{4x^2 - 9}$, then:

$$a^2 = 8x^2 - 4x\sqrt{4x^2 - 9}$$

We need to find $x$ such that both $a$ and $b$ are real and positive. Let&apos;s test the feasible solution:

Since $b = \frac{12x}{a}$, we substitute $a^2 = 8x^2 + 4x\sqrt{4x^2 - 9}$:

$$b^2 = \left(\frac{12x}{a}\right)^2 = \frac{144x^2}{a^2} = \frac{144x^2}{8x^2 + 4x\sqrt{4x^2 - 9}} = \frac{144x^2}{4x(2x + \sqrt{4x^2 - 9})} = \frac{36x}{2x + \sqrt{4x^2 - 9}}$$

Rationalizing the denominator:

$$b^2 = \frac{36x(2x - \sqrt{4x^2 - 9})}{(2x + \sqrt{4x^2 - 9})(2x - \sqrt{4x^2 - 9})} = \frac{36x(2x - \sqrt{4x^2 - 9})}{4x^2 - (4x^2 - 9)} = \frac{36x(2x - \sqrt{4x^2 - 9})}{9} = 4x(2x - \sqrt{4x^2 - 9})$$

We need to check if $a^2 + b^2 = 16x^2$:

$$a^2 + b^2 = (8x^2 + 4x\sqrt{4x^2 - 9}) + 4x(2x - \sqrt{4x^2 - 9}) = 8x^2 + 4x\sqrt{4x^2 - 9} + 8x^2 - 4x\sqrt{4x^2 - 9} = 16x^2$$

This confirms our solution is correct. Now we need to find $a$:

Since $a^2 = 8x^2 + 4x\sqrt{4x^2 - 9}$, we need to find $x$. Using the altitude property:

The area of the triangle is also given by:

$$\text{Area} = \frac{1}{2} \times AB \times CD = \frac{1}{2} \times 4x \times 3 = 6x$$

We already have $ab = 12x$. To find $x$, we use the relationship involving the segments of the hypotenuse:

Since $AD = x$ and $BD = 3x$, we use the property of the altitude in a right triangle:

$$CD^2 = AD \cdot BD = x \cdot 3x = 3x^2 \implies CD = \sqrt{3x^2} = x\sqrt{3}$$

But we know $CD = 3$, so:

$$x\sqrt{3} = 3 \implies x = \sqrt{3}$$

Now we find $a$:

$$a^2 = 8x^2 + 4x\sqrt{4x^2 - 9} = 8(\sqrt{3})^2 + 4\sqrt{3}\sqrt{4(\sqrt{3})^2 - 9} = 8 \cdot 3 + 4\sqrt{3}\sqrt{12 - 9} = 24 + 4\sqrt{3} \cdot \sqrt{3} = 24 + 4 \cdot 3 = 24 + 12 = 36$$

So, $a = \sqrt{36} = 6$.

Thus, the length of $BC$ is $\boxed{6}$.

&lt;/div&gt;
&lt;/details&gt;

&lt;p&gt;&lt;strong&gt;Analysis:&lt;/strong&gt; This step 1000 solution is mathematically complete and rigorous—every algebraic step is justified and the logic is sound. However, it takes an algebraic brute-force approach, setting up a quartic equation and using the quadratic formula with denominator rationalization. The solution doesn’t leverage the elegant geometric insight upfront (that D = H, making the geometric mean theorem directly applicable).&lt;/p&gt;

&lt;p&gt;What could be solved in 3 key steps (geometric mean → find x → apply the leg-segment relationship) instead becomes a ~100-line algebraic derivation. The model only applies the geometric mean theorem near the end, after extensive algebraic machinery. While this demonstrates strong algebraic skills, it represents a shift away from geometric insight toward computational thoroughness.&lt;/p&gt;

&lt;h4 id=&quot;key-takeaways-from-case-studies&quot;&gt;Key Takeaways from Case Studies&lt;/h4&gt;

&lt;p&gt;These examples illustrate two manifestations of length growth:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Added verification steps&lt;/strong&gt; (coin problem): The model learns to explicitly verify solutions even when unnecessary, adding ~100 tokens without improving accuracy on simple problems.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Shift toward algebraic thoroughness&lt;/strong&gt; (geometry problem): Later checkpoints favor algebraic brute-force methods over geometric insights, even when the geometric approach is more elegant and efficient. This increases token count by ~500 tokens while maintaining correctness.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both patterns demonstrate how training on mixed-difficulty datasets causes the model to apply more elaborate reasoning strategies across all problems—a form of “overthinking” on easy questions that emerges as a side effect of learning to solve harder ones.&lt;/p&gt;

&lt;h2 id=&quot;mitigating-overthinking-length-penalty-interventions&quot;&gt;Mitigating Overthinking: Length Penalty Interventions&lt;/h2&gt;

&lt;p&gt;Overthinking has been widely recognized as computationally wasteful, leading to growing interest in mitigation strategies. The most common approach is reward shaping via length penalties, which subtract reward from longer outputs. Truncation can be viewed as an extreme length penalty in which outputs exceeding a limit receive zero advantage.&lt;/p&gt;

&lt;p&gt;We test several penalty variants and connect them to our findings on the mechanism behind length growth.&lt;/p&gt;

&lt;h3 id=&quot;the-challenge&quot;&gt;The Challenge&lt;/h3&gt;

&lt;p&gt;The asymmetric risk principle creates a dilemma: the model needs longer reasoning chains for hard problems, but this drives wasteful length growth on easy problems where correctness can be maintained at shorter lengths. An ideal intervention would:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Preserve hard problem performance&lt;/strong&gt;: Allow the model to “think long” when necessary&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Prevent easy problem waste&lt;/strong&gt;: Discourage unnecessarily long responses on simple problems&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Adapt to problem difficulty&lt;/strong&gt;: Different problems have different “natural” solution lengths&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;length-penalty-variants&quot;&gt;Length Penalty Variants&lt;/h3&gt;

&lt;p&gt;We implement and evaluate three penalty mechanisms, each with different strategies for balancing efficiency and performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Variable definitions:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;$R \in \{0, 1\}$ is the binary correctness reward&lt;/li&gt;
  &lt;li&gt;$L$ is the response length in tokens&lt;/li&gt;
  &lt;li&gt;$\alpha, \beta, \gamma$ are penalty coefficients (hyperparameters)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;1-uniform-length-penalty&quot;&gt;1. Uniform Length Penalty&lt;/h4&gt;

\[\text{reward} = R - \alpha \times L\]

&lt;p&gt;The simplest approach applies a constant penalty per token across all problems, regardless of difficulty.&lt;/p&gt;
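&lt;p&gt;A minimal sketch of this reward shaping (function name is illustrative; $\alpha$ defaults to the value used in our experiments):&lt;/p&gt;

```python
def uniform_length_penalty(R: float, L: int, alpha: float = 1e-4) -> float:
    """Shaped reward: constant per-token penalty, identical for every problem.

    R is the binary correctness reward, L the response length in tokens.
    """
    return R - alpha * L

# A correct 1000-token response keeps most of its reward: 1 - 1e-4 * 1000 = 0.9
```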

&lt;h4 id=&quot;2-group-based-adaptive-penalty-kimi-k15-10&quot;&gt;2. Group-Based Adaptive Penalty (Kimi K1.5 [10])&lt;/h4&gt;

\[\lambda = 0.5 - \frac{L - \text{min\_len}}{\text{max\_len} - \text{min\_len}}\]

\[\text{reward} = \begin{cases}
R + \beta \times \lambda &amp;amp; \text{(correct responses)} \\
R + \beta \times \min(0, \lambda) &amp;amp; \text{(incorrect responses)}
\end{cases}\]

&lt;p&gt;Inspired by Kimi K1.5 [10], this approach tracks the min/max response length per prompt and applies difficulty-aware penalties. It automatically adapts to each problem’s natural length range, rewarding brevity for correct responses while penalizing long incorrect ones. Note that $\lambda \in [-0.5, 0.5]$ normalizes the penalty based on the group’s length distribution.&lt;/p&gt;
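&lt;p&gt;A sketch of this group-normalized shaping, assuming access to the lengths of all sampled responses for the same prompt (names are illustrative):&lt;/p&gt;

```python
def group_based_penalty(R, L, group_lengths, beta=0.1):
    """Kimi K1.5-style adaptive shaping: normalize L against the group's
    min/max response length for the same prompt.

    Correct responses are rewarded for brevity (lambda up to +0.5);
    incorrect responses are only ever penalized (min(0, lambda)),
    never rewarded for being long.
    """
    min_len, max_len = min(group_lengths), max(group_lengths)
    if max_len == min_len:
        return R  # degenerate group: no length spread to normalize against
    lam = 0.5 - (L - min_len) / (max_len - min_len)
    return R + beta * (lam if R > 0 else min(0.0, lam))
```

Because the normalization is relative to the current batch, the definition of "long" moves as the policy shortens, which is exactly the failure mode discussed in the experiments below.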

&lt;h4 id=&quot;3-anchor-length-penalty&quot;&gt;3. Anchor Length Penalty&lt;/h4&gt;

\[r = \max\left(0, \frac{L - \text{anchor\_length}}{\text{anchor\_length}}\right)\]

\[\text{reward} = R - \gamma \times r\]

&lt;p&gt;where $\text{anchor\_length}$ is set when the model first achieves reliable competence (pass@1 &amp;gt; 90%).&lt;/p&gt;

&lt;p&gt;This penalty is also adaptive per prompt, but uses a reference length from when the model first demonstrated competence on that problem, rather than the current group’s responses.&lt;/p&gt;
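&lt;p&gt;A sketch of the anchor penalty (names are illustrative; $\gamma$ defaults to our main setting). Only the excess over the per-prompt anchor is penalized, so responses at or below the anchor are untouched:&lt;/p&gt;

```python
def anchor_penalty(R, L, anchor_length, gamma=0.01):
    """Anchor-based shaping: penalize only length in excess of the anchor,
    the response length recorded when the model first reached reliable
    competence on this prompt. Shorter responses incur no penalty.
    """
    excess = max(0.0, (L - anchor_length) / anchor_length)
    return R - gamma * excess
```

Unlike the group-based variant, the reference point is frozen, so the penalty does not chase the policy downward as responses shorten.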

&lt;h3 id=&quot;experimental-results&quot;&gt;Experimental Results&lt;/h3&gt;

&lt;p&gt;In the experiments below, we set $\alpha=10^{-4}, \beta=0.1, \gamma=0.01$. These values were chosen so the penalty magnitude roughly matches the binary reward (given initial output lengths of $\sim 1000$ tokens).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: These are initial hyperparameter settings; further tuning may improve performance.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The figures below compare the four approaches across training:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Control (Pink) - No Penalty Baseline&lt;/strong&gt;: Shows continued length growth from ~1000 to ~1300 tokens as the model learns to solve harder problems, with easy problem lengths also increasing (the “overthinking” phenomenon).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Uniform Penalty (Orange) - The Exploration Tax&lt;/strong&gt;: Because the penalty is applied indiscriminately, at the weight $\alpha$ we use the model cannot afford the token budget required to solve hard problems. The gradient signal becomes dominated by the length penalty, forcing the model to collapse to empty responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Group-Based Penalty (Green) - The Race to the Bottom&lt;/strong&gt;: While initially stable, the relative nature of this penalty creates a moving target. As the batch becomes more efficient, the definition of “long” shifts downwards. The model chases this moving target until it optimizes away the reasoning chain entirely, leading to a total collapse in capability (score drops to 0).
This may be why the Kimi K1.5 authors note that the length reward slows down training initially, leading them to gradually warm up the length penalty over the course of training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anchor Penalty (Teal) - Stable Efficiency&lt;/strong&gt;: By contrast, the Anchor Penalty successfully stabilizes the length. It arrests the heavy-tail drift seen in the control run but does not force the model below its necessary reasoning baseline. It effectively locks in the efficiency of the early training stages without suffocating exploration.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/penalty-comparison-length.svg&quot; alt=&quot;length-penalty-comparison-length&quot; /&gt;
&lt;img src=&quot;/assets/images/posts/2025-12-21-understanding-length-dynamics/penalty-comparison-score.svg&quot; alt=&quot;length-penalty-comparison-reward&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quantitative comparison on easy vs. hard problems:&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Checkpoint&lt;/th&gt;
      &lt;th&gt;Penalty&lt;/th&gt;
      &lt;th&gt;Easy Pass Rate&lt;/th&gt;
      &lt;th&gt;Easy Avg Length&lt;/th&gt;
      &lt;th&gt;Hard Pass Rate&lt;/th&gt;
      &lt;th&gt;Hard Avg Length&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Step 100&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;95.1%&lt;/td&gt;
      &lt;td&gt;530&lt;/td&gt;
      &lt;td&gt;2.3%&lt;/td&gt;
      &lt;td&gt;1039&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Step 1000&lt;/td&gt;
      &lt;td&gt;None&lt;/td&gt;
      &lt;td&gt;97.0%&lt;/td&gt;
      &lt;td&gt;661&lt;/td&gt;
      &lt;td&gt;5.2%&lt;/td&gt;
      &lt;td&gt;1288&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Step 1000&lt;/td&gt;
      &lt;td&gt;Anchor ($\gamma=0.01$)&lt;/td&gt;
      &lt;td&gt;97.0%&lt;/td&gt;
      &lt;td&gt;563&lt;/td&gt;
      &lt;td&gt;4.6%&lt;/td&gt;
      &lt;td&gt;958&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Step 1000&lt;/td&gt;
      &lt;td&gt;Anchor ($\gamma=0.1$)&lt;/td&gt;
      &lt;td&gt;96.5%&lt;/td&gt;
      &lt;td&gt;533&lt;/td&gt;
      &lt;td&gt;4.5%&lt;/td&gt;
      &lt;td&gt;1067&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;em&gt;All results are computed as the average of 64 samples per prompt.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Anchor Penalty with $\gamma=0.01$ reduces easy-problem length by 15% (661→563 tokens) and hard-problem length by 26% (1288→958 tokens) while maintaining accuracy comparable to the no-penalty baseline.&lt;/p&gt;
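&lt;p&gt;These percentages follow directly from the step-1000 rows of the table above:&lt;/p&gt;

```python
# Relative length reductions for Anchor (gamma=0.01) vs. the
# no-penalty baseline at step 1000, using values from the table.
easy_reduction = (661 - 563) / 661    # about 0.148, i.e. roughly 15%
hard_reduction = (1288 - 958) / 1288  # about 0.256, i.e. roughly 26%
```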

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;Our empirical investigation reveals &lt;strong&gt;dataset difficulty composition as the primary factor determining length dynamics&lt;/strong&gt; under our Zero-RL setup.&lt;/p&gt;

&lt;h3 id=&quot;findings&quot;&gt;Findings&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Datasets with narrow difficulty range (MATH, Science) → Length convergence&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Models trained on datasets whose difficulty is well matched to the model’s capability converge to stable output lengths&lt;/li&gt;
      &lt;li&gt;Neither KL divergence nor GRPO length bias explains this behavior&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Mixed-difficulty datasets with a heavy hard tail (BigMath) → Length growth&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Datasets with difficulty imbalance produce U-shaped length dynamics&lt;/li&gt;
      &lt;li&gt;Removing hard problems eliminates consistent length growth&lt;/li&gt;
      &lt;li&gt;Hard problems drive the overall length increase&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;The mechanism: Moving reward distributions cause asymmetric risk&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Hard problems get reward gains from longer reasoning (1039→1288 tokens, +249 tokens)&lt;/li&gt;
      &lt;li&gt;This policy shift affects all problems broadly (easy problems: 530→661 tokens, +131 tokens)&lt;/li&gt;
      &lt;li&gt;Models learn to maintain correctness on easy problems at longer lengths&lt;/li&gt;
      &lt;li&gt;The empirical reward distribution for easy problems flattens to accommodate this shift&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Difficulty-aware length penalties can mitigate overthinking without harming performance&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Uniform penalties cause exploration collapse by taxing all reasoning equally&lt;/li&gt;
      &lt;li&gt;Group-based adaptive penalties create unstable moving targets that might lead to capability collapse&lt;/li&gt;
      &lt;li&gt;Anchor-based penalties successfully stabilize length by locking to early-training efficiency&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;limitations-and-future-work&quot;&gt;Limitations and Future Work&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Model families and contamination:&lt;/strong&gt; Our study uses the Qwen model family, with Qwen3-4B as the base model in most experiments. Recent work has noted that some Qwen variants may exhibit training-data leakage on math benchmarks, and our own case studies show examples where the model reaches the correct final answer with flawed intermediate reasoning. Our analysis therefore focuses on answer-level correctness and length dynamics for this particular model under binary verifier rewards. We do not claim that the observed behavior necessarily reflects faithful mathematical reasoning, nor that it generalizes to all model families.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Length penalty applicability:&lt;/strong&gt; The anchor penalty assumes that models start at reasonable output lengths and grow longer during training. It may not suit models that are verbose from the start, such as those initialized from checkpoints distilled on long chain-of-thought traces: the anchor would then be frozen at an already-inflated length and provide no efficiency gains. In that regime, a pre-set reference length like the one proposed in the LASER paper [11] would be a better substitute. Our experiments use base models trained with RL from scratch (the R1-Zero paradigm); a natural next step is evaluating these penalty mechanisms on supervised fine-tuned (SFT) models, which may exhibit different initial length distributions. We plan to investigate this in the next phase of our experiments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Theoretical understanding:&lt;/strong&gt; While we’ve established the empirical patterns of length dynamics, the underlying mechanisms require further theoretical investigation. Key open questions include understanding why reward distributions exhibit inverted-U shapes across different lengths, how GRPO’s mode-matching behavior interacts with these reward curves, and whether we can develop predictive models for length dynamics on mixed-difficulty datasets.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1] DeepSeek-AI. (2025). &lt;em&gt;DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning&lt;/em&gt;. Technical Report. https://arxiv.org/abs/2501.12948&lt;/p&gt;

&lt;p&gt;[2] DeepScaleR Team. (2025). &lt;em&gt;DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL&lt;/em&gt;. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2&lt;/p&gt;

&lt;p&gt;[3] HKUST NLP Group. (2025). &lt;em&gt;Simple-RL-Zero: Replicating DeepSeek-R1-Zero’s Reasoning with Reinforcement Learning&lt;/em&gt;. https://hkust-nlp.notion.site/simplerl-reason&lt;/p&gt;

&lt;p&gt;[4] Zhang, Y., et al. (2025). &lt;em&gt;When More is Less: Understanding Chain-of-Thought Length in LLMs&lt;/em&gt;. arXiv preprint.&lt;/p&gt;

&lt;p&gt;[5] NVIDIA. (2024). &lt;em&gt;Nemotron-4 340B Technical Report&lt;/em&gt;. https://arxiv.org/abs/2406.11704&lt;/p&gt;

&lt;p&gt;[6] Shao, Z., et al. (2024). &lt;em&gt;DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models&lt;/em&gt;. arXiv:2402.03300. https://arxiv.org/abs/2402.03300&lt;/p&gt;

&lt;p&gt;[7] Yu, Q., et al. (2025). &lt;em&gt;DAPO: An Open-Source LLM Reinforcement Learning System at Scale&lt;/em&gt;. arXiv:2503.14476. https://arxiv.org/abs/2503.14476&lt;/p&gt;

&lt;p&gt;[8] Luo, H., et al. (2025). &lt;em&gt;BigMath: A Large-Scale High-Quality Mathematical Dataset for Enhanced Reasoning&lt;/em&gt;. arXiv:2502.17387. https://arxiv.org/abs/2502.17387&lt;/p&gt;

&lt;p&gt;[9] Wei, Z., et al. (2025). &lt;em&gt;Concise Reasoning via Reinforcement Learning&lt;/em&gt;. arXiv:2504.05185. https://arxiv.org/abs/2504.05185&lt;/p&gt;

&lt;p&gt;[10] Moonshot AI. (2025). &lt;em&gt;Kimi k1.5: Scaling Reinforcement Learning with LLMs&lt;/em&gt;. Technical Report. https://github.com/MoonshotAI/Kimi-k1.5&lt;/p&gt;

&lt;p&gt;[11] Liu, W., Zhou, R., Deng, Y., Huang, Y., Liu, J., Deng, Y., Zhang, Y., and He, J. (2025). &lt;em&gt;Learn to Reason Efficiently with Adaptive Length-based Reward Shaping&lt;/em&gt;. arXiv:2505.15612. https://arxiv.org/abs/2505.15612&lt;/p&gt;

</description>
        <pubDate>Sun, 21 Dec 2025 00:00:00 -0600</pubDate>
        <link>https://zoeyli.com/reinforcement%20learning/research/understanding-length-dynamics/</link>
        <guid isPermaLink="true">https://zoeyli.com/reinforcement%20learning/research/understanding-length-dynamics/</guid>
      </item>
    
  </channel>
</rss>
