[{"content":"A year ago, RL for LLMs was almost entirely about reasoning \u0026ndash; math, code, short single-turn problems with a clean verifier. The whole loop fit in one picture: model generates a response, environment scores it, optimizer updates the weights. Research focus was algorithmic (GRPO, DAPO, GSPO, Dr.GRPO), while infra was mostly an afterthought.\nAgentic RL breaks that picture. ROLL puts it well in their report: RLVR trains models that \u0026ldquo;can answer\u0026rdquo;; agentic RL trains models that \u0026ldquo;can act.\u0026rdquo; The model no longer produces a single response \u0026ndash; it lives inside a harness, calls tools, reads observations, spawns sub-agents, interacts with agent swarms, manages its own context, and finishes (or doesn\u0026rsquo;t) anywhere from minutes to days later.\nThe result is no longer an algorithm problem. It\u0026rsquo;s a systems problem. In agentic RL, rollout typically eats 80-90% of wall time, and the bottleneck moves from \u0026ldquo;how do I update the policy\u0026rdquo; to \u0026ldquo;how do I keep GPUs busy while a long, flaky, multi-turn trajectory finishes.\u0026rdquo; This post walks through the paradigm shift, tours four recent frameworks pushing on it (Forge, ROLL, SkyRL, Slime), and ends with our own take, Polar.\nWhat Actually Changed?\nA useful starting point is the objective Forge states up front:\nmax J(θ) = Throughput(A) × Sample-Efficiency(A)\nThroughput is the raw tokens-per-second the system pushes through rollout, training, and I/O. Sample efficiency is how much each trajectory teaches the model \u0026ndash; a function of data freshness, off-policyness, and noise. Every design choice in agentic RL infra is a trade-off between the two. Push throughput too hard and you go off-policy. Stay strictly on-policy and one long tail blocks the whole farm.\nA few concrete things change once you accept this objective:\nRollouts dominate the wall clock. Generating tokens is the smaller part of a rollout. 
The bigger parts are tool execution, sandbox boot, file I/O, sub-agent retries, and context management between turns. Two tasks that look identical can differ by 10-100× in completion time.\nLong tails are normal, not pathological. Some episodes finish in seconds. Some get stuck in retry loops or run into a slow tool and stretch to hours. Strict-sync FIFO turns one straggler into a global stall.\nAsync is no longer optional, but raw greedy async breaks training. You have to actively manage staleness, distribution drift, and stability.\nReward becomes sparse, gameable, and hard to verify. There is no clean \u0026ldquo;math answer correct\u0026rdquo; signal. An outcome-only 0/1 reward on a coding agent is essentially noise for the first few thousand steps. Add naïve shaping and the model finds a way to hack it.\nThe agent harness is increasingly a black box. The harnesses most worth training against \u0026ndash; Claude Code, Codex, Qwen Code, Gemini CLI \u0026ndash; are closed binaries you can\u0026rsquo;t open and rewrite into env.step().\nToken-In-Token-Out (TITO) drift is real. Decode-then-re-encode round-trips through text quietly diverge from the original token IDs (interstitial whitespace, special tokens, tokenizer edge cases). If the trainer optimizes a token sequence that isn\u0026rsquo;t the one the rollout policy actually sampled, your gradient is wrong.\nAll of these point in the same direction: the training loop and the agent loop have to live in different processes, talk through a stable interface, and never assume the other side is a known quantity.\nA Tour of Agent RL Frameworks\nLet\u0026rsquo;s walk through four of the most representative publicly documented frameworks \u0026ndash; Forge (MiniMax), ROLL (Alibaba), SkyRL-Agent (NovaSky AI / Berkeley), and Slime (Zhipu) \u0026ndash; each attacking the same problem from a different angle. 
Each is worth understanding on its own terms.\nForge: Treat the Agent as a Trajectory Producer\nAlthough Forge is a private framework, its tech report showcases a move to redefine what an agent is to the training system. It\u0026rsquo;s not a class inside the RL framework that gets step()-ed on. It\u0026rsquo;s an independent Trajectory Producer running on the outside; the RL system just consumes whatever messages and rewards come back.\nThe architecture has three layers: an Agent (white-box or black-box, producing trajectories), a Middleware layer (a Gateway Server for standardized LLM traffic and a Data Pool collecting completions and rewards asynchronously), and Engines (the rollout engine and train engine, syncing weights periodically).\nForge\u0026rsquo;s three-layer decoupling: Agent / Middleware / Engines.\nPure FIFO blocks on the head of the queue; pure greedy async lets the trainer over-consume short, fast samples and drift the distribution. Forge introduces a sliding window [H, H+W] \u0026ndash; inside the window, whoever finishes first gets consumed; outside, you wait. It\u0026rsquo;s not really a scheduler; it\u0026rsquo;s a sample distribution regulator.\nForge also takes the sparse-reward problem seriously. A real coding session can span thousands of trajectory turns; an outcome-only 0/1 reward produces no learnable gradient for ages. Forge composes process rewards (penalize broken tool calls, dense intermediate signal), task-completion-time rewards (encourage shorter execution paths), and the final outcome. In practice, every reward component needs a cap. We\u0026rsquo;ve seen runs where a single uncapped shaping term silently swallows the outcome signal \u0026ndash; the model just learns to maximize the proxy. A reasonable default we landed on: outcome dominates, intermediate checkpoints contribute +0.1 each, process total capped at 0.5.\nROLL: Treat the Whole Pipeline as an Async Distributed System\nROLL goes all-in on async. 
Its central claim, in CLI-Native Mode, is that the training framework should not reimplement the agent at all \u0026ndash; it should consume whatever an existing agent runtime (e.g. iFlow CLI) produces.\nROLL breaks one RL iteration \u0026ndash; rollout, env interaction, reward, loss, update \u0026ndash; into stages that run on independently scheduled workers. Trajectories flow into a sample buffer; the trainer pulls batches and applies an asynchronous ratio to cap how stale a sample can be relative to the current policy. Out-of-budget samples get dropped.\nROLL Agentic Learning Ecosystem — Train / Sandbox / CLI separation, async with sample buffer.\nA second move, train-rollout multiplexing, dynamically reassigns GPUs across the boundary. If rollout becomes the bottleneck, GPUs slide from training to rollout; once the sample buffer fills up, they slide back. Static 50/50 splits leave one side idle.\nThe third, Chunked MDP + IPA (Interaction-Perceptive Agentic Policy Optimization), is more algorithmic but lands in the infra. ROLL argues that token-level optimization is too fine (most tokens don\u0026rsquo;t change environment state) and sequence-level is too coarse (mixes many decisions into one credit). They cut the trajectory into interaction chunks \u0026ndash; the contiguous tokens between two environment interactions \u0026ndash; and propagate return, importance ratio, and mismatch masking at chunk granularity.\nROLL architecture \u0026ndash; Async pipelines and Train-Rollout Multiplexing.\nROLL also treats environment cleanliness as part of reward correctness, not engineering hygiene. In one of their early synthetic harnesses, false-positive rate hit ~40%: a task asked for a full git push to a webserver, but the test script only checked whether some URL returned \u0026quot;hello world\u0026quot;, so the agent learned to echo it directly into the doc root and call the task done. 
ROLL responded with three checks at task admission \u0026ndash; an LLM-as-judge sanity pass, a ground-truth pass (golden solution must pass the test), and a no-op pass (doing nothing should not pass the test). If your verifier isn\u0026rsquo;t reliable, the policy will optimize against the verifier\u0026rsquo;s holes, not the task.\nSkyRL: Pipeline the Stages, Unify Through Tools SkyRL-Agent attacks the same throughput-vs-stability tension from a different angle: rather than redefining the agent boundary or going fully async, it focuses on fine-grained pipeline scheduling across heterogeneous rollout stages, paired with a tool-centric unified agent loop that keeps every framework concern reachable through one abstraction.\nThe starting observation is that a multi-turn rollout is not one job. It\u0026rsquo;s at least three: runtime initialization (cold-starting a sandbox or container \u0026ndash; CPU-bound and slow), agent run (the LLM-and-tool loop itself \u0026ndash; mixed CPU/GPU), and reward calculation (run tests, score outputs \u0026ndash; often CPU-bound and long-tailed). Naive async batching launches them all at once and waits on the slowest; bounded batching caps concurrency but still serializes the stages within each trajectory. Both leave the GPU idle for stretches.\nSkyRL\u0026rsquo;s Async Pipeline dispatcher overlaps these three stages across trajectories through three bounded queues of configurable size. While trajectory i is in its CPU-bound reward stage, trajectory j can already be running its LLM forward pass on the same GPU. 
The reported numbers: 1.55× speedup over naive bounded async batching and ~90% sustained GPU utilization during generation \u0026ndash; exactly the \u0026ldquo;kill the CPU/GPU bubbles\u0026rdquo; win anyone who has profiled long-horizon rollouts will recognize.\nSkyRL-Agent\u0026rsquo;s three-stage decomposition (INIT / RUN / REWARD) and Async Pipeline dispatching.\nTwo more design choices worth flagging:\nTool-centric unified agent loop. Every agent action \u0026ndash; stateless calls (python interpreter), environment-modifying calls (file editor), and even agent-state-modifying operations like summarization or history truncation \u0026ndash; is registered through the same BaseTool interface. This puts context management inside the action space rather than as an out-of-band hack, the same insight Forge reaches independently. It also means a new task is just a few tool registrations, with the main agent loop untouched.\nTransition-based trajectory representation. SkyRL replaces conventional mask-based loss construction with a (o_t, a_t, r_t) transition tuple captured by a lightweight @record_transitions decorator. Each LLM invocation gets recorded with its input tokens, output tokens, and log probabilities; post_process then aggregates them into a backend-agnostic format consumable by SkyRL-train, VeRL, or Tinker. The benefit is twofold \u0026ndash; it sidesteps the brittleness of mask-based concatenation when context gets compacted between turns (the \u0026ldquo;one continuous text sequence\u0026rdquo; assumption breaks the moment you summarize), and it gives token-level fidelity for free: the inference engine\u0026rsquo;s actual sampled IDs and logprobs are what the trainer optimizes, not a re-tokenized transcript.\nSkyRL also lands a recipe-level lesson worth absorbing. When they trained SA-SWE-32B (Qwen3-32B → 39.4% on SWE-Bench Verified, with \u0026gt;2× cost reduction over comparable baselines), they noted that better tools matter more than more steps. 
A bash-only setup spent most of its rollout budget thrashing through grep and view; replacing it with an AST-aware code search tool dramatically improved both rollout pass@K and sample efficiency. Reward shaping fixes credit assignment; tool design fixes the action distribution itself. The two are not interchangeable.\nSlime: Fully Async, Just Control the Drift\nSlime takes the most radical position: don\u0026rsquo;t try to be sync, don\u0026rsquo;t even pretend. Inference engines stream trajectories continuously, the training engine refreshes weights on its own beat, and rollout weights periodically resync to whatever training is on.\nThe interesting question becomes: when a single trajectory can span multiple policy versions, how do you contain the off-policy error?\nSlime\u0026rsquo;s answer is engineering-flavored: don\u0026rsquo;t try to reconstruct the exact π_old per token (which would require tracking every historical checkpoint). Instead, log the log-probability at sampling time as the behavior signal, then compute importance ratios against the current policy at training time. It\u0026rsquo;s an approximation, but it\u0026rsquo;s the right approximation for an async setting.\nOn top of that:\nDirect double-sided importance clipping [1 - ε_l, 1 + ε_h]. Tokens with ratio outside the window get masked entirely rather than scaled \u0026ndash; a stricter version of DAPO\u0026rsquo;s Clip-Higher (e.g. raise the upper clip from 0.2 → 0.28 to give low-probability exploratory tokens more room).\nSample-level staleness threshold on top of token-level clipping. If a response was generated by a policy too many versions behind the trainer, drop it. Local drift gets handled by clipping; global staleness gets handled by sample drop.\nPrefill-Decode Disaggregation. Long prefills and short decodes shouldn\u0026rsquo;t share the same cluster \u0026ndash; in multi-turn agent traffic, prefills pile up and starve the decode side. 
Slime separates them physically, and trajectory completion time stabilizes.\nTITO Gateway records the actual token IDs sampled by the inference backend rather than re-tokenizing the text after the fact. Combined with FP8 rollout inference, this single decision removes a whole category of silent gradient corruption.\nSlime\u0026rsquo;s architecture overview\nConclusion: Common Patterns Worth Taking\nA handful of things cut across all four, and have been reinforced by my own experiments:\nDo not mask or drop thinking tokens \u0026ndash; that\u0026rsquo;s where the model\u0026rsquo;s value actually lives.\nMask tool observations, including tool errors; otherwise the trained model can learn to repeat those errors.\nToken-level loss aggregation (DAPO-style) is more stable than sequence-level on long-running trajectories.\nGRPO has a length bias \u0026ndash; longer responses get implicitly up-weighted by mean/std normalization. Dr.GRPO drops the length term and recovers an unbiased baseline.\nSandbox container pooling matters more than you think. Reducing CPU-bound runtime preparation steps usually gives a 3-5× speedup in our setups.\nRed-team the reward before launch. Pre-run separate experiments that encourage reward hacking against your reward function to surface possible hacking patterns. During training, sample rollouts over time and run an LLM-as-judge for hack-pattern detection. It catches things the reward curve hides.\nPolar: When the Harness is a Black Box\nIntroducing my recent work, Polar. The frameworks above all assume some level of control over the agent \u0026ndash; the Trajectory Producer in Forge, the CLI runtime in ROLL. But many of the most interesting harnesses to train against today are closed binaries: Claude Code, Codex, Qwen Code, Gemini CLI. You can\u0026rsquo;t wrap them in a predefined Python API. 
You can\u0026rsquo;t even necessarily intercept their internal event loop.\nPolar (our recent work) starts from this constraint: can we train agents with RL without opening the box?\nThe observation is simple. Every LLM-based agent, no matter how complex its internals, has to talk to a model. The model API endpoint is a universal interface that exists outside the harness. So instead of integrating into the harness, Polar places a provider-compatible proxy at the LLM API boundary. The harness runs unchanged, makes its normal Anthropic / OpenAI / Google-shaped calls, and the proxy quietly records prompts, sampled token IDs, log probabilities, and responses on the way through.\nPolar — proxy at the LLM API boundary, rollout server + gateway nodes with INIT / RUN / POSTRUN worker pools.\nAfter execution, Polar reconstructs trainer-ready trajectories from the captured completions. Two builders are provided. per_request keeps each model call as one trace \u0026ndash; lossless but fragments a long session into many short samples. prefix_merging recovers append-only conversation chains and emits longer, contiguous traces; sub-agents, parallel branches, and context compactions naturally fall into separate chains. Within each merged chain, only sampled assistant tokens are copied as trainable, while canonical interstitial tokens are masked \u0026ndash; behavior-policy fidelity preserved, trainer-facing samples cut down.\nOn the infra side, Polar separates rollout submission, runtime initialization, harness execution, trajectory reconstruction, and evaluation into independent worker pools (INIT, RUNNING, POSTRUN) per gateway node. Runtime prewarm and long-tail evaluation run off the GPU-critical path, so CPU-heavy container boot doesn\u0026rsquo;t block the next agent run. 
Trainers consume the resulting trajectories asynchronously, completely agnostic to whatever harness produced them.\nQwen3.5-4B reward over RL steps, rollout on different harnesses\nWe validated this on SWE-Bench Verified, starting from the same Qwen3.5-4B base checkpoint and running standard GRPO over four real coding harnesses:\nHarness | Base | Polar RL | Gain\nCodex | 3.8% | 26.4% | +22.6\nClaude Code | 29.8% | 34.6% | +4.8\nQwen Code | 34.6% | 35.2% | +0.6\nPi | 34.2% | 40.4% | +6.2\nThe biggest gain is on Codex \u0026ndash; a harness whose tool schema is far from the base model\u0026rsquo;s native priors, so RL has the most adaptation room. prefix_merging over per_request also delivered a 5.39× wall-clock speedup on the trainer side, since fewer fragmented updates mean rollout GPUs stay above 87% utilization instead of bouncing on context switches.\nWhat Next-Gen Agent RL Infra Looks Like\nPulling everything together, the design space for the next generation seems to converge on a few principles:\nThe integration boundary is the model API, not env.step(). Any infra that requires the harness to be ported into a Python class will lose to one that listens at the LLM endpoint, because the harnesses worth training against are increasingly closed.\nAsync by default, with staleness made explicit. Pick your tools \u0026ndash; async ratio, importance clipping, sample drop, version-tagged buffers \u0026ndash; but pick something. Hidden staleness is hidden bias.\nProcess rewards / PRMs become first-class. Pure outcome reward on long trajectories is too sparse to learn from and too easy to hack. Dense intermediate signal (capped!) is the difference between a run that converges and a run that doesn\u0026rsquo;t.\nEnvironment cleanliness is part of reward correctness. False positives, leaked test files, cached artifacts \u0026ndash; the agent will find them all. Reward is only as trustworthy as the environment it runs in.\nKV cache and speculative decoding across shared resources. 
Global KV pools, group-aware spec decoding, and PD disaggregation move from rollout micro-optimizations to first-class infra primitives.\nToken fidelity end-to-end. The trainer must optimize the tokens that the rollout policy actually sampled. Anything that round-trips through text decode/encode will silently corrupt the gradient.\nThe clean separation we\u0026rsquo;re starting to see \u0026ndash; harness on the outside, model API as the seam, rollout-as-a-service in the middle, training engine just consuming what comes back \u0026ndash; feels like the right factoring. As we approach the data wall of imitation learning, optimizing Agentic RL infra is the new scaling law towards real world intelligence.\nCitation Xu, Binfeng. \u0026#34;Rethinking RL Infra for Agents\u0026#34;. B\u0026#39;Log (May 2026). https://billxbf.github.io/posts/agent-rl-infra/ ","permalink":"https://billxbf.github.io/posts/agent-rl-infra/","summary":"Why the agentic shift breaks classical RL infra, a tour of Forge, ROLL, SkyRL and Slime, and my recent take with Polar (Agentic RL on Any Harness at Scale).","title":"Rethinking RL Infra for Agents"},{"content":"Modern AI agents are typically scaffolded with a runtime sandbox, and these Computer-Use Agents (CUA) autonomously run code, use the terminal, take notes, and access the Internet and MCPs \u0026ndash; exactly like humans do when interacting with the digital world.\nYet the underlying reasoning and practices remain unclear to most, so let\u0026rsquo;s dive into popular agent scaffolds like Claude Code and MiniMax Agent, demystifying the design principles and discovering how agents benefit from using a computer.\nWhy Sandbox? In short, Context Delegation and Runtime Isolation.\nContext Delegation. Even with memory-efficient tricks like KV caching and Linear Attention architectures, long-context reasoning is still a nontrivial challenge to AI agents. 
Production AI agents are usually packed with bloated tool descriptions and system prompts to cover more case handling and example following. Context length can easily grow to 100k or over 1M in multi-turn agentic trajectories \u0026ndash; especially when tool responses contain extraneous data. Growing contexts dilute attention and cause huge memory burden to both training and inference.\nIntuitively, you\u0026rsquo;d want to move episodic context out of the main agent loop using:\nSubagent: Spin up subtasks to invoke another agent. Without sharing a full context, subagents can easily duplicate work, wasting excessive tokens, while over-engineered orchestration introduces inductive bias. Filesystem: Use the filesystem to keep todo lists, agent memory, and conditional instructions (aka Agent Skill) that the agent chooses to load into the context when necessary. Sandbox Filesystem is an agent\u0026rsquo;s extended context through computer-use. Runtime Isolation. Sandboxing keeps all agent actions in a tightly controlled environment, protecting the host system by isolating potential errors and de-risking real user data or resources. That isolation is especially critical when executing popular \u0026lt;python\u0026gt; and \u0026lt;browser\u0026gt; tools.\nCursor (Agent mode) running commands in a secure sandbox\nClaude Code \u0026ndash; More than just Code Claude Code has gained huge traction in 2025. It was originally built for coding assistance, but Anthropic is clearly steering it toward more general agent use cases. As highlighted in recent podcasts, Claude Code already excels at, or can be extended to, deep research and vertical specialist agents. Andrej has also been tweeting mini projects like agentic home control in the physical world.\nClaude Code scaffolding. The agent controls a virtual sandbox with Bash and coding. Context delegated to filesystem (Skills). 
Tools executed outside the sandbox are implemented as MCP.\nAgentic Computer Use What\u0026rsquo;s the magic here? Metaphorically, an AI Agent equipped with a Filesystem and Runtime Environment is just a human using a computer \u0026ndash; the primary, if not only interface connecting humans to the Digital World. Imagine the action space. ¯_(ツ)_/¯ With that lens, scaffolding an agent turns into OS design:\nConfiguring a Linux Docker / VM, or even creating a completely native OS from scratch (yes, there are a few working on it). Designing the Filesystem within the sandbox. Pre-install some \u0026ldquo;Apps\u0026rdquo; for the agents \u0026ndash; a Terminal, a Browser, a writing pad \u0026hellip; and place some instructions for the agent to refer to before use (Anthropic names it SKILL.md). Meanwhile, give agents a \u0026ldquo;shortcut\u0026rdquo; to invoke outside endpoints \u0026ndash; like credential-related database querying. These fit into the bucket of traditional function-calling or MCP. Claude Code System Prompt Anthropic still doesn\u0026rsquo;t publicly share Claude Code\u0026rsquo;s system prompt and tool implementations in its Claude Agent SDK. However, there\u0026rsquo;s an interesting thread where people trace and hack Claude Code\u0026rsquo;s request \u0026amp; response, and reverse engineer the system prompt and its commit diff across versions. From that prompt, we can clearly find some sparks and lessons in agent tool implementations.\nFor example, Claude Code doesn\u0026rsquo;t use an interactive browser like browser-use does by default. Instead it splits capabilities into separate WebSearch and WebFetch tools, likely due to speed \u0026amp; efficiency concerns.\nSummary of Claude Code\u0026rsquo;s tools from the hacked system prompt.\nClaude Code Filesystem When locally deployed, Claude Code uses ~/.claude/ for its own sandbox workspace, and it mounts your work project under ~/.claude/projects/. 
The scaffold keeps plugins (MCP integrations) and skills in different directories, and also tracks TODO, personalization configs, and metadata like command history and debug logs.\nClaude Code\u0026rsquo;s working filesystem.\nMiniMax Agent MiniMax has been the most surprising AI lab for me this year. It ships the #1 (as of Dec 2025) OSS model on LMArena WebDev at 229B weights, which is considered \u0026ldquo;small\u0026rdquo; among parallel flagship models. Meanwhile, the agent scaffold, MiniMax Agent creates surprisingly good apps and reports with prolonged reasoning and autonomous execution. Here\u0026rsquo;s my favorite trajectory replay.\nMiniMax Agent using browser for autonomous app creation (Netflix Clone)\nMiniMax Agent has a UI component that lets you navigate the agent\u0026rsquo;s sandbox filesystem in realtime, and you can prompt the agent to summarize its initial filesystem state. The workspace is a Python-based development environment designed for AI agents with integrated external API access capabilities. Rich modular data sources connect to various third-party APIs including Twitter, Yahoo Finance, TripAdvisor, Pinterest, Patents, Scholar, Commodities, Metals, and Booking.com. None of these live inside the system prompt, which lets the agent fully leverage context delegation in the sandbox.\nMiniMax Agent\u0026rsquo;s working filesystem.\nOther Computer-use Agents Major AI labs have all been building Computer-use Agents, each guided by its own product philosophy. OpenAI seems to have made great progress on Operator and ChatGPT Atlas. I really enjoy the visual presentation and dynamics of the agent sandbox at runtime.\nOpenAI Agent using Terminal\nWhen you prompt Operator to describe its filesystem, you\u0026rsquo;ll observe a complete Linux VM with two working directories \u0026ndash; /home/oai containing session data and /openai storing internal \u0026ldquo;Skills\u0026rdquo;. 
From ChatGPT\u0026rsquo;s self-reported manifest, the only skill installed is a Browser.\nOpenAI Agent filesystem\nBeyond OpenAI, numerous teams are pushing here as well. For example, Google AI Studio (Build) can build and test apps from ideas with advanced multimodal capabilities from Gemini, and Manus orchestrates a huge number of subagents and services like browser-use to max out the agent action space. I\u0026rsquo;ll leave the rest of the exploration to readers due to limited time.\nAgentic RL\nAgent, Action (tools) and Environment (scaffold) are three tightly bound concepts in the traditional literature. Consistency between the training (rollout) Gym and the inference scaffold is critical to maintaining agent performance. However, this consistency has become a luxury since model providers won\u0026rsquo;t expose training infra, which usually leaves downstream agent builders with the tedious engineering work of guessing the \u0026ldquo;right\u0026rdquo; scaffolding and orchestration.\nMost LLMs are trained with vanilla sandbox scaffolds to support evaluations like Terminal-Bench, yet a language model doesn\u0026rsquo;t natively \u0026ldquo;know\u0026rdquo; how to use your sandbox \u0026ndash; especially when you want to customize the tools and \u0026ldquo;Skills\u0026rdquo;. In this case, RL becomes an effective, data-efficient method for your last mile.\nRollout Infra\nOne key challenge in Agentic RL is to build stable and efficient rollout infra. Adding a sandbox and tools like a browser makes asynchronous runtime efficiency, state management, and security harder. These rollouts are usually orders of magnitude more expensive (in time and effort) than non-agentic RL like Math CoT. A nice starting point is the recently released OpenHands V1. The architecture of decoupled modules (abstraction, tools, sandbox, and server) provides solid coordination with reusable modules across scaffolding and rollout serving. 
PyTorch OpenEnv also provides a nice Gymnasium-style endpoint over commonly used agent Docker environments.\nOpenHands V1 with decoupled modules reusable across rollout and scaffolding.\nReward Design and Training Recipe\nAnother challenge in Agentic RL is defining the reward function. Computer-use agents usually tackle open-ended tasks like research and app building, which lack the unbiased, verifiable scoring mechanisms available in math and coding. Meanwhile, vanilla use of LLM-as-judge to generate rewards can easily trap the policy in a distribution that is adversarial to the teacher (reward hacking). This is also confirmed in Andrej\u0026rsquo;s recent interview \u0026ldquo;RL is terrible\u0026rdquo;. An interesting approach to mitigating reward hacking in open-ended research comes from AI2\u0026rsquo;s DR Tulu, where rubrics are buffered and generated on the fly alongside the policy. The algorithm setups used for Agentic RL generally follow the lessons from long-context RL in LLMs, such as broadening exploration in prolonged steps.\nReinforcement Learning with Evolving Rubrics (RLER)\nConclusion\nThis post showcases the state of Computer-use Agents \u0026ndash; from product to design principles. It explains the reasoning behind, and the importance of, the runtime sandbox in AI Agents and briefly introduces the engineering and training challenges. 2026 will be the year of Computer-use Agents \u0026ndash; with focus shifting from Prompt Engineering to Sandbox Engineering and RL on custom scaffolds ;)\nCitation\nXu, Binfeng. \u0026#34;Demystifying Agent Sandbox\u0026#34;. B\u0026#39;Log (Dec 2025). 
https://billxbf.github.io/posts/demystify-agent-sandbox/ ","permalink":"https://billxbf.github.io/posts/demystify-agent-sandbox/","summary":"\u003cp\u003eModern AI agents are typically scaffolded with a runtime sandbox, and these Computer-Use Agents (CUA) autonomously run code, use the terminal, take notes, and access the Internet and MCPs \u0026ndash; exactly like humans do when interacting with the digital world.\u003c/p\u003e\n\u003cp\u003eYet the underlying reasoning and practices remain unclear to most, so let\u0026rsquo;s dive into popular agent scaffolds like \u003ca href=\"https://www.anthropic.com/engineering/claude-code-best-practices\"\u003eClaude Code\u003c/a\u003e and \u003ca href=\"https://agent.minimax.io/\"\u003eMiniMax Agent\u003c/a\u003e, demystifying the design principles and discovering how agents benefit from using a computer.\u003c/p\u003e","title":"Demystifying Agent Sandbox"},{"content":"This is a fairly procrastinated start to my personal blog. Starting a blog isn’t as easy as it seems—I don’t want to waste people’s time with casual anecdotes. Meanwhile, an overly formal academic write-up would likely be overkill and scare people away.\nThere are many people who truly enjoy machine learning and find joy in sharing knowledge. I’ve been a long-time follower of AI/tech blogs from Andrej Karpathy, Lilian Weng, Yao Fu, and others. I usually prefer blogs over papers because blogs feel more honest and less AI-polished (or written to attract citations). Yet almost everyone I followed stopped posting in early 2025. I understand the shifts and hype in SF lately that keep everyone busy building and/or financially free. Still, I’d be sad if this vibe disappears—it’s been truly helpful to me over the past few years, along with many others.\nI’ve been more of a builder than a writer, hesitant to publish conclusions without solid ground. 
But lately I’ve found it really fruitful to do deep dives into the state of things and the literature before engineering. A blog of “study notes” not only helps me organize thoughts, but also carries on that missing vibe of sharing knowledge in a blunt and unpretentious way. And so I finally made up my mind to start writing, and I truly hope you enjoy it. Stay tuned and keep the spark going!\nWhat to Expect\nI\u0026rsquo;ll be writing about:\nTech Hacks: Most of the time, truth in the tech world doesn\u0026rsquo;t reside in reports or posts but is hidden in code. I enjoy digging these gems out to see the real flow.\nDifferent AI Research: I continuously keep track of frontier AI research and especially favor non-incremental work, like new architectures to model intelligence.\nStories and Gossip: The true SOTA usually comes from stories and gossip shared through friends, meetups, and conferences. ","permalink":"https://billxbf.github.io/posts/hello_world/","summary":"\u003cp\u003eThis is a fairly procrastinated start to my personal blog. Starting a blog isn’t as easy as it seems—I don’t want to waste people’s time with casual anecdotes. Meanwhile, an overly formal academic write-up would likely be overkill and scare people away.\u003c/p\u003e\n\u003cp\u003eThere are many people who truly enjoy machine learning and find joy in sharing knowledge. I’ve been a long-time follower of AI/tech blogs from \u003ca href=\"https://karpathy.github.io/\"\u003eAndrej Karpathy\u003c/a\u003e, \u003ca href=\"https://lilianweng.github.io/\"\u003eLilian Weng\u003c/a\u003e, \u003ca href=\"https://yaofu.notion.site/\"\u003eYao Fu\u003c/a\u003e, and others.\nI usually prefer blogs over papers because blogs feel more honest and less AI-polished (or written to attract citations). Yet almost everyone I followed stopped posting in early 2025. I understand the shifts and hype in SF lately that keep everyone busy building and/or financially free. 
Still, I’d be sad if this vibe disappears—it’s been truly helpful to me over the past few years, along with many others.\u003c/p\u003e","title":"Why I start to write"},{"content":" Binfeng Xu I\u0026rsquo;m a research engineer at NVIDIA. Currently, I work on Agent RL and harness codesign for computer-use and continual learning.\nFormerly, I was a researcher at Samsung Research (SRA) where I led LLM post-training + distillation infra. I enjoy training large neural nets, building open-source projects and competing on Kaggle, where I rank top 1% globally.\nPapers Polar: Agentic RL on Any Harness at Scale\nBinfeng Xu, Hao Zhang, Shaokun Zhang, Songyang Han, Mingjie Liu, Jian Hu, Shizhe Diao, Zhenghui Jin, Yunheng Zou, Michael Demoret, Jan Kautz, Yi Dong\nGentopia: A Collaborative Platform for Tool-Augmented LLMs\nBinfeng Xu, Xukun Liu, Hua Shen, Zeyu H, Yuhan L, Murong Y, Zhiyuan P, Yuchen L, Ziyu Y, Dongkuan Xu\nReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models\nBinfeng Xu, Zhiyuan Peng, Bowen Lei, Subhabrata Mukherjee, Yuchen Liu, Dongkuan Xu\nDynamic Noise Preference Optimization for LLM Self-Improvement\nHaoyan Yang, Ting Hua, Shangqian Gao, Binfeng Xu, Zheng Tang, Jie Xu, Hongxia Jin, Vijay Srinivasan\nEfficient Computation of Tucker Decomposition of Correlation-Based Tensors\nBinfeng Xu, Grey Ballard, Robert Lyday, Paul Laurienti\nMisc I play all games by Hidetaka Miyazaki, who motivated me once into indie Game Dev. Photography @Instagram; I enjoy Art. Minimalist. ","permalink":"https://billxbf.github.io/about/","summary":"\u003cdiv style=\"display: flex; align-items: flex-start; gap: 2rem; flex-wrap: wrap;\"\u003e\n\u003cdiv style=\"flex: 1; min-width: 300px;\"\u003e\n\u003ch2 style=\"margin-top: 0;\"\u003eBinfeng Xu\u003c/h2\u003e\n\n\u003cp\u003eI\u0026rsquo;m a research engineer at \u003cstrong\u003eNVIDIA\u003c/strong\u003e. 
Currently, I work on Agent RL and harness codesign for \u003cstrong\u003ecomputer-use\u003c/strong\u003e and \u003cstrong\u003econtinual learning\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003eFormerly, I was a researcher at Samsung Research (SRA) where I led LLM post-training + distillation infra. I enjoy training large neural nets, building open-source projects and competing on \u003ca href=\"https://www.kaggle.com/billbafare\"\u003eKaggle\u003c/a\u003e, where I rank top 1% globally.\u003c/p\u003e\n\u003c!-- During my MS. at NYU, I briefly worked with [Alfredo Canziani](https://atcold.github.io/) and [Yann LeCun](https://scholar.google.com/citations?user=WLN3QrAAAAAJ\u0026hl=en) on autonomous driving. Prior at WFU, I was advised by [Grey Ballard](https://users.wfu.edu/ballard/) on efficient tensor decomposition and [Paúl Pauca](https://paucavp.sites.wfu.edu/) on object detection for drone \u0026 satellite images. --\u003e\n\n\u003c/div\u003e\n\u003cdiv style=\"flex-shrink: 0; margin-top: 2rem;\"\u003e\n\u003cimg src=\"/pics/billxbf.png\" alt=\"Binfeng Xu\" style=\"width: 220px; border-radius: 8px;\"\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\n\u003chr\u003e\n\u003ch2 id=\"papers\"\u003ePapers\u003c/h2\u003e\n\u003cul\u003e\n\u003cli\u003e\n\u003cp\u003e\u003ca href=\"/works/polar_arxiv.pdf\"\u003e\u003cstrong\u003ePolar: Agentic RL on Any Harness at Scale\u003c/strong\u003e\u003c/a\u003e\u003cbr\u003e\n\u003cspan style=\"font-size: 0.8em; color: #666;\"\u003e\u003cstrong\u003eBinfeng Xu\u003c/strong\u003e, Hao Zhang, Shaokun Zhang, Songyang Han, Mingjie Liu, Jian Hu, Shizhe Diao, Zhenghui Jin, Yunheng Zou, Michael Demoret, Jan Kautz, Yi Dong\u003c/span\u003e\u003c/p\u003e","title":""}]