Jack (Hao) Bai

Hi there! I’m Jack. I’m a third-year Ph.D. student at UIUC CS, advised by Prof. Tong Zhang. I work closely with Prof. Aviral Kumar @ CMU MLD. I am an incoming research intern at NVIDIA, managed by Prof. Yejin Choi.

Recently, I have been researching fundamental questions about vision-language model reasoning in multi-step environments (what are now called “agents”) with reinforcement learning. I tackle problems with both empirical insights and theoretical considerations.

I was previously a visiting scholar advised by Sergey Levine @ BAIR, and a research intern at Microsoft Research. I received my dual undergrad degree from UIUC and Zhejiang University. During those wonderful years, I was lucky enough to have worked with great minds like Yi Ma @ BAIR and Chengxiang Zhai @ UIUC.

In my free time, I study music theory, with a focus on chord progressions.

A public up-to-date resume can be found here.

News

Mar 08, 2026	Our paper WebGym has been accepted to CVPR 2026! Check out the paper on ArXiv and the project page.
Jan 09, 2026	Today, we proudly announce the release of WebGym, the largest open-source RL training environment for visual web agents to date. The preprint can be accessed at ArXiv. We propose (1) an RL framework with the highest rollout speed, (2) a recipe that supports training agents on long-horizon tasks, and (3) scaling dimensions that effectively improve RL performance with the proposed task set.
Jun 11, 2025	My first paper on web agents with RL, TTI, is released! Check out the preprint! I am super proud of this work and believe it will lead to a paradigm shift in multi-step agent reasoning with RL+VLM.

Research Blogs

Mar 11, 2026	rl What Does Flow-Matching Bring to Deep RL?
Feb 15, 2026	rl Generalizable Value Functions and Emotions (?)
Jan 09, 2026	rl How to Use Privileged Information in RL

Oct 01, 2025	agent Position: Why Web is a Good Environment to Study RL?
Sep 01, 2025	llm Pretraining, Post-training, and Test-Time Reasoning
Aug 07, 2025	rl Challenges in Scaling Q-Learning
Jul 22, 2025	agent Are Multi-step Agents Overthinking?
May 27, 2025	rl Policy Optimization without a Critic: The GRPO Family
Mar 15, 2025	rl Can Language Models Be Critic Functions?

Oct 22, 2024	rl RL on Language under Single-step Settings
Aug 01, 2024	llm LLM Optimization Basics: Memory
Jun 15, 2024	llm LLM Optimization Basics: Time
May 22, 2024	rl Importance Sampling: Why and How
Mar 13, 2024	rl Policy Gradient and Actor-Critic

Jun 07, 2023	llm Self-Attention Layer and The Transformers Architecture

Big Minds

Mar 23, 2026	Jensen Huang: NVIDIA and the AI Revolution
Nov 25, 2025	Ilya Sutskever: From the Age of Scaling to the Age of Research
Aug 15, 2023	Ilya Sutskever: An Observation on Generalization
Feb 01, 2018	Ilya Sutskever: Meta Learning and Self Play

Theory Study

Feb 12, 2026	music The Pentatonic Scale
Dec 13, 2025	music Non-Diatonic Notes
Sep 18, 2025	phil Foundations of Reductionism
Aug 24, 2025	music Jazz Chords and Their Variants
Jul 04, 2025	info Kolmogorov Complexity
Jun 13, 2025	music The Komuro Progression

Selected Publications

CVPR 2026

WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks

Hao Bai, Alexey Taymanov, Tong Zhang, Aviral Kumar, and Spencer Whitehead

Jan 2026

Abs HTML PDF Code

We present WebGym, the largest-to-date open-source environment for training realistic visual web agents. Real websites are non-stationary and diverse, making artificial or small-scale task sets insufficient for robust policy learning. WebGym contains nearly 300,000 tasks with rubric-based evaluations across diverse, real-world websites and difficulty levels. We train agents with a simple reinforcement learning (RL) recipe, which trains on the agent’s own interaction traces (rollouts), using task rewards as feedback to guide learning. First, to enable scaling RL, we speed up sampling of trajectories in WebGym by developing a high-throughput asynchronous rollout system, designed specifically for web agents. Our system achieves a 4-5x rollout speedup compared to naive implementations. Second, we scale the task set breadth, depth, and size, which results in continued performance improvement. Fine-tuning a strong base vision-language model, Qwen-3-VL-8B-Instruct, on WebGym results in an improvement in success rate on an out-of-distribution test set from 26.2% to 42.9%, significantly outperforming agents based on proprietary models such as GPT-4o and GPT-5-Thinking that achieve 27.1% and 29.8%, respectively. This improvement is substantial because our test set consists only of tasks on websites never seen during training, unlike many prior works on training visual web agents.
NeurIPS 2025

Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

Hao Bai, Junhong Shen, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, and Aviral Kumar

May 2025

Abs HTML PDF Code

ICLR 2025

Digi-Q: Transforming VLMs to Device-Control Agents via Value-Based Offline RL

Hao Bai, Yifei Zhou, Erran Li, Sergey Levine, and Aviral Kumar

Jan 2025

Abs HTML PDF Code

Most paradigms for building foundation model agents rely on prompting or finetuning on existing demonstrations, but this is not sufficient in dynamic environments (e.g., mobile device control). In theory, while on-policy reinforcement learning (RL) should address these limitations, this approach itself is not quite effective at leveraging existing agentic data, especially when it is of low quality. An approach to address this issue is to use offline value-based RL, but realizing value-based RL for agents has been elusive due to stability and efficiency issues associated with running TD-learning at scale with vision-language models (VLMs). In this paper, we develop a scalable value-based RL approach called Digi-Q that makes it possible to train VLM agents with TD-learning. We situate our study in building GUI agents for Android devices. The key idea in Digi-Q is to perform TD-learning on a frozen, intermediate-layer representation of a VLM rather than training the whole VLM itself. Doing so successfully requires an initial phase of fine-tuning to prime VLM representations to feature actionable information that is critical for TD-learning. When done correctly, our approach is able to attain better performance per unit of compute (FLOPs). To make maximal use of the learned Q-function, we devise a novel best-of-N policy extraction operator that imitates the best actions out of multiple candidate actions from the current policy as ranked by the value function. With no REINFORCE-style policy gradients that need careful tuning and an efficient TD-learning approach, Digi-Q outperforms several strong prior methods on user-scale device control tasks in Android-in-the-Wild, attaining a 9.9% relative improvement over the prior best-performing offline RL method in this domain.
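
The best-of-N policy extraction step described in the Digi-Q abstract can be illustrated with a minimal Python sketch. It assumes a hypothetical `sample_action` callable standing in for the current policy and a hypothetical `q_value` callable standing in for the learned Q-function; these names, the string-typed states and actions, and the default `n=4` are illustrative assumptions rather than details from the paper, and the actual implementation may differ.

```python
from typing import Callable, List, Sequence, Tuple

State = str   # assumption: states are represented as plain strings here
Action = str  # assumption: actions are represented as plain strings here


def best_of_n_targets(
    states: Sequence[State],
    sample_action: Callable[[State], Action],   # hypothetical stand-in for the current policy
    q_value: Callable[[State, Action], float],  # hypothetical stand-in for the learned Q-function
    n: int = 4,
) -> List[Tuple[State, Action]]:
    """For each state, sample n candidate actions from the policy and keep
    the one the Q-function ranks highest, as an imitation target."""
    if n < 1:
        raise ValueError("n must be at least 1")
    targets: List[Tuple[State, Action]] = []
    for state in states:
        # Draw n candidate actions from the current policy.
        candidates = [sample_action(state) for _ in range(n)]
        # Rank candidates by estimated value and keep the best one.
        best = max(candidates, key=lambda action: q_value(state, action))
        targets.append((state, best))
    return targets
```

The returned (state, action) pairs could then serve as targets for a supervised imitation loss on the policy, which is one way to read "imitates the best actions" without relying on REINFORCE-style policy gradients.
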
Oral @ CPAL 2025

Improving Neuron-level Interpretability with White-box Language Models

Hao Bai and Yi Ma

Oct 2024

Abs HTML PDF

Neurons in auto-regressive language models like GPT-2 can be interpreted by analyzing their activation patterns. Recent studies have shown that techniques such as dictionary learning, a form of post-hoc sparse coding, enhance this neuron-level interpretability. In our research, we are driven by the goal to fundamentally improve neural network interpretability by embedding sparse coding directly within the model architecture, rather than applying it as an afterthought. In our study, we introduce a white-box transformer-like architecture named Coding RAte TransformEr (CRATE), explicitly engineered to capture sparse, low-dimensional structures within data distributions. Our comprehensive experiments showcase significant improvements (up to 103% relative improvement) in neuron-level interpretability across a variety of evaluation metrics. Detailed investigations confirm that this enhanced interpretability is steady across different layers irrespective of the model size, underlining CRATE’s robust performance in enhancing neural network interpretability. Further analysis shows that CRATE’s increased interpretability comes from its enhanced ability to consistently and distinctively activate on relevant tokens. These findings point towards a promising direction for creating white-box foundation models that excel in neuron-level interpretation.
NeurIPS 2024 Oral @ ICML WS

DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

Hao Bai, Yifei Zhou, Jiayi Pan, Mert Cemri, Alane Suhr, Sergey Levine, and Aviral Kumar

Jun 2024

Abs HTML PDF Code

Training corpuses for vision language models typically lack sufficient amounts of decision-centric data. This renders off-the-shelf VLMs sub-optimal for decision-making tasks such as in-the-wild device control through graphical user interfaces (GUIs). While training with static demonstrations has shown some promise, we show that such methods fall short when controlling real GUIs due to their failure to deal with real world stochasticity not captured in static observational data. This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents through fine-tuning a pre-trained VLM in two stages: offline RL to initialize the model, followed by offline-to-online RL. To do this, we build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator and develop a simple yet effective RL approach for learning in this domain. Our approach runs advantage-weighted RL with advantage estimators enhanced to account for stochasticity along with an automatic curriculum for deriving maximal learning signal. We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild (AitW) dataset, where our 1.5B VLM trained with RL achieves a 49.5% absolute improvement – from 17.7% to 67.2% success rate – over supervised fine-tuning with static human demonstration data. These results significantly surpass not only the prior best agents, including AppAgent with GPT-4V (8.3% success rate) and the 17B CogAgent trained with AitW data (14.4%), but also the prior best autonomous RL approach based on filtered behavior cloning (57.8%), thereby establishing a new state-of-the-art for digital agents for in-the-wild device control.
NeurIPS 2024

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, and Sergey Levine

May 2024

Abs HTML PDF Code

Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.
JMLR

White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?

Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D Haeffele, and Yi Ma

Apr 2024

Abs HTML PDF Code

In this paper, we contend that a natural objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a low-dimensional Gaussian mixture supported on incoherent subspaces. The goodness of such a representation can be evaluated by a principled measure, called sparse rate reduction, that simultaneously maximizes the intrinsic information gain and extrinsic sparsity of the learned representation. From this perspective, popular deep network architectures, including transformers, can be viewed as realizing iterative schemes to optimize this measure. Particularly, we derive a transformer block from alternating optimization on parts of this objective: the multi-head self-attention operator compresses the representation by implementing an approximate gradient descent step on the coding rate of the features, and the subsequent multi-layer perceptron sparsifies the features. This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable. We show, by way of a novel connection between denoising and compression, that the inverse to the aforementioned compressive encoding can be realized by the same class of CRATE architectures. Thus, the so-derived white-box architectures are universal to both encoders and decoders. Experiments show that these networks, despite their simplicity, indeed learn to compress and sparsify representations of large-scale real-world image and text datasets, and achieve performance very close to highly engineered transformer-based models: ViT, MAE, DINO, BERT, and GPT2. We believe the proposed computational framework demonstrates great potential in bridging the gap between theory and practice of deep learning, from a unified perspective of data compression.
EMNLP’23

Social Commonsense-Guided Search Query Generation for Open-Domain Knowledge-Powered Conversations

Revanth Reddy, Hao Bai, Wentao Yao, Sharath Chandra Etagi Suresh, Heng Ji, and ChengXiang Zhai

Oct 2023

Abs PDF

Open-domain dialog involves generating search queries that help obtain relevant knowledge for holding informative conversations. However, it can be challenging to determine what information to retrieve when the user is passive and does not express a clear need or request. To tackle this issue, we present a novel approach that focuses on generating internet search queries that are guided by social commonsense. Specifically, we leverage a commonsense dialog system to establish connections related to the conversation topic, which subsequently guides our query generation. Our proposed framework addresses passive user interactions by integrating topic tracking, commonsense response generation and instruction-driven query generation. Through extensive evaluations, we show that our approach overcomes limitations of existing query generation techniques that rely solely on explicit dialog information, and produces search queries that are more relevant, specific, and compelling, ultimately resulting in more engaging responses.