I'm an Associate Professor at Nanjing University in Nanjing, China. I received my Ph.D. from City University of Hong Kong and my bachelor's degree from Nanjing University. I was a visiting scholar at the University of New South Wales and the Institute of Automation, Chinese Academy of Sciences.
[Jan. 26th, 2026] Three papers on RL for LLM reasoning and in-context RL were accepted at ICLR 2026.
[Sept. 19th, 2025] Three papers on in-context RL, language agents, and RL for LLM reasoning were accepted at NeurIPS 2025.
[May 1st, 2025] One paper on hierarchical LLM agents was accepted at ICML 2025.
[Jan. 29th, 2025] One paper on interpretable multi-agent RL was accepted at TPAMI.
[Sept. 26th, 2024] One paper on generalist RL agents was accepted at NeurIPS 2024.
[Jan. 16th, 2024] One paper on efficient multi-agent RL coordination was accepted at ICLR 2024.
Research
I'm interested in reinforcement learning algorithms and applications.
Specifically, I work on learning algorithms that scale RL agents to i) dynamic environments, ii) offline settings, and iii) multi-agent systems, enabling them to autonomously adapt to i) non-stationary task distributions, ii) non-interactive scenarios, and iii) cooperative or competitive task assignments, thereby facilitating RL's deployment in real-world domains.
Recently, I have been working on leveraging RL principles for language and vision tasks, as well as leveraging foundation models for decision-making problems. It is a great pleasure to explore ideas in RL fine-tuning for LLMs/generative models, RL for agentic test-time scaling, and vision-language-action models.
Based on a strong positive correlation between global diversity and reasoning capacity, we introduce global diversity incentives as an intrinsic reward to promote deep exploration in a semantically structured space.
ExGRPO: Learning to Reason from Experience
Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, Yu Cheng
International Conference on Learning Representations (ICLR), 2026
paper / code
We investigate what makes a reasoning experience valuable and identify rollout correctness and entropy as effective indicators of experience value. Based on these insights, we prioritize valuable experiences with a mixed-policy objective to balance exploration with experience exploitation.
Our findings highlight the necessity of entropy management for sustained exploration when scaling RL compute.
By analyzing the mechanism behind entropy dynamics, we propose controlling entropy by restricting the updates of high-covariance tokens.
We propose an innovative framework, GLIDER (Grounding Language Models as EffIcient Decision-Making Agents via Offline HiErarchical RL), which introduces a parameter-efficient and generally applicable hierarchy to train competent LLM policies for complex interactive tasks.
We propose Text-to-Decision Agent (T2DA), a simple and scalable pre-training framework for learning generalist policies by aligning language knowledge with the environment dynamics of decision tasks.
We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces, balancing imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training.
We systematically incentivize output diversity throughout the on-policy RL fine-tuning process, reconciling strong task alignment with high generation diversity.
We propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop.
In-context RL, Generalization in RL
Scalable In-Context Q-Learning
Jinmei Liu, Fuhong Liu, Zhenhong Sun, Jianye Hao, Bo Wang, Huaxiong Li, Daoyi Dong, Chunlin Chen, Zhi Wang*
International Conference on Learning Representations (ICLR), 2026
paper / code
We propose SICQL, an innovative framework that harnesses dynamic programming and world modeling to steer ICRL toward efficient reward maximization and task generalization, while retaining the scalability and stability of supervised pretraining.
We propose T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models.
We leverage the sequential modeling ability of the transformer architecture and robust task representation learning via world model disentanglement to achieve efficient generalization in offline meta-RL.
We develop a lifelong RL agent that can incrementally adapt its behaviors to dynamic environments, via maintaining an ever-expanding policy library with online Bayesian inference.
Our main insight is to learn a compact role representation that can capture complex behavior patterns of agents, and use that role representation to promote behavior heterogeneity, knowledge transfer, and skillful coordination across agents.
We propose a novel architecture based on differentiable soft decision trees to tackle the tension between model interpretability and learning performance in MARL domains, paving the way for interpretable and high-performing MARL systems.
We propose an Instance Weighting based Fine-tuning (IW-Fit) method, which revises the fine-tuning stage to improve classification accuracy on the target domain when a pre-trained model from the source domain is given.