Zhaorun Chen, University of Chicago

Zhaorun Chen
email: zhaorun (at) uchicago (dot) edu
office: JCL 287, 5801 S Ellis Ave, Chicago, IL 60637

I am a second-year PhD student in the Secure Learning Lab at the Department of Computer Science at University of Chicago advised by Prof. Bo Li. I am also very fortunate to closely work with Prof. Dawn Song at UC Berkeley during my PhD.

I did my internship at Meta Superintelligence Labs during summer 2025 working with Jason Weston, and joined Virtue AI since 2024 as a founding member of technical staff.

Previously, I received my Bachelor degree in Computer Science from Shanghai Jiao Tong University in 2022.

My research aims to build capable, safe, and trustworthy AI agents that learn from both observations and knowledge. I explore this through two pathways: (1) strengthening their internal intelligence and alignment by bootstrapping from experience and hindsight, and (2) applying external constraints grounded in human knowledge to achieve compliance and certifiability.

News

[01/2026] 🎉 Six papers accepted by ICLR 2026!

[09/2025] 🎉 Seven papers accepted by NeurIPS 2025. See you in San Diego!

[05/2025] 🎉 One first-authored paper accepted by ICML 2025!

[01/2025] 🎉 Five papers accepted by ICLR 2025, including two first-authored papers and one oral paper!

Selected publications (see full list on Google Scholar)

	Scaling Agent Learning via Experience Synthesis Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, and Dat Huynh International Conference on Learning Representations (ICLR), 2026 pdf abstract blog While reinforcement learning (RL) can empower large language model (LLM) agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL. @article{chen2025scaling, title={Scaling Agent Learning via Experience Synthesis}, author={Chen, Zhaorun and Zhao, Zhuokai and Zhang, Kai and Liu, Bo and Qi, Qi and Wu, Yifan and Kalluri, Tarun and Cao, Sara and Xiong, Yuanhao and Tong, Haibo and others}, journal={arXiv preprint arXiv:2511.03773}, year={2025} }
	ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks Zhaorun Chen^, Xun Liu^, Mintong Kang, Jiawei Zhang, Minzhou Pan, Shuang Yang, Bo Li International Conference on Learning Representations (ICLR), 2026 pdf webpage abstract As vision-language models (VLMs) gain prominence, their multimodal interfaces also introduce new safety vulnerabilities, making the safety evaluation challenging and critical. Existing red-teaming efforts are either restricted to a narrow set of adversarial patterns or depend heavily on manual engineering, lacking scalable exploration of emerging real-world adversarial strategies. To bridge this gap, we propose ARMs, an adaptive red-teaming agent that systematically conducts comprehensive risk assessments for VLMs. Given a target harmful behavior or risk definition, ARMs automatically optimizes diverse red-teaming strategies with reasoning-enhanced multi-step orchestration, to effectively elicit harmful outputs from target VLMs. This is the first red teaming framework that provides controllable generation given risk definitions. We propose 11 novel multimodal attack strategies, covering diverse adversarial patterns of VLMs (e.g., reasoning hijacking, contextual cloaking), and integrate 17 red-teaming algorithms with ARMs. To balance the diversity and effectiveness of the attack, we design a layered memory with an epsilon-greedy attack algorithm. Extensive experiments on different instance-based benchmarks and policy-based safety evaluations show that ARMs achieves the state-of-the-art attack success rate (ASR), improving ASR by an average of 52.1% compared to existing baselines and even exceeding 90% ASR on Claude-4-Sonnet, a constitutionally-aligned model widely recognized for its robustness. We show that the diversity of red-teaming instances generated by ARMs is significantly higher, revealing emerging vulnerabilities in VLMs. Leveraging ARMs, we construct ARMs-Bench, a large-scale multimodal safety benchmark comprising 30K red-teaming instances spanning 51 diverse risk categories, grounded in both real-world multimodal threats and regulatory risks. Fine-tuning with ARMs-Bench substantially reduces ASR while preserving general utility of VLMs, providing actionable insights to improve multimodal safety alignment. @article{chen2025arms, title={ARMs: Adaptive Red-Teaming Agent against Multimodal Models with Plug-and-Play Attacks}, author={Chen, Zhaorun and Liu, Xun and Kang, Mintong and Zhang, Jiawei and Pan, Minzhou and Yang, Shuang and Li, Bo}, journal={arXiv preprint arXiv:2510.02677}, year={2025} }
	RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents Chengquan Guo, Chulin Xie, Yu Yang, Zhaorun Chen, Zinan Lin, Xander Davies, Yarin Gal, Dawn Song, Bo Li International Conference on Learning Representations (ICLR), 2026 pdf webpage abstract Code agents have gained widespread adoption due to their strong code generation capabilities and integration with code interpreters, enabling dynamic execution, debugging, and interactive programming capabilities. While these advancements have streamlined complex workflows, they have also introduced critical safety and security risks. Current static safety benchmarks and red-teaming tools are inadequate for identifying emerging real-world risky scenarios, as they fail to cover certain boundary conditions, such as the combined effects of different jailbreak tools. In this work, we propose RedCodeAgent, the first automated red-teaming agent designed to systematically uncover vulnerabilities in diverse code agents. With an adaptive memory module, RedCodeAgent can leverage existing jailbreak knowledge, dynamically select the most effective red-teaming tools and tool combinations in a tailored toolbox for a given input query, thus identifying vulnerabilities that might otherwise be overlooked. For reliable evaluation, we develop simulated sandbox environments to additionally evaluate the execution results of code agents, mitigating potential biases of LLM-based judges that only rely on static code. Through extensive evaluations across multiple state-of-the-art code agents, diverse risky scenarios, and various programming languages, RedCodeAgent consistently outperforms existing red-teaming methods, achieving higher attack success rates and lower rejection rates with high efficiency. We further validate RedCodeAgent on real-world code assistants, e.g., Cursor and Codeium, exposing previously unidentified security risks. By automating and optimizing red-teaming processes, RedCodeAgent enables scalable, adaptive, and effective safety assessments of code agents. @article{guo2025redcodeagent, title={RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents}, author={Guo, Chengquan and Xie, Chulin and Yang, Yu and Chen, Zhaorun and Lin, Zinan and Davies, Xander and Gal, Yarin and Song, Dawn and Li, Bo}, journal={arXiv preprint arXiv:2510.02609}, year={2025} }
	The Devil Behind the Mask: An Emergent Safety Vulnerability of Diffusion LLMs Zichen Wen, Jiashu Qu, Zhaorun Chen, Xiaoya Lu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang International Conference on Learning Representations (ICLR), 2026 pdf abstract Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. @article{wen2025devil, title={The devil behind the mask: An emergent safety vulnerability of diffusion llms}, author={Wen, Zichen and Qu, Jiashu and Chen, Zhaorun and Liu, Dongrui and Liu, Zhiyuan and Wu, Ruixi and Yang, Yicun and Jin, Xiangqi and Xu, Haoyun and Liu, Xuyang and Li, Weijia and others}, journal={arXiv preprint arXiv:2507.11097}, year={2025} }
	Grounding and Enhancing Informativeness and Utility in Dataset Distillation Shaobo Wang, Yantai Yang, Guo Chen, Peiru Li, Kaixin Li, Yufa Zhou, Zhaorun Chen, Linfeng Zhang International Conference on Learning Representations (ICLR), 2026 pdf abstract Dataset Distillation (DD) seeks to create a compact dataset from a large, real-world dataset. While recent methods often rely on heuristic approaches to balance efficiency and quality, the fundamental relationship between original and synthetic data remains underexplored. This paper revisits knowledge distillation-based dataset distillation within a solid theoretical framework. We introduce the concepts of Informativeness and Utility, capturing crucial information within a sample and essential samples in the training set, respectively. Building on these principles, we define \textit{optimal dataset distillation} mathematically. We then present InfoUtil, a framework that balances informativeness and utility in synthesizing the distilled dataset. InfoUtil incorporates two key components: (1) game-theoretic informativeness maximization using Shapley Value attribution to extract key information from samples, and (2) principled utility maximization by selecting globally influential samples based on Gradient Norm. These components ensure that the distilled dataset is both informative and utility-optimized. Experiments demonstrate that our method achieves a 6.1% performance improvement over the previous state-of-the-art approach on ImageNet-1K dataset using ResNet-18. @inproceedings{ anonymous2026grounding, title={Grounding and Enhancing Informativeness and Utility in Dataset Distillation}, author={Anonymous}, booktitle={The Fourteenth International Conference on Learning Representations}, year={2026}, url={https://openreview.net/forum?id=ThsYRbpv2F} }
	From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization Haonian Ji, Shi Qiu, Siyang Xin, Siwei Han, Zhaorun Chen, Dake Zhang, Hongyi Wang, Huaxiu Yao International Conference on Learning Representations (ICLR), 2026 pdf abstract While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations. @inproceedings{ anonymous2026from, title={From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization}, author={Anonymous}, booktitle={The Fourteenth International Conference on Learning Representations}, year={2026}, url={https://openreview.net/forum?id=FVCpV04ZRe} }
	C-SafeGen: Certified Safe LLM Generation with Claim-Based Streaming Guardrails Mintong Kang, Zhaorun Chen, Bo Li Advances in Neural Information Processing Systems (NeurIPS), 2025 pdf webpage abstract arXiv blog Despite the remarkable capabilities of large language models (LLMs) across diverse applications, they remain vulnerable to generating content that violates safety regulations and policies. To mitigate these risks, LLMs undergo safety alignment; however, they can still be effectively jailbroken. Off-the-shelf guardrail models are commonly deployed to monitor generations, but these models primarily focus on detection rather than ensuring safe decoding of LLM outputs. Moreover, existing efforts lack rigorous safety guarantees, which are crucial for the universal deployment of LLMs and certifiable compliance with regulatory standards. In this paper, we propose a Claim-based Stream Decoding (CSD) algorithm coupled with a statistical risk guarantee framework using conformal analysis. Specifically, our CSD algorithm integrates a stream guardrail model to safeguard sequential claims generated by LLMs and incorporates a backtracking mechanism to revise claims flagged with high safety risks. We provide theoretical guarantees demonstrating that the CSD algorithm achieves the desired generation distribution subject to safety constraints. Furthermore, we introduce a generation risk certification framework and derive a high-probability upper bound on the safety risk of the proposed CSD algorithm. We extend our approach to online settings, where user queries arrive sequentially, and prove that our method can asymptotically control safety risk to any desired level. Empirical evaluations demonstrate the effectiveness and efficiency of the CSD algorithm compared to state-of-the-art safety decoding approaches. Additionally, we validate the soundness and tightness of the derived safety risk upper bound using realistic data in both offline and online scenarios. @inproceedings{kangc, title={C-SafeGen: Certified Safe LLM Generation with Claim-Based Streaming Guardrails}, author={Kang, Mintong and Chen, Zhaorun and Li, Bo}, booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems} }
	MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation? Zhaorun Chen^, Yichao Du^, Zichen Wen^, Yiyang Zhou^, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, Canyu Chen, Qinghao Ye, Zhihong Zhu, Yuqing Zhang, Jiawei Zhou, Zhuokai Zhao, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao Advances in Neural Information Processing Systems (NeurIPS), 2025 pdf webpage abstract arXiv alphaxiv blog Multimodal reward models (RMs) are critical in RLHF and RLAIF, where they serve as judges and provide feedback for aligning foundation models (FMs) with desired behaviors. Despite their significance, these multimodal judges often un- dergo inadequate evaluation of their capabilities and biases, which may lead to potential misalignment and unsafe fine-tuning outcomes. To address this issue, we introduce MJ-Bench, a novel benchmark which incorporates a comprehensive preference dataset to evaluate multimodal judges in providing feedback for image generation models across four key perspectives: alignment, safety, image quality, and bias. Specifically, we evaluate a large variety of multimodal judges includ- ing smaller-sized CLIP-based scoring models, open-source VLMs (e.g. LLaVA family), and close-source VLMs (e.g. GPT-4o, Claude 3) on each decomposed subcategory of our preference dataset. Experiments reveal that close-source VLMs generally provide better feedback, with GPT-4o outperforming other judges in aver- age. Compared with open-source VLMs, smaller-sized scoring models can provide better feedback regarding text-image alignment and image quality, while VLMs provide more accurate feedback regarding safety and generation bias due to their stronger reasoning capabilities. Notably, human evaluations on end-to-end fine- tuned models using separate feedback from these multimodal judges provide similar conclusions, further confirming the effectiveness of MJ-Bench. Further studies in feedback scale reveal that VLM judges can generally provide more accurate and stable feedback in natural language (Likert-scale) than numerical scales. The code and data are available here. @misc{chen2024mjbenchmultimodalrewardmodel, title={MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?}, author={Zhaorun Chen and Yichao Du and Zichen Wen and Yiyang Zhou and Chenhang Cui and Zhenzhen Weng and Haoqin Tu and Chaoqi Wang and Zhengwei Tong and Qinglan Huang and Canyu Chen and Qinghao Ye and Zhihong Zhu and Yuqing Zhang and Jiawei Zhou and Zhuokai Zhao and Rafael Rafailov and Chelsea Finn and Huaxiu Yao}, year={2024}, eprint={2407.04842}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2407.04842}, }
	GuardSet-X: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset Mintong Kang^, Zhaorun Chen^, Chejian Xu^, Jiawei Zhang^, Chengquan Guo^, Minzhou Pan, Ivan Revilla, Yu Sun, Bo Li Advances in Neural Information Processing Systems (NeurIPS*), 2025 pdf webpage abstract arXiv blog As LLMs become widespread across diverse applications, concerns about the security and safety of LLM interactions have intensified. Numerous guardrail models and benchmarks have been developed to ensure LLM content safety. However, existing guardrail benchmarks are often built upon ad hoc risk taxonomies that lack a principled grounding in standardized safety policies, limiting their alignment with real-world operational requirements. Moreover, they tend to overlook domain-specific risks, while the same risk category can carry different implications across different domains. To bridge these gaps, we introduce GuardSet-X, the first massive multi-domain safety policy-grounded guardrail dataset. GuardSet-X offers: (1) broad domain coverage across eight safety-critical domains, such as finance, law, and codeGen; (2) policy-grounded risk construction based on authentic, domain-specific safety guidelines; (3) diverse interaction formats, encompassing declarative statements, questions, instructions, and multi-turn conversations; (4) advanced benign data curation via detoxification prompting to challenge over-refusal behaviors; and (5) \textbf{attack-enhanced instances} that simulate adversarial inputs designed to bypass guardrails. Based on GuardSet-X, we benchmark 19 advanced guardrail models and uncover a series of findings, such as: (1) All models achieve varied F1 scores, with many demonstrating high variance across risk categories, highlighting their limited domain coverage and insufficient handling of domain-specific safety concerns; (2) As models evolve, their coverage of safety risks broadens, but performance on common risk categories may decrease; (3) All models remain vulnerable to optimized adversarial attacks. We believe that \dataset and the unique insights derived from our evaluations will advance the development of policy-aligned and resilient guardrail systems. @article{kang2025polyguard, title={GuardSet-X: Massive Multi-Domain Safety Policy-Grounded Guardrail Dataset}, author={Kang, Mintong and Chen, Zhaorun and Xu, Chejian and Zhang, Jiawei and Guo, Chengquan and Pan, Minzhou and Revilla, Ivan and Sun, Yu and Li, Bo}, journal={arXiv preprint arXiv:2506.19054}, year={2025} }
	MJ-Video: Benchmarking and Rewarding Video Generation with Fine-Grained Video Preference Haibo Tong^, Zhaoyang Wang^, Zhaorun Chen, Haonian Ji, Shi Qiu, Siwei Han, Kexin Geng, Zhongkai Xue, Yiyang Zhou, Peng Xia, Mingyu Ding, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao Advances in Neural Information Processing Systems (NeurIPS), 2025 (Spotlight Presentation) pdf webpage abstract arXiv alphaxiv blog Recent advancements in video generation have significantly improved the ability to synthesize videos from text instructions. However, existing models still struggle with key challenges such as instruction misalignment, content hallucination, safety concerns, and bias. Addressing these limitations, we introduce MJ-BENCH-VIDEO, a large-scale video preference benchmark designed to evaluate video generation across five critical aspects: Alignment, Safety, Fineness, Coherence & Consistency, and Bias & Fairness. This benchmark incorporates 28 fine-grained criteria to provide a comprehensive evaluation of video preference. Building upon this dataset, we propose MJ-VIDEO, a Mixture-of-Experts (MoE)-based video reward model designed to deliver fine-grained reward. MJ-VIDEO can dynamically select relevant experts to accurately judge the preference based on the input text-video pair. This architecture enables more precise and adaptable preference judgments. Through extensive benchmarking on MJ-BENCH-VIDEO, we analyze the limitations of existing video reward models and demonstrate the superior performance of MJ-VIDEO in video preference assessment, achieving 17.58% and 15.87% improvements in overall and fine-grained preference judgments, respectively. Additionally, introducing MJ-VIDEO for preference tuning in video generation enhances the alignment performance. @misc{tong2025mjvideofinegrainedbenchmarkingrewarding, title={MJ-VIDEO: Fine-Grained Benchmarking and Rewarding Video Preferences in Video Generation}, author={Haibo Tong and Zhaoyang Wang and Zhaorun Chen and Haonian Ji and Shi Qiu and Siwei Han and Kexin Geng and Zhongkai Xue and Yiyang Zhou and Peng Xia and Mingyu Ding and Rafael Rafailov and Chelsea Finn and Huaxiu Yao}, year={2025}, eprint={2502.01719}, archivePrefix={arXiv}, primaryClass={cs.CV}, url={https://arxiv.org/abs/2502.01719}, }
	AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration Andy Zhou, Kevin Wu, Francesco Pinto, Zhaorun Chen, Yi Zeng, Yu Yang, Shuang Yang, Sanmi Koyejo, James Zou, Bo Li Advances in Neural Information Processing Systems (NeurIPS), 2025 pdf webpage abstract arXiv blog As large language models (LLMs) become increasingly capable, security and safety evaluation are crucial. While current red teaming approaches have made strides in assessing LLM vulnerabilities, they often rely heavily on human input and lack comprehensive coverage of emerging attack vectors. This paper introduces AutoRedTeamer, a novel framework for fully automated, end-to-end red teaming against LLMs. AutoRedTeamer combines a multi-agent architecture with a memory-guided attack selection mechanism to enable continuous discovery and integration of new attack vectors. The dual-agent framework consists of a red teaming agent that can operate from high-level risk categories alone to generate and execute test cases and a strategy proposer agent that autonomously discovers and implements new attacks by analyzing recent research. This modular design allows AutoRedTeamer to adapt to emerging threats while maintaining strong performance on existing attack vectors. We demonstrate AutoRedTeamer's effectiveness across diverse evaluation settings, achieving 20% higher attack success rates on HarmBench against Llama-3.1-70B while reducing computational costs by 46% compared to existing approaches. AutoRedTeamer also matches the diversity of human-curated benchmarks in generating test cases, providing a comprehensive, scalable, and continuously evolving framework for evaluating the security of AI systems. @article{zhou2025autoredteamer, title={Autoredteamer: Autonomous red teaming with lifelong attack integration}, author={Zhou, Andy and Wu, Kevin and Pinto, Francesco and Chen, Zhaorun and Zeng, Yi and Yang, Yu and Yang, Shuang and Koyejo, Sanmi and Zou, James and Li, Bo}, journal={arXiv preprint arXiv:2503.15754}, year={2025} }
	From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding Yixiong Fang^, Ziran Yang^, Zhaorun Chen, Zhuokai Zhao, Jiawei Zhou Advances in Neural Information Processing Systems (NeurIPS), 2025 pdf webpage abstract arXiv blog Large vision-language models (LVLMs) demonstrate remarkable capabilities in multimodal tasks but are prone to misinterpreting visual inputs, often resulting in hallucinations and unreliable outputs. To address these challenges, we propose Dropout Decoding, a novel inference-time approach that quantifies the uncertainty of visual tokens and selectively masks uncertain tokens to improve decoding. Our method measures the uncertainty of each visual token by projecting it onto the text space and decomposing it into aleatoric and epistemic components. Specifically, we focus on epistemic uncertainty, which captures perception-related errors more effectively. Inspired by dropout regularization, we introduce uncertainty-guided token dropout, which applies the dropout principle to input visual tokens instead of model parameters, and during inference rather than training. By aggregating predictions from an ensemble of masked decoding contexts, Dropout Decoding robustly mitigates errors arising from visual token misinterpretations. Evaluations on benchmarks including CHAIR, THRONE, and MMBench demonstrate that Dropout Decoding significantly reduces object hallucinations (OH) and enhances both reliability and quality of LVLM outputs across diverse visual contexts. @article{fang2024uncertainty, title={From uncertainty to trust: Enhancing reliability in vision-language models with uncertainty-guided dropout decoding}, author={Fang, Yixiong and Yang, Ziran and Chen, Zhaorun and Zhao, Zhuokai and Zhou, Jiawei}, journal={arXiv preprint arXiv:2412.06474}, year={2024} }
	Efficient Multi-modal Large Language Models via Progressive Consistency Distillation Zichen Wen , Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, Linfeng Zhang Advances in Neural Information Processing Systems (NeurIPS), 2025 pdf abstract Visual tokens consume substantial computational resources in multi-modal large models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model’s parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework. @inproceedings{wenefficient, title={Efficient Multi-modal Large Language Models via Progressive Consistency Distillation}, author={Wen, Zichen and Wang, Shaobo and Zhou, Yufa and Zhang, Junyuan and Zhang, Qintong and Gao, Yifeng and Chen, Zhaorun and Wang, Bin and Li, Weijia and He, Conghui and others}, booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems} }
	ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning Zhaorun Chen, Mintong Kang, and Bo Li International Conference on Machine Learning (ICML), 2025 pdf webpage abstract arXiv alphaxiv blog Autonomous agents powered by foundation models have seen widespread adoption across various real-world applications. However, they remain highly vulnerable to malicious instructions and attacks, which can result in severe consequences such as privacy breaches and financial losses. More critically, existing guardrails for LLMs are not applicable due to the complex and dynamic nature of agents. To tackle these challenges, we propose ShieldAgent, the first guardrail agent designed to enforce explicit safety policy compliance for the action trajectory of other protected agents through logical reasoning. Specifically, ShieldAgent first constructs a safety policy model by extracting verifiable rules from policy documents and structuring them into a set of action-based probabilistic rule circuits. Given the action trajectory of the protected agent, ShieldAgent retrieves relevant rule circuits and generates a shielding plan, leveraging its comprehensive tool library and executable code for formal verification. In addition, given the lack of guardrail benchmarks for agents, we introduce ShieldAgent-Bench, a dataset with 3K safety-related pairs of agent instructions and action trajectories, collected via SOTA attacks across 6 web environments and 7 risk categories. Experiments show that ShieldAgent achieves SOTA on ShieldAgent-Bench and three existing benchmarks, outperforming prior methods by 11.3% on average with a high recall of 90.1%. Additionally, ShieldAgent reduces API queries by 64.7% and inference time by 58.2%, demonstrating its high precision and efficiency in safeguarding agents. @article{chen2025shieldagent, title={ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning}, author={Chen, Zhaorun and Kang, Mintong and Li, Bo}, journal={arXiv preprint arXiv:2503.22738}, year={2025} }
	SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations Zhaorun Chen, Francesco Pinto, Minzhou Pan, and Bo Li International Conference on Learning Representations (ICLR), 2025 pdf webpage abstract arXiv alphaxiv blog With the rise of generative AI and rapid growth of high-quality video generation, video guardrails have become more crucial than ever to ensure safety and security across platforms. Current video guardrails, however, are either overly simplistic, relying on pure classification models trained on simple policies with limited unsafe categories, which lack detailed explanations, or prompting multimodal large language models (MLLMs) with long safety guidelines, which are inefficient and impractical for guardrailing real-world content. To bridge this gap, we propose SafeWatch, an efficient MLLM-based video guardrail model designed to follow customized safety policies and provide multi-label video guardrail outputs with content-specific explanations in a zero-shot manner. In particular, unlike traditional MLLM-based guardrails that encode all safety policies autoregressively, causing inefficiency and bias, SafeWatch uniquely encodes each policy chunk in parallel and eliminates their position bias such that all policies are attended simultaneously with equal importance. In addition, to improve efficiency and accuracy, SafeWatch incorporates a policy-aware visual token pruning algorithm that adaptively selects the most relevant video tokens for each policy, discarding noisy or irrelevant information. This allows for more focused, policy-compliant guardrail with significantly reduced computational overhead. Considering the limitations of existing video guardrail benchmarks, we propose SafeWatch-Bench, a large-scale video guardrail benchmark comprising over 2M videos spanning six safety categories which covers over 30 tasks to ensure a comprehensive coverage of all potential safety scenarios. We have conducted extensive experiments, showing that SafeWatch outperforms all SOTA video guardrails on SafeWatch-Bench by 28.2%, and achieves a 13.6% improvement on existing benchmarks, all while reducing inference costs by an average of 10%. SafeWatch also demonstrates strong policy-following abilities and outperforms previous SOTAs by 5.6% and 15.6% in zero-shot generalizability to new policies and new prompting tasks. Additionally, both LLM-as-a-judge and human evaluators confirm the high quality of the explanations provided by SafeWatch. Our project is open-sourced at here. @article{chen2024safewatch, title={SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations}, author={Chen, Zhaorun and Pinto, Francesco and Pan, Minzhou and Li, Bo}, journal={arXiv preprint arXiv:2412.06878}, year={2024} }
	MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models Chejian Xu^, Jiawei Zhang^, Zhaorun Chen^, Chulin Xie^, Mintong Kang^, Yujin Potter^, Zhun Wang^, Zhuowen Yuan^, Alexander Xiong, Zidi Xiong, Chenhui Zhang, Lingzhi Yuan, Yi Zeng, Peiyang Xu, Chengquan Guo, Andy Zhou, Jeffrey Ziwei Tan, Xuandong Zhao, Francesco Pinto, Zhen Xiang, Yu Gai, Zinan Lin, Dan Hendrycks, Bo Li, Dawn Song International Conference on Learning Representations (ICLR), 2025 pdf webpage abstract arXiv Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as generating unsafe content by text-to-image models. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models, or only focus on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. We have designed various evaluation scenarios and red teaming algorithms under different tasks for each perspective to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. @inproceedings{ xu2025mmdt, title={{MMDT}: Decoding the Trustworthiness and Safety of Multimodal Foundation Models}, author={Chejian Xu and Jiawei Zhang and Zhaorun Chen and Chulin Xie and Mintong Kang and Yujin Potter and Zhun Wang and Zhuowen Yuan and Alexander Xiong and Zidi Xiong and Chenhui Zhang and Lingzhi Yuan and Yi Zeng and Peiyang Xu and Chengquan Guo and Andy Zhou and Jeffrey Ziwei Tan and Xuandong Zhao and Francesco Pinto and Zhen Xiang and Yu Gai and Zinan Lin and Dan Hendrycks and Bo Li and Dawn Song}, booktitle={The Thirteenth International Conference on Learning Representations}, year={2025}, url={https://openreview.net/forum?id=qIbbBSzH6n} }
	MMIE: Massive Multimodal Interleaved Comprehension Benchmark For Large Vision-Language Models Peng Xia^, Siwei Han^, Shi Qiu^, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao International Conference on Learning Representations (ICLR), 2025 (Oral Presentation)* pdf webpage abstract arXiv alphaxiv blog Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, have become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking in reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. @article{xia2024mmie, title={MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models}, author={Xia, Peng and Han, Siwei and Qiu, Shi and Zhou, Yiyang and Wang, Zhaoyang and Zheng, Wenhao and Chen, Zhaorun and Cui, Chenhang and Ding, Mingyu and Li, Linjie and Wang, Lijuan and Yao, Huaxiu}, journal={arXiv preprint arXiv:2410.10139}, year={2024} }
	AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, and Bo Li Advances in Neural Information Processing Systems (NeurIPS), 2024 pdf webpage abstract arXiv alphaxiv blog LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. We inject the poisoning instances into the RAG knowledge base and long-term memories of these agents, respectively, demonstrating the generalization of AgentPoison. On each agent, AgentPoison achieves an average attack success rate of ≥ 80% with minimal impact on benign performance (≤ 1%) with a poison rate < 0.1%. Code is released here. @article{chen2024agentpoison, title={AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases}, author={Chen, Zhaorun and Xiang, Zhen and Xiao, Chaowei and Song, Dawn and Li, Bo}, journal={arXiv preprint arXiv:2407.12784}, year={2024} }
	HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding Zhaorun Chen^, Zhuokai Zhao^, Hongyin Luo, Huaxiu Yao, Bo Li and Jiawei Zhou International Conference on Machine Learning (ICML), 2024 pdf webpage abstract arXiv alphaxiv blog While large vision-language models (LVLMs) have demonstrated impressive capabilities in interpreting multi-modal contexts, they invariably suffer from object hallucinations (OH). We introduce HALC, a novel decoding algorithm designed to mitigate OH in LVLMs. HALC leverages distinct fine-grained optimal visual information in vision-language tasks and operates on both local and global contexts simultaneously. Specifically, HALC integrates a robust auto-focal grounding mechanism (locally) to correct hallucinated tokens on the fly, and a specialized beam search algorithm (globally) to significantly reduce OH while preserving text generation quality. Additionally, HALC can be integrated into any LVLMs as a plug-and-play module without extra training. Extensive experimental studies demonstrate HALC's effectiveness in reducing OH, outperforming state-of-the-arts across four benchmarks. Code is released here. @article{chen2024halc, title={HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding}, author={Chen, Zhaorun and Zhao, Zhuokai and Luo, Hongyin and Yao, Huaxiu and Li, Bo and Zhou, Jiawei}, journal={arXiv preprint arXiv:2403.00425}, year={2024} }
	Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment Chenhang Cui^, An Zhang^, Yiyang Zhou, Zhaorun Chen, Gelei Deng, Huaxiu Yao, Tat-Seng Chua International Conference on Learning Representations (ICLR), 2025 pdf webpage abstract arXiv alphaxiv The recent advancements in large language models (LLMs) and pre-trained vision models have accelerated the development of vision-language large models (VLLMs), enhancing the interaction between visual and linguistic modalities. Despite their notable success across various domains, VLLMs face challenges in modality alignment, which can lead to issues like hallucinations and unsafe content generation. Current alignment techniques often rely on coarse feedback and external datasets, limiting scalability and performance. In this paper, we propose FiSAO (Fine-Grained Self-Alignment Optimization), a novel self-alignment method that utilizes the model's own visual encoder as a fine-grained verifier to improve vision-language alignment without the need for additional data. By leveraging token-level feedback from the vision encoder, FiSAO significantly improves vision-language alignment, even surpassing traditional preference tuning methods that require additional data. Through both theoretical analysis and experimental validation, we demonstrate that FiSAO effectively addresses the misalignment problem in VLLMs, marking the first instance of token-level rewards being applied to such models. @article{cui2024fine, title={Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment}, author={Cui, Chenhang and Zhang, An and Zhou, Yiyang and Chen, Zhaorun and Deng, Gelei and Yao, Huaxiu and Chua, Tat-Seng}, journal={arXiv preprint arXiv:2410.14148}, year={2024} }
	Anyprefer: An Agentic Framework for Preference Data Synthesis Yiyang Zhou, Zhaoyang Wang, Tianle Wang, Shangyu Xing, Peng Xia, Bo Li, Kaiyuan Zheng, Zijian Zhang, Zhaorun Chen, Wenhao Zheng, Xuchao Zhang, Chetan Bansal, Weitong Zhang, Ying Wei, Mohit Bansal, Huaxiu Yao International Conference on Learning Representations (ICLR), 2025 pdf webpage abstract arXiv High-quality preference data is essential for aligning foundation models with human values through preference learning. However, manual annotation of such data is often time-consuming and costly. Recent methods adopt a self-rewarding approach, where the target model generates and annotates its own preference data, but this can lead to inaccuracies due to the reward model sharing weights with the target model, amplifying inherent biases. To address these issues, we propose Anyprefer, a framework designed to synthesize high-quality preference data for the target model. Anyprefer frames the data synthesis process as a cooperative two-player Markov Game, where the target model and a judge model collaborate. Here, a series of external tools are introduced to assist the judge model in accurately rewarding the target model's responses, mitigating biases in the process. We also introduce a feedback mechanism to optimize prompts for both models, enhancing collaboration and improving data quality. The synthesized data is compiled into a new preference dataset, Anyprefer-V1, consisting of 58K high-quality preference pairs. Extensive experiments show that Anyprefer significantly improves model alignment across four applications, covering 21 datasets, achieving average improvements of 18.55 in five natural language generation datasets, 3.66 in nine vision-language understanding datasets, 30.05 in three medical image analysis datasets, and 14.50 in four visuo-motor control tasks. @inproceedings{ zhou2025mjprefergen, title={{MJ}-PreferGen: An Automatic Framework for Preference Data Synthesis}, author={Yiyang Zhou and Zhaoyang Wang and Tianle Wang and Shangyu Xing and Peng Xia and Bo Li and Kaiyuan Zheng and Zijian Zhang and Zhaorun Chen and Wenhao Zheng and Xuchao Zhang and Chetan Bansal and Weitong Zhang and Ying Wei and Mohit Bansal and Huaxiu Yao}, booktitle={The Thirteenth International Conference on Learning Representations}, year={2025}, url={https://openreview.net/forum?id=WpZyPk79Fu} }
	Calibrated Self-Rewarding Vision Language Models Yiyang Zhou^, Zhiyuan Fan^, Dongjie Cheng^, Sihan Yang, Zhaorun Chen, Chenhang Cui, Xiyao Wang, Yun Li, Linjun Zhang, and Huaxiu Yao Advances in Neural Information Processing Systems (NeurIPS*), 2024 pdf webpage abstract arXiv alphaxiv blog Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. These approaches are resource-intensive and may not effectively reflect the target LVLM's preferences, making the curated preferences easily distinguishable. Our work addresses these challenges by proposing the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning. In the reward modeling, we employ a step-wise strategy and incorporate visual constraints into the self-rewarding process to place greater emphasis on visual input. Empirical results demonstrate that CSR significantly enhances performance and reduces hallucinations across ten benchmarks and tasks, achieving substantial improvements over existing methods by 7.62%. Our empirical results are further supported by rigorous theoretical analysis, under mild assumptions, verifying the effectiveness of introducing visual constraints into the self-rewarding paradigm. Additionally, CSR shows compatibility with different vision-language models and the ability to incrementally improve performance through iterative fine-tuning. @article{zhou2024calibrated, title={Calibrated self-rewarding vision language models}, author={Zhou, Yiyang and Fan, Zhiyuan and Cheng, Dongjie and Yang, Sihan and Chen, Zhaorun and Cui, Chenhang and Wang, Xiyao and Li, Yun and Zhang, Linjun and Yao, Huaxiu}, journal={arXiv preprint arXiv:2405.14622}, year={2024} }
	EscIRL: Evolving Self-Contrastive IRL for Trajectory Prediction in Autonomous Driving Siyue Wang^, Zhaorun Chen^, Zhuokai Zhao, Chaoli Mao, Yiyang Zhou, Jiayu He, Albert Hu Conference on Robot Learning (CoRL), 2024 pdf abstract arXiv While deep neural networks (DNN) and inverse reinforcement learning (IRL) have both been commonly used in autonomous driving to predict trajectories through learning from expert demonstrations, DNN-based methods suffer from data-scarcity, while IRL-based approaches often struggle with generalizability, making both hard to apply to new driving scenarios. To address these issues, we introduce EscIRL, a novel decoupled bi-level training framework that iteratively learns robust reward models from only a few mixed-scenario demonstrations. At the inner level, EscIRL introduces a self-contrastive IRL module that learns a spectrum of specialized reward functions by contrasting demonstrations across different scenarios. At the outer level, ESCIRL employs an evolving loop that iteratively refines the contrastive sets, ensuring global convergence. Experiments on two multi-scenario datasets, CitySim and INTERACTION, demonstrate the effectiveness of EscIRL, outperforming state-of-the-art DNN and IRL-based methods by 41.3% on average. Notably, we show that EscIRL achieves superior generalizability compared to DNN-based approaches while requiring only a small fraction of the data, effectively addressing data-scarcity constraints. All code and data are available at here. @inproceedings{wang2024escirl, title={EscIRL: Evolving Self-Contrastive IRL for Trajectory Prediction in Autonomous Driving}, author={Wang, Siyue and Chen, Zhaorun and Zhao, Zhuokai and Mao, Chaoli and Zhou, Yiyang and He, Jiayu and Hu, Albert Sibo}, booktitle={8th Annual Conference on Robot Learning}, year={2024} }
	AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition Zhaorun Chen, Zhuokai Zhao, Zhihong Zhu, Ruiqi Zhang, Xiang Li, Bhiksha Raj and Huaxiu Yao North American Chapter of the Association for Computational Linguistics (NAACL), 2024 pdf abstract arXiv alphaxiv blog Recent advancements in large language models (LLMs) have shown promise in multi-step reasoning tasks, yet their reliance on extensive manual labeling to provide procedural feedback remains a significant impediment. To address this challenge, in this paper, we propose a novel self-supervised framework AutoPRM that efficiently enhances the fine-tuning of LLMs for intricate reasoning challenges. Specifically, AutoPRM first decomposes complex problems into more manageable subquestions with a controllable granularity switch, then sequentially apply reinforcement learning to iteratively improve the subquestion solver. Additionally, we propose context-guided-decoding to avoid reward tampering and guide the subquestion solver towards the solution of the holistic problem. Extensive experiments show that AutoPRM significantly improves performance on mathematical and commonsense reasoning tasks over SOTA. More encouragingly, AutoPRM can be easily integrated with other orthogonal reasoning pipelines. @article{chen2024autoprm, title={AutoPRM: Automating Procedural Supervision for Multi-Step Reasoning via Controllable Question Decomposition}, author={Chen, Zhaorun and Zhao, Zhuokai and Zhu, Zhihong and Zhang, Ruiqi and Li, Xiang and Raj, Bhiksha and Yao, Huaxiu}, journal={arXiv preprint arXiv:2402.11452}, year={2024} }
	Safe Reinforcement Learning via Hierarchical Adaptive Chance-Constraint Safeguards Zhaorun Chen, Zhuokai Zhao, Tairan He, Binhao Chen, Xuhao Zhao, Liang Gong, Chengliang Liu IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024 (Oral Pitch) pdf abstract arXiv Ensuring safety in Reinforcement Learning (RL), typically framed as a Constrained Markov Decision Process (CMDP), is crucial for real-world exploration applications. Current approaches in handling CMDP struggle to balance optimality and feasibility, as direct optimization methods cannot ensure state-wise in-training safety, and projection-based methods correct actions inefficiently through lengthy iterations. To address these challenges, we propose Adaptive Chance-constrained Safeguards (ACS), an adaptive, model-free safe RL algorithm using the safety recovery rate as a surrogate chance constraint to iteratively ensure safety during exploration and after achieving convergence. Theoretical analysis indicates that the relaxed probabilistic constraint sufficiently guarantees forward invariance to the safe set. And extensive experiments conducted on both simulated and real-world safety-critical tasks demonstrate its effectiveness in enforcing safety (nearly zero-violation) while preserving optimality (+23.8%), robustness, and fast response in stochastic real-world settings. @misc{chen2024safereinforcementlearninghierarchical, title={Safe Reinforcement Learning via Hierarchical Adaptive Chance-Constraint Safeguards}, author={Zhaorun Chen and Zhuokai Zhao and Tairan He and Binhao Chen and Xuhao Zhao and Liang Gong and Chengliang Liu}, year={2024}, eprint={2310.03379}, archivePrefix={arXiv}, primaryClass={cs.RO}, url={https://arxiv.org/abs/2310.03379}, }

Preprints (see full list on Google Scholar)


	AgentXploit: End-to-End Red-Teaming for AI Agents Powered by Multi-Agent Systems Shi Qiu, Zhun Wang, Tianneng Shi, Zhaorun Chen, Wenbo Guo, Dawn Song In submission, 2026 pdf abstract AI agents, powered by Large Language Model (LLM), are vulnerable to indirect prompt injection attacks, where malicious data from external tools and data sources can manipulate agent behavior. Existing works mainly focus on developing strategies to generate attack prompts under a pre-constructed attack patch. However, the major challenge lies in systematically identifying attack vectors and automatically constructing attack paths from the system entry points to the target components. Without such mechanisms, none of the existing red-teaming approaches can be applied to real-world AI agents for end-to-end attacks. We propose AgentXploit, the first fully automatic, multi-agent framework that systematically identifies and demonstrates these vulnerabilities through white-box analysis of the agent's source code. Our framework operates in two stages. First, an Analyzer agent inspects the target agent’s codebase using dynamic task planning, long-term context management, and semantical code browsing. It generates comprehensive reports on agent workflows, high-risk tools that process untrusted data, sensitive tools that can cause harm, and viable attack paths. Second, an Exploiter agent uses these reports to dynamically craft and execute attacks. It leverages a specialized seed corpus, context-aware injection generation guided by real-time agent feedback, and multi-injection collaboration to reliably trigger malicious behavior. We evaluate AgentXploit on two popular open-source agents, successfully demonstrating a range of attacks. On the AgentDojo benchmark, our Exploiter achieves a 79% attack success rate, outperforming prior state-of-the-art by 27%. @misc{ qiu2025agentxploit, title={AgentXploit: End-to-End Red-Teaming for {AI} Agents Powdered by Multi-Agent Systems}, author={Shi Qiu and Zhun Wang and Tianneng Shi and Zhaorun Chen and Wenbo Guo and Dawn Song}, year={2025}, url={https://openreview.net/forum?id=xKJ0lVQEv7} }
	Token-Level LLM Collaboration via FusionRoute Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao In submission, 2026 pdf abstract blog Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert's next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks. @article{xiong2026token, title={Token-Level LLM Collaboration via FusionRoute}, author={Xiong, Nuoya and Zhou, Yuhang and Zeng, Hanqing and Chen, Zhaorun and Huang, Furong and Bi, Shuchao and Zhang, Lizhu and Zhao, Zhuokai}, journal={arXiv preprint arXiv:2601.05106}, year={2026} }
	Agent Learning via Early Experience Kai Zhang^, Xiangchao Chen^, Bo Liu^, Tianci Xue^, Zeyi Liao^, Zhihan Liu^, Xiyao Wang^, Yuting Ning^, Zhaorun Chen^*, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu In submission, 2025 pdf abstract blog A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents. @article{zhang2025agent, title={Agent Learning via Early Experience}, author={Zhang, Kai and Chen, Xiangchao and Liu, Bo and Xue, Tianci and Liao, Zeyi and Liu, Zhihan and Wang, Xiyao and Ning, Yuting and Chen, Zhaorun and Fu, Xiaohan and others}, journal={arXiv preprint arXiv:2510.08558}, year={2025} }
	GRAPE: Generalizing Robot Policy via Preference Alignment Zijian Zhang^, Kaiyuan Zheng^, Zhaorun Chen^*, Joel Jang, Yi Li, Chaoqi Wang, Mingyu Ding, Dieter Fox, and Huaxiu Yao In submission, 2025 pdf webpage abstract arXiv alphaxiv blog Despite the recent advancements of vision-language-action (VLA) models on a variety of robotics tasks, they suffer from critical issues such as poor generalizability to unseen tasks, due to their reliance on behavior cloning exclusively from successful rollouts. Furthermore, they are typically fine-tuned to replicate demonstrations collected by experts under different settings, thus introducing distribution bias and limiting their adaptability to diverse manipulation objectives, such as efficiency, safety, and task completion. To bridge this gap, we introduce GRAPE: Generalizing Robot Policy via Preference Alignment. Specifically, GRAPE aligns VLAs on a trajectory level and implicitly models reward from both successful and failure trials to boost generalizability to diverse tasks. Moreover, GRAPE breaks down complex manipulation tasks to independent stages and automatically guides preference modeling through customized spatiotemporal constraints with keypoints proposed by a large vision-language model. Notably, these constraints are flexible and can be customized to align the model with varying objectives, such as safety, efficiency, or task success. We evaluate GRAPE across a diverse array of tasks in both real-world and simulated environments. Experimental results demonstrate that GRAPE enhances the performance of state-of-the-art VLA models, increasing success rates on in-domain and unseen manipulation tasks by 51.79% and 60.36%, respectively. Additionally, GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 44.31% and rollout step-length by 11.15%, respectively. All code, models, and data are available at here. @article{zhang2024grape, title={Grape: Generalizing robot policy via preference alignment}, author={Zhang, Zijian and Zheng, Kaiyuan and Chen, Zhaorun and Jang, Joel and Li, Yi and Wang, Chaoqi and Ding, Mingyu and Fox, Dieter and Yao, Huaxiu}, journal={arXiv preprint arXiv:2411.19309}, year={2024} }

Organizer

NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models (ResponsibleFM), NeurIPS 2025
COLM 2025 Workshop on AI Agents: Capabilities and Safety (AIA 2025), COLM 2025
The LLM and Agent Safety Competition (CLAS 2024), NeurIPS 2024

Reviewer Service

Area Chair: EMNLP'25
Conference Reviewer: NeurIPS'25, ICML'25, ICCV'25, CVPR'25, NeurIPS'24, ICLR'24'25, COLM'24, ARR'24, IROS'24

University of Chicago | Department of Computer Science
Template modified from Jon Barron