For PhD applicants: thank you for your interest! I am taking on new PhD students. However, there is no need to contact me directly about PhD admissions, as admissions are handled by the admissions committee.
Instead, please mention my name in your research statement. I look forward to your applications!
For admitted Cornell PhD students: If you are interested in working with me, please send me your CV along with a few paragraphs describing your past research experience and current research interests.
For undergraduate/MS students at Cornell and outside visitors: please send me your CV and (unofficial) transcript along with two paragraphs describing your research
interests, research experience, and why you want to get involved. Your chance of getting involved is higher if more of the following hold true: you have a high GPA; you did well
in courses related to math, statistics, machine learning, robotics, and NLP; you are able to commit 12+ hours per week to research;
you have strong programming skills; you have experience with applications such as natural language processing, robotics, and computer vision.
Research
My group works on Reinforcement Learning, AI, and Decision Making.
The most recent research directions of the lab are
We periodically update the book draft. The content is based on the
courses taught by Nan at UIUC, the courses taught by Alekh and Sham at UW, and
CS 6789 at Cornell.
Just as GANs led to GAIL in IRL, can diffusion models, a more powerful class of generative models, be applied to IRL to achieve similar success?
We provide an answer in this work.
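For background only (this is the classical GAN-style IRL objective that GAIL popularized, not the new diffusion-based method of this work; the discriminator D, expert policy \pi_E, and entropy weight \lambda are generic notation):
\[
\min_{\pi}\max_{D}\;\mathbb{E}_{(s,a)\sim\pi}\big[\log D(s,a)\big]
+\mathbb{E}_{(s,a)\sim\pi_E}\big[\log\big(1-D(s,a)\big)\big]
-\lambda H(\pi).
\]
The work above asks how a diffusion model can take over the generative/discriminative role that the GAN plays in this minimax game.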
We provide a new RL policy optimization algorithm for multi-turn RLHF.
The algorithm enables efficient optimization of an 8B model in long conversations against a 70B model.
New offline RLHF algorithms that avoid overoptimization and achieve single-policy-coverage-style guarantees by regularizing with the chi-squared divergence.
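Schematically, and only as a sketch with generic notation (\hat r is a learned reward, \pi_{\mathrm{ref}} the reference policy, \beta the regularization weight; the paper's exact formulation may differ), chi-squared regularization swaps the usual KL penalty for the chi-squared divergence:
\[
\max_{\pi}\;\mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[\hat r(x,y)\big]-\beta\,\chi^2\big(\pi\,\|\,\pi_{\mathrm{ref}}\big),
\qquad
\chi^2\big(\pi\,\|\,\pi_{\mathrm{ref}}\big)=\mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot\mid x)}\Big[\Big(\tfrac{\pi(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}-1\Big)^{2}\Big].
\]
Since the chi-squared divergence upper bounds the KL divergence, it penalizes distribution shift away from the offline data more aggressively, which is the intuition behind the single-policy-coverage-style guarantees.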
A new RL algorithm for generative model optimization that outperforms PPO on both text and image generation, with state-of-the-art performance on LLM benchmarks.
We study why DPO is not equivalent to RLHF, and what the key benefit of online samples in RLHF is. The new hybrid algorithm, which uses both online and offline data,
outperforms DPO on standard RLHF benchmarks.
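For reference, these are the two standard objects being compared (textbook formulations, not the paper's new hybrid algorithm; \beta, \pi_{\mathrm{ref}}, and the preference pair (y_w, y_l) are generic notation): the KL-regularized RLHF objective and the DPO loss derived from it,
\[
\max_{\pi}\;\mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r(x,y)\big]-\beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big),
\qquad
\mathcal{L}_{\mathrm{DPO}}(\pi)=-\,\mathbb{E}_{(x,y_w,y_l)}\Big[\log\sigma\Big(\beta\log\tfrac{\pi(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\tfrac{\pi(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\Big)\Big].
\]
The work above studies when the equivalence between these two formulations breaks down and why online samples help.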
We show that distributional RL enables faster learning when the system has low variance. This holds simultaneously for contextual bandits, online RL, and offline RL.
We provide the first rigorous mathematical explanation of why and when maximum-likelihood-estimation-based distributional RL can be better than regular RL
in contextual bandits, online RL, and offline RL. The new distributional contextual bandit algorithm empirically outperforms prior CB algorithms.
Combining online and offline data can solve RL with both statistical and computational efficiency. Experiments on Montezuma's Revenge (a video game)
reveal that hybrid RL works much better than pure online RL and pure offline RL.
A general model-free actor-critic framework for POMDPs that generalizes special instances including tabular POMDPs, Linear Quadratic Gaussians,
POMDPs with Hilbert space embeddings, and POMDPs with low-rank structure.
Standard self-supervised representation learning approaches fail to work in offline RL due to distribution shift and the sequential nature of the problem. Our new
self-supervised representation learning approach works in theory and in practice for offline RL.
An efficient rich-observation RL algorithm that learns to decode rich observations into latent states (via adversarial training) while balancing exploration and exploitation.
We show that partial coverage and realizability are enough for efficient model-based learning in offline RL; notable examples include low-rank MDPs, KNRs, and factored MDPs.
We show how to mitigate covariate shift by leveraging offline data that only provides partial coverage.
A by-product of this work is a new result for offline RL: partial coverage and robustness (i.e., being able to compete against any policy covered by the offline data).
IL from observations is strictly harder than classic IL; we incorporate exploration into the min-max IL framework (balancing exploration and imitation) to solve IL from observations
near-optimally in theory and efficiently in practice.
A general framework that enables (1) active action elimination in RL, and (2) provably robust exploration under adversarial corruption of both rewards and transitions.
An on-policy algorithm that is robust to a constant fraction of adversarial corruption;
the TRPO/NPG-based implementation scales to high-dimensional control tasks and is robust to strong data corruption.
We propose a simple model-based algorithm that achieves state-of-the-art performance in both dense-reward continuous control tasks and sparse-reward control tasks that require efficient exploration.
A new structural complexity measure captures generalization in RL with function approximation in both model-free and model-based settings.
Notably, we show that MDPs with linear Q* and linear V* are PAC learnable.
We study the advantages of on-policy policy gradient methods over off-policy methods such as Q-learning,
and provide a new PG algorithm with exploration.
We study learning-to-control for nonlinear systems captured by RKHSs or Gaussian processes.
While the setting is more general, the regret bound is near-optimal when specialized to LQRs.
We study Sim-to-Real under a model-based framework, resulting in an algorithm that enjoys strong theoretical guarantees and excellent empirical performance.
We frame IL from observations alone as a sequence of two-player minimax games, and obtain polynomial sample complexity for learning a near-optimal policy with general function approximation.
Exploration in action space can be much more efficient than zeroth-order methods when the number of policy parameters is much larger than the dimension of the action space and the planning horizon.
Can be viewed as an actor-critic algorithm with the critic being the expert's state-action Q-function; we establish an exponential sample complexity separation between IL and pure RL.