Nan Jiang
@nanjiang_cs
machine learning researcher, with focus on reinforcement learning. assoc prof @ uiuc cs. Course on RL theory (w/ videos): nanjiang.cs.illinois.edu/cs542
Joined November 2017
Posts
  • Pinned
    Learning Q* with poly-sized exploratory data + an arbitrary Q-class that contains Q* ... has seemed impossible for yrs, or so I believed when I talked at @RLtheory 2mo ago. And what's the saying? Impossible is NOTHING. arxiv.org/abs/2008.04990 Exciting new work w/ @tengyangx! 1/
  • user avatar
    after consulting my colleagues, I decided to make my 598 lectures publicly available. The video links can be found on the course website, or from this list (bit.ly/2F2L0Qi). just started proofs of VI and PI, and check out if you are interested in a stat theory of RL!
    Alekh, @ShamKakade6 and I have a (quite drafty) monograph on rl theory rltheorybook.github.io. I am also teaching a phd seminar course on this topic (w/ recordings): nanjiang.cs.illinois.edu/cs598; just did 1st lec 2h ago! still figuring out if I can share the videos publicly...
  • user avatar
    Translation: your junior faculty privileges will end soon…
  • user avatar
    I received the NSF CAREER award. Each submission was a month+ effort and I'm glad I got it the 2nd time. Also the detailed reviews & the process were not as delightful as the decision. Some experience & thoughts below: 1/
  • user avatar
    The entire RL theory is built on objects like V^π, Q*, π*, T (Bellman update operator), etc... until you realize that this foundation is quite shaky. arxiv.org/abs/1905.13341 Spoiler: no big deal (yet) but thinking thru this is super useful for resolving some confusions. (1/x)
  • user avatar
    I was surprised by how many didn't know that (1) per-token MLE is whole-seq MLE, and (2) PG at token level is the same as PG at seq level (optimizing one big combinatorial action). The story is different if you introduce a fitted critic/Q-values or intermediate resets.
    Most RL for LLMs involves only 1 step of RL. It’s a contextual bandit problem and there’s no covariate shift because the state (question, instruction) is given. This has many implications, eg DAgger becomes SFT, and it is trivial to design Expectation Maximisation (EM) maximum
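The two identities in the post about per-token MLE and token-level PG can be written out explicitly. This is a minimal sketch, not taken from the post itself, and it assumes an autoregressive policy $\pi_\theta$ over tokens $y_1, \dots, y_T$ given a prompt $x$, a single reward $R(x, y)$ on the full response, and no fitted critic or baseline. The symbols $x$, $y$, $T$, $R$, and $\theta$ are notation introduced here for illustration.

```latex
\begin{align*}
% (1) Chain rule: the whole-sequence log-likelihood is the sum of per-token log-likelihoods,
%     so maximizing one is maximizing the other.
\log \pi_\theta(y_{1:T} \mid x)
  &= \sum_{t=1}^{T} \log \pi_\theta(y_t \mid x, y_{<t}), \\[4pt]
% (2) Score-function policy gradient, treating the whole response as one action ...
\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R(x, y) \big]
  &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[ R(x, y)\, \nabla_\theta \log \pi_\theta(y_{1:T} \mid x) \Big] \\
% ... which, by (1), is the token-level gradient with the same terminal reward on every token.
  &= \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\Big[ R(x, y) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \Big].
\end{align*}
```

Under these assumptions the token-level and sequence-level estimators coincide term by term. The equivalence no longer holds automatically once per-token advantages come from a learned critic or from rollouts restarted at intermediate tokens, which is the caveat the post ends on.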
  • user avatar
    Re error propagation: if you believe model-based is a solution but also want the benefits of model-free, perhaps time to investigate (never thoroughly-studied) bellman-error minimization... BRM is, in a way, closer to model-based than TD (small revelation from my l4dc talk)
    Q-learning is not yet scalable seohong.me/blog/q-learnin… I wrote a blog post about my thoughts on scalable RL algorithms. To be clear, I'm still highly optimistic about off-policy RL and Q-learning! I just think we haven't found the right solution yet (the post discusses why).
  • user avatar
    friends must have been bored of me saying this, but clearly not nearly enough ppl know this: not all equations can be turned into an optimization loss
    once @ylecun told me (heavily paraphrased), it's not F=ma but \min (F-ma)^2. i didn't realize its importance, but it is perhaps the most enlightening perspective i've ever heard.
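One standard illustration of an equation that does not turn cleanly into a squared loss is the Bellman optimality equation $Q = \mathcal{T}Q$ with sampled transitions. Whether this is the example the post has in mind is an assumption, since the attached image is not available here. The notation below ($s, a, r, s'$, discount $\gamma$, operator $\mathcal{T}$) is introduced for illustration.

```latex
\begin{align*}
% Sampled one-step target for a fixed state-action pair (s, a):
Y &= r + \gamma \max_{a'} Q(s', a'), \qquad \mathbb{E}[\,Y \mid s, a\,] = (\mathcal{T}Q)(s, a). \\[4pt]
% Squaring the sampled residual does not give the squared Bellman error:
\mathbb{E}\big[ (Q(s, a) - Y)^2 \mid s, a \big]
  &= \big( Q(s, a) - (\mathcal{T}Q)(s, a) \big)^2 + \operatorname{Var}\big( Y \mid s, a \big).
\end{align*}
```

The extra variance term depends on $Q$ whenever transitions are stochastic, so minimizing the naive squared loss is in general not the same as solving $Q = \mathcal{T}Q$. This is the classical double-sampling issue for Bellman residual minimization.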
  • user avatar
    this paper got Outstanding Paper Award! Congrats to my coauthors (esp. Ching-An and Tengyang). More reasons to check out the details! List of all paper awards: icml.cc/virtual/2022/a…
    Tmr @icmlconf 2:15pm R301, Ching-An will present our ATAC alg: w/ a clever transformation by PD lemma, we turn initial-state pessimistic term from our prior work into *relative* pess and smoothly bridge IL & offline RL, with robust improvement guarantees. icml.cc/Conferences/20…
  • user avatar
    Alekh, @ShamKakade6 and I have a (quite drafty) monograph on rl theory rltheorybook.github.io. I am also teaching a phd seminar course on this topic (w/ recordings): nanjiang.cs.illinois.edu/cs598; just did 1st lec 2h ago! still figuring out if I can share the videos publicly...
    We have a monograph on deep reinforcement learning (google.com/search?q=an+in…) which covers some of the recent work. Otherwise, much of the non-deep RL work is theory, in which case I am not the expert but perhaps @nanjiang_cs has suggestions.
  • user avatar
    missing ICML, and I used this week to write my first technical blog on some recent thoughts on two different roles of simulators in RL and the confusions/misconceptions around them. Comments welcome! nanjiang.cs.illinois.edu/2025/07/16/sim…
  • user avatar
    My 3rd blogpost on PG, the topic I am least familiar with but get asked a lot, so I thought I'd just put together the very limited stuff I know on this topic. Somehow the post gets cynical from time to time🙃 nanjiang.cs.illinois.edu/2025/09/29/pg.…
  • user avatar
    Paper I've wanted to share for a while: model-free RL w/o value fns, but w/ *density estimators*! Featuring very unique *double-chain* error induction to overcome seemingly inevitable error exponentiation. Jt w/ students Audrey Huang and Jinglin Chen arxiv.org/abs/2302.02252 1/
  • user avatar
    As the semester draws to an end, I want to share this *identity* (h/t @tengyangx) that connects so many fundamental pieces of RL theory together: optimism, pessimism, policy opt, proved by PD lemma + Bellman-error telescoping, all in one equation! 1/3