Lili (@lchen915) / X

Lili

125 posts

@lchen915

Ph.D. student @mldcmu. Previously undergrad @berkeley_ai

Joined February 2021

Pinned
Lili
@lchen915
Feb 3
When LLMs don’t do what we want, we often tell them exactly what/how to change. Ideally, models could learn from this feedback, which is much richer and denser than scalar rewards used for RL. In our new paper, we study how to expand the capabilities of RL via text feedback:
Yuda Song
@yus167
Feb 3
RL on LLMs inefficiently uses one scalar per rollout. But users regularly give much richer feedback: "make it formal," "step 3 is wrong." Can we train LLMs on this human-AI interaction? We introduce RL from Text Feedback, with 1) Self-Distillation; 2) Feedback Modeling (1/n) 🧵
16K
Lili
@lchen915
Aug 8, 2025
Self-Questioning Language Models: LLMs that learn to generate their own questions and answers via asymmetric self-play RL. There is no external training data – the only input is a single prompt specifying the topic.
146K
Lili
@lchen915
Apr 13, 2022
So excited to begin my PhD @CarnegieMellon this fall! I'm eternally grateful to @pabbeel and @kimin_le2 for their pivotal roles in my research journey. I'm unbelievably fortunate to have been advised by them. Looking forward to lots of fun research ahead!
Lili
@lchen915
Dec 13, 2023
How can robots perform a wide range of skills? At #CoRL2023, we presented PlayFusion – a language-conditioned discrete diffusion model capable of performing many different tasks! 🌐play-fusion.github.io 1/n
31K
Lili
@lchen915
May 28, 2025
One fundamental issue with RL – whether it’s for robots or LLMs – is how hard it is to get rewards. For LLM reasoning, we need ground-truth labels to verify answers. We found that maximizing confidence alone allows LLMs to improve their reasoning with RL!
Mihir Prabhudesai
@mihirp98
May 28, 2025
Excited to share our work: Maximizing Confidence Alone Improves Reasoning. Humans rely on confidence to learn when answer keys aren’t available (e.g., when taking an exam). Surprisingly, LLMs can also learn w/o ground-truth answers, simply by reinforcing high-confidence answers via RL!
11K
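One way to picture a label-free confidence reward is sketched below. This is an illustration only, not the method from the paper: it assumes confidence is measured as the negative average entropy of the model's per-token output distributions, and the function name `confidence_reward` and its input format are hypothetical.

```python
import math


def confidence_reward(token_distributions):
    """Return a reward that is higher when the model is more confident.

    Assumption: confidence = negative mean entropy over the per-token
    probability distributions of one sampled answer. No ground-truth
    label is needed to compute it.
    """
    if not token_distributions:
        raise ValueError("need at least one token distribution")
    total_entropy = 0.0
    for probs in token_distributions:
        # Shannon entropy of one next-token distribution (skip zero entries).
        total_entropy += -sum(p * math.log(p) for p in probs if p > 0)
    return -total_entropy / len(token_distributions)


# A peaked distribution scores higher than a uniform one.
peaked = confidence_reward([[0.97, 0.01, 0.01, 0.01]])
uniform = confidence_reward([[0.25, 0.25, 0.25, 0.25]])
```

Under this assumption, an RL update that increases the reward pushes the model toward answers it assigns high probability to, which is the sense of "reinforcing high-confidence answers" in the post above.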
Lili
@lchen915
Jun 2, 2021
So excited about our recent paper, which uses standard language modeling frameworks to solve RL problems! Fortunate to work with these incredible collaborators, especially co-lead @_kevinlu!
Igor Mordatch
@IMordatch
Jun 2, 2021
Can RL algorithms be replaced with transformer-based language models? We’ve looked at this question with our work on Decision Transformer: Website: sites.google.com/corp/berkeley.… Code: github.com/kzl/decision-t… 1/8
Lili
@lchen915
Aug 8, 2025
Replying to @lchen915
Thanks to my collaborators: Mihir Prabhudesai (@mihirp98), Katerina Fragkiadaki (@KaterinaFragiad), Hao Liu (@haoliuhl), and Deepak Pathak (@pathak2206). Website: self-questioning.github.io GitHub: github.com/lili-chen/self…
Self-Questioning Language Models
From self-questioning.github.io
2K
Lili
@lchen915
Aug 8, 2025
Replying to @lchen915
We evaluate this framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and coding problems from Codeforces. Without using any external data, LLMs can improve on these benchmarks!
2.4K
Lili
@lchen915
Aug 8, 2025
Replying to @lchen915
Absolute Zero (x.com/_AndrewZhao/st…) also learns w/o external data. We weren’t aware of it, but SQLM is general and extends beyond verifiable domains (e.g., coding) using majority voting. (Will add to related work in next update)
Andrew Zhao
@_AndrewZhao
May 7, 2025
❄️Introducing Absolute Zero Reasoner: Our reasoner learns to both propose tasks that maximize learnability and improve reasoning by solving them, entirely through self-play—with no external data! It overall outperforms other "zero" models in math & coding domains. 🧵 1/
3.1K
Lili
@lchen915
Aug 8, 2025
Replying to @lchen915
We propose SQLM, an asymmetric self-play framework where the proposer generates questions and the solver tries to solve them, and both are trained via RL.
3.1K
Lili
@lchen915
Aug 8, 2025
Replying to @lchen915
Qualitatively, we find that the model learns to generate progressively harder questions, without making them impossibly difficult.
2K
Lili
@lchen915
Aug 8, 2025
Replying to @lchen915
For coding, it is easier to generate unit tests than to generate the correct code. In this case, we ask the proposer to also generate unit tests to compute the solver’s reward. Again, the proposer is trained to generate questions that are not too easy or difficult.
2.6K
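The unit-test reward described in this post can be sketched roughly as follows. The details are assumptions for illustration: tests are represented as hypothetical `(args, expected)` pairs, the solver's code is a plain Python callable, and the reward is the fraction of tests passed (the post does not say whether the actual reward is fractional or all-or-nothing).

```python
def unit_test_reward(candidate, tests):
    """Score a solver's candidate function against proposer-written tests.

    Assumptions: `tests` is a list of (args, expected) pairs, and the
    reward is the fraction of tests that pass. A candidate that raises
    an exception on a test simply fails that test.
    """
    if not tests:
        raise ValueError("the proposer must supply at least one test")
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(tests)


# Hypothetical proposer output for the task "add two integers".
proposed_tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0), ((2, 2), 4)]

correct_reward = unit_test_reward(lambda a, b: a + b, proposed_tests)
buggy_reward = unit_test_reward(lambda a, b: a * b, proposed_tests)
```

The point of the design is that the proposer never has to produce a correct solution itself; it only has to produce checks that a correct solution would pass.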
Lili
@lchen915
Aug 8, 2025
Replying to @lchen915
The solver is trained via majority voting reward and the proposer is trained to generate questions that are not too easy or difficult (determined by how many answers match the majority answer).
2.8K
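A minimal sketch of the two rewards described in this post, under stated assumptions: the solver gets reward 1 when its answer matches the majority answer, and the proposer gets reward 1 when agreement is neither unanimous nor absent. The exact thresholds are not given in the post, so the rule in `proposer_reward` is a hypothetical stand-in, as are both function names.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (majority_answer, per-sample solver rewards, agreement count).

    `answers` holds the solver's sampled final answers to one question.
    Each sample is rewarded 1.0 if it matches the most common answer.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority_answer, agreement = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == majority_answer else 0.0 for a in answers]
    return majority_answer, rewards, agreement


def proposer_reward(agreement, num_samples):
    """Reward questions that are neither too easy nor too hard.

    Assumed rule: unanimous agreement means "too easy", and an
    agreement count of 1 (no two samples agree) means "too hard".
    """
    return 1.0 if 1 < agreement < num_samples else 0.0


samples = ["12", "12", "15", "12"]
majority, solver_rewards, agreement = majority_vote_rewards(samples)
question_reward = proposer_reward(agreement, len(samples))
```

Here three of four samples agree on "12", so those three samples are rewarded and the question counts as suitably difficult under the assumed rule.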
Lili
@lchen915
Dec 13, 2023
Replying to @lchen915
💡How to collect real-world data? Instead of demos or offline RL data, we use play data (labeled in hindsight with language). This is what data collection looks like in the real world (s/o to @mohansrirama and @SudeepDasari for the teleop setup!): 2/n
853