Pinned
When LLMs don’t do what we want, we often tell them exactly what/how to change. Ideally, models could learn from this feedback, which is much richer and denser than scalar rewards used for RL. In our new paper, we study how to expand the capabilities of RL via text feedback:












