New research on beneficial RL: models trained on a small amount of beneficial-trait data improve on a wide range of alignment and benefits evaluations, even when trained only on health-domain data.
We hope it’s a step towards more broadly and persistently beneficial models. 🧵