This paper dramatically changed how I think about the internal representations of models. Check it out :)
Interpreting and controlling internal representations should be based on how the model actually uses them!
Turns out: information geometry makes this precise. We show how, and use it to derive a (provably & empirically) robust strategy for steering.
arxiv.org/abs/2602.15293








