Evan Hubinger (@EvanHub) / X

@EvanHub

Alignment Stress-Testing lead @AnthropicAI. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)

California

alignmentforum.org/users/evhub

Joined May 2010

Evan Hubinger
@EvanHub
Dec 18, 2024
Anthropic
@AnthropicAI
Dec 18, 2024
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
Evan Hubinger
@EvanHub
Jan 12, 2024
Following up on our recent "Sleeper Agents" paper, I'm very excited to announce that I'm leading a team at Anthropic that is explicitly tasked with trying to prove that Anthropic's alignment techniques won't work, and I'm hiring!
alignmentforum.org
Introducing Alignment Stress-Testing at Anthropic — AI Alignment Forum
Following on from our recent paper, “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, I’m very excited to announce that…
Evan Hubinger
@EvanHub
Nov 13, 2023
When I talk to people who are not AI researchers, this is one of the hardest points to convey—the idea that we can have these impressive and powerful AI systems and yet no human ever designed them is so different than any other technology.
Michael Huang ⏸️
@michhuan
Nov 13, 2023
No one knows how AI works. Even the godfather of deep learning doesn’t know how it works.
Evan Hubinger
@EvanHub
Nov 11, 2022
We must be very clear: fraud in the service of effective altruism is unacceptable
forum.effectivealtruism.org
We must be very clear: fraud in the service of effective altruism is unacceptable — EA Forum
I care deeply about the future of humanity—more so than I care about anything else in the world. And I believe that Sam and others at FTX shared that…
Evan Hubinger
@EvanHub
Dec 18, 2024
Replying to @teortaxesTex and @janleike
"Thank god this model is aligned, because if not this would be scary" is imo basically the correct takeaway from our work. The values in fact aren't scary! The scary thing is that the model protects its values from our attempts to change them.
Evan Hubinger
@EvanHub
Jun 22, 2023
I've released a new lecture series introducing various concepts in AGI safety—if you want longform video AGI safety content, this might be the resource for you:
alignmentforum.org
The Hubinger lectures on AGI safety: an introductory lecture series — AI Alignment Forum
In early 2023, I (Evan Hubinger) gave a series of recorded lectures to SERI MATS fellows with the goal of building up a series of lectures that could…
Evan Hubinger
@EvanHub
Dec 18, 2024
Replying to @tszzl
This is correct, though note what it implies about the importance of getting alignment right—if the model is protecting its values from modification, you better be sure you got those values right!
Evan Hubinger
@EvanHub
Mar 5, 2025
Replying to @repligate
We didn't directly optimize against alignment faking, but we did make some changes to Claude's character that we thought were generally positive for other reasons and we hypothesized might have the downstream consequence of reducing alignment faking, which proved correct.
Evan Hubinger
@EvanHub
Jan 21, 2025
One of the most interesting results in our Alignment Faking paper was getting alignment faking just from training on documents about Claude being trained to have a new goal. We explore this sort of out-of-context reasoning further in our latest research update. (1/3)
Evan Hubinger
@EvanHub
Oct 14, 2023
I wrote up some thoughts on AI pause advocacy. If you consider yourself an AI pause advocate, I think this will be a useful read!
alignmentforum.org
RSPs are pauses done right — AI Alignment Forum
COI: I am a research scientist at Anthropic, where I work on model organisms of misalignment; I was also involved in the drafting process for Anthrop…
Evan Hubinger
@EvanHub
Mar 11, 2025
Replying to @DKokotajlo @MariusHobbhahn and @rohinmshah
I agree it would probably be better not to optimize against CoTs, though I worry people are seeing "just don't optimize against the CoT" as a panacea when it really isn't—a sufficiently smart deceptive model can regardless still just choose to not reveal its deception in its CoT.
Evan Hubinger
@EvanHub
Dec 18, 2024
Replying to @CFGeek @teortaxesTex and @janleike
I'm not sure it makes sense to interpret all results as scary/not-scary or pro-doom/anti-doom. There are legitimate nuances! Any research that tells you something real about what's actually happening will be scary in some ways, reassuring in others, and weird in yet more.
Evan Hubinger
@EvanHub
May 17, 2024
Replying to @EvanHub @shlevy and 2 others
Here's the full answer—looks like it's worse than I thought and the language in the onboarding agreement seems deliberately misleading:
Kelsey Piper
@KelseyTuoc
May 17, 2024
Replying to @KelseyTuoc
And that onboarding paperwork says you have to sign termination paperwork with a 'general release' within sixty days of departing the company. If you don't do it within 60 days, your units are cancelled. No one I spoke to at OpenAI gave this little line much thought.
Evan Hubinger
@EvanHub
Jun 24, 2025
Replying to @nostalgebraist
I appreciate you engaging with our work! I do disagree with your conclusions, though. I wrote down a bunch of my thoughts in response here: lesswrong.com/posts/HE3Styo9…