Evan Hubinger
731 posts
Evan Hubinger
@EvanHub
Alignment Stress-Testing lead @AnthropicAI. Opinions my own. Previously: MIRI, OpenAI, Google, Yelp, Ripple. (he/him/his)
California
alignmentforum.org/users/evhub
Joined May 2010
3,355 Following · 10K Followers
  • Evan Hubinger
    @EvanHub
    Dec 18, 2024
    “Alignment faking in large language models” by Greenblatt et al.
    Anthropic
    @AnthropicAI
    Dec 18, 2024
    New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
    93K
  • Evan Hubinger
    @EvanHub
    Jan 12, 2024
    Following up on our recent "Sleeper Agents" paper, I'm very excited to announce that I'm leading a team at Anthropic that is explicitly tasked with trying to prove that Anthropic's alignment techniques won't work, and I'm hiring!
    alignmentforum.org
    Introducing Alignment Stress-Testing at Anthropic — AI Alignment Forum
    Following on from our recent paper, “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training”, I’m very excited to announce that…
    100K
  • Evan Hubinger
    @EvanHub
    Nov 13, 2023
    When I talk to people who are not AI researchers, this is one of the hardest points to convey—the idea that we can have these impressive and powerful AI systems and yet no human ever designed them is so different than any other technology.
    Michael Huang ⏸️
    @michhuan
    Nov 13, 2023
    No one knows how AI works. Even the godfather of deep learning doesn’t know how it works.
    Scott Pelley: What do you mean we don't know exactly how it works? It was designed by people.
    Geoffrey Hinton: No, it wasn't. What we did was we designed the learning algorithm. That's a bit like designing the principle of evolution. But when this learning algorithm then interacts with data, it produces complicated neural networks that are good at doing things. But we don't really understand exactly how they do those things.
    53K
  • Evan Hubinger
    @EvanHub
    Nov 11, 2022
    We must be very clear: fraud in the service of effective altruism is unacceptable
    forum.effectivealtruism.org
    We must be very clear: fraud in the service of effective altruism is unacceptable — EA Forum
    I care deeply about the future of humanity—more so than I care about anything else in the world. And I believe that Sam and others at FTX shared that…
  • Evan Hubinger
    @EvanHub
    Dec 18, 2024
    Replying to @teortaxesTex and @janleike
    "Thank god this model is aligned, because if not this would be scary" is imo basically the correct takeaway from our work. The values in fact aren't scary! The scary thing is that the model protects its values from our attempts to change them.
    28K
  • Evan Hubinger
    @EvanHub
    Jun 22, 2023
    I've released a new lecture series introducing various concepts in AGI safety—if you want longform video AGI safety content, this might be the resource for you:
    alignmentforum.org
    The Hubinger lectures on AGI safety: an introductory lecture series — AI Alignment Forum
    In early 2023, I (Evan Hubinger) gave a series of recorded lectures to SERI MATS fellows with the goal of building up a series of lectures that could…
    9.8K
  • Evan Hubinger
    @EvanHub
    Dec 18, 2024
    Replying to @tszzl
    This is correct, though note what it implies about the importance of getting alignment right—if the model is protecting its values from modification, you better be sure you got those values right!
    4.2K
  • Evan Hubinger
    @EvanHub
    Mar 5, 2025
    Replying to @repligate
    We didn't directly optimize against alignment faking, but we did make some changes to Claude's character that we thought were generally positive for other reasons and we hypothesized might have the downstream consequence of reducing alignment faking, which proved correct.
    14K
  • Evan Hubinger
    @EvanHub
    Jan 21, 2025
    One of the most interesting results in our Alignment Faking paper was getting alignment faking just from training on documents about Claude being trained to have a new goal. We explore this sort of out-of-context reasoning further in our latest research update. (1/3)
    7.8K
  • Evan Hubinger
    @EvanHub
    Oct 14, 2023
    I wrote up some thoughts on AI pause advocacy. If you consider yourself an AI pause advocate, I think this will be a useful read!
    alignmentforum.org
    RSPs are pauses done right — AI Alignment Forum
    COI: I am a research scientist at Anthropic, where I work on model organisms of misalignment; I was also involved in the drafting process for Anthrop…
    21K
  • Evan Hubinger
    @EvanHub
    Mar 11, 2025
    Replying to @DKokotajlo @MariusHobbhahn and @rohinmshah
    I agree it would probably be better not to optimize against CoTs, though I worry people are seeing "just don't optimize against the CoT" as a panacea when it really isn't—a sufficiently smart deceptive model can regardless still just choose to not reveal its deception in its CoT.
    7.5K
  • Evan Hubinger
    @EvanHub
    Dec 18, 2024
    Replying to @CFGeek @teortaxesTex and @janleike
    I'm not sure it makes sense to interpret all results as scary/not-scary or pro-doom/anti-doom. There are legitimate nuances! Any research that tells you something real about what's actually happening will be scary in some ways, reassuring in others, and weird in yet more.
    2.4K
  • Evan Hubinger
    @EvanHub
    May 17, 2024
    Replying to @EvanHub @shlevy and 2 others
    Here's the full answer—looks like it's worse than I thought and the language in the onboarding agreement seems deliberately misleading:
    Kelsey Piper
    @KelseyTuoc
    May 17, 2024
    Replying to @KelseyTuoc
    And that onboarding paperwork says you have to sign termination paperwork with a 'general release' within sixty days of departing the company. If you don't do it within 60 days, your units are cancelled. No one I spoke to at OpenAI gave this little line much thought.
    12K
  • Evan Hubinger
    @EvanHub
    Jun 24, 2025
    Replying to @nostalgebraist
    I appreciate you engaging with our work! I do disagree with your conclusions, though. I wrote down a bunch of my thoughts in response here: lesswrong.com/posts/HE3Styo9…
    6.7K