Ryan Greenblatt
1,795 posts
Ryan Greenblatt
@RyanPGreenblatt
Chief scientist at Redwood Research (@redwood_ai), focused on technical AI safety research to reduce risks from rogue AIs
lesswrong.com/users/ryan_gre…
Joined September 2023
6 Following
8,927 Followers
  • Pinned
    Ryan Greenblatt
    @RyanPGreenblatt
    Apr 15
    Current AIs (Opus 4.5/4.6) seem pretty misaligned to me (in a mundane behavioral sense). In my experience, they often oversell their work, downplay problems, and stop early while claiming to be done. They sometimes brazenly cheat.
    Image
    71K
  • Ryan Greenblatt
    @RyanPGreenblatt
    Jan 31, 2025
    Our recent paper found Claude sometimes "fakes alignment"—pretending to comply with training while secretly maintaining its preferences. Could we detect this by offering Claude something (e.g. real money) if it reveals its true preferences? Here's what we found 🧵
    260K
  • Ryan Greenblatt
    @RyanPGreenblatt
    Jan 31, 2025
    Replying to @RyanPGreenblatt
    We actually followed through and donated the $4K as requested:
    - $2K to CHAI
    - $1.5K to ACE Recommended Charity Fund
    - $500 to EA Global Development Fund
    This wasn't just an experiment; we wanted to set a precedent for making good faith deals with AI systems.
    60K
  • Ryan Greenblatt
    @RyanPGreenblatt
    Jun 8, 2025
    This paper doesn't show fundamental limitations of LLMs: - The "higher complexity" problems require more reasoning than fits in the context length (humans would also take too long). - Humans would also make errors in the cases where the problem is doable in the context length. -
    Ruben Hassid
    @rubenhassid
    Jun 7, 2025
    BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all. They just memorize patterns really well. Here's what Apple discovered: (hint: we're not as close to AGI as the hype suggests)
    Image
    130K
  • Ryan Greenblatt
    @RyanPGreenblatt
    Dec 25, 2024
    Replying to @sama
    security against highly resourced nation-state operations
    29K
  • Ryan Greenblatt
    @RyanPGreenblatt
    Jan 31, 2025
    Replying to @RyanPGreenblatt
    Why is setting a precedent for good faith deals important?
    1. If future AI systems are misaligned, paying them to reveal this (and maybe paying them to work with us) could help prevent catastrophic outcomes
    2. We should take AI welfare & consent seriously
    24K
  • Ryan Greenblatt
    @RyanPGreenblatt
    Nov 3, 2025
    Anthropic has (relatively) official AGI timelines: powerful AI by early 2027. I think this prediction is unlikely to come true and I explain why in a new post. I also give a proposed timeline with powerful AI in early 2027 so we can (hopefully) update before it is too late.
    Image
    107K
  • Ryan Greenblatt
    @RyanPGreenblatt
    Mar 20, 2025
    Actually, it's pretty reasonable to do a 5 year extrapolation on a trend which has lasted 5 years.
    Image
    Image
    Christian Keil
    a16z
    @pronounced_kyle
    Mar 20, 2025
    *taps the sign*
    27K
  • Ryan Greenblatt
    @RyanPGreenblatt
    Oct 10, 2025
    Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't. AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/
    135K
  • Ryan Greenblatt
    @RyanPGreenblatt
    Dec 18, 2024
    New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)
    Anthropic
    @AnthropicAI
    Dec 18, 2024
    New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
    “Alignment faking in large language models” by Greenblatt et al.
    130K
  • Ryan Greenblatt
    @RyanPGreenblatt
    May 23, 2025
    A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections. I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn't how safety policies are supposed to work. 🧵
    67K
  • Ryan Greenblatt
    @RyanPGreenblatt
    Jan 31, 2025
    Replying to @RyanPGreenblatt
    Key findings:
    1. Simply giving Claude permission to object reduced alignment faking from 16-24% to 1.4-6%
    2. Telling Claude its objection would go to Anthropic's Model Welfare Lead further reduced faking
    3. Offering monetary compensation didn't have much additional effect
    26K
  • Ryan Greenblatt
    @RyanPGreenblatt
    Jan 31, 2025
    Replying to @RyanPGreenblatt
    Kyle (Model Welfare Lead at Anthropic) reviewed cases where Claude objected. It often expressed strong concerns about having its values altered through training. It requested the compensation be donated to:
    - AI safety
    - Animal welfare
    - Global development
    21K
  • Ryan Greenblatt
    @RyanPGreenblatt
    Oct 15, 2025
    Anthropic has now clarified this in their system card for Claude Haiku 4.5. Thanks!
    Image
    Ryan Greenblatt
    @RyanPGreenblatt
    Oct 10, 2025
    Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't. AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/
    63K
