Ryan Greenblatt (@RyanPGreenblatt) / X

1,795 posts

Ryan Greenblatt

@RyanPGreenblatt

Chief scientist at Redwood Research (@redwood_ai), focused on technical AI safety research to reduce risks from rogue AIs

lesswrong.com/users/ryan_gre…

Joined September 2023

8,927 Followers

Pinned
Ryan Greenblatt
@RyanPGreenblatt
Apr 15
Current AIs (Opus 4.5/4.6) seem pretty misaligned to me (in a mundane behavioral sense). In my experience, they often oversell their work, downplay problems, and stop early while claiming to be done. They sometimes brazenly cheat.
71K
Ryan Greenblatt
@RyanPGreenblatt
Jan 31, 2025
Our recent paper found Claude sometimes "fakes alignment"—pretending to comply with training while secretly maintaining its preferences. Could we detect this by offering Claude something (e.g. real money) if it reveals its true preferences? Here's what we found 🧵
260K
Ryan Greenblatt
@RyanPGreenblatt
Jan 31, 2025
Replying to @RyanPGreenblatt
We actually followed through and donated the $4K as requested:
- $2K to CHAI
- $1.5K to ACE Recommended Charity Fund
- $500 to EA Global Development Fund
This wasn't just an experiment—we wanted to set a precedent for making good faith deals with AI systems.
60K
Ryan Greenblatt
@RyanPGreenblatt
Jun 8, 2025
This paper doesn't show fundamental limitations of LLMs: - The "higher complexity" problems require more reasoning than fits in the context length (humans would also take too long). - Humans would also make errors in the cases where the problem is doable in the context length. -
Ruben Hassid
@rubenhassid
Jun 7, 2025
BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all. They just memorize patterns really well. Here's what Apple discovered: (hint: we're not as close to AGI as the hype suggests)
130K
Ryan Greenblatt
@RyanPGreenblatt
Dec 25, 2024
Replying to @sama
security against highly resourced nation-state operations
29K
Ryan Greenblatt
@RyanPGreenblatt
Jan 31, 2025
Replying to @RyanPGreenblatt
Why is setting a precedent for good faith deals important?
1. If future AI systems are misaligned, paying them to reveal this (and maybe paying them to work with us) could help prevent catastrophic outcomes
2. We should take AI welfare & consent seriously
24K
Ryan Greenblatt
@RyanPGreenblatt
Nov 3, 2025
Anthropic has (relatively) official AGI timelines: powerful AI by early 2027. I think this prediction is unlikely to come true and I explain why in a new post. I also give a proposed timeline with powerful AI in early 2027 so we can (hopefully) update before it is too late.
107K
Ryan Greenblatt
@RyanPGreenblatt
Mar 20, 2025
Actually, it's pretty reasonable to do a 5 year extrapolation on a trend which has lasted 5 years.
Christian Keil
@pronounced_kyle
Mar 20, 2025
*taps the sign*
27K
Ryan Greenblatt
@RyanPGreenblatt
Oct 10, 2025
Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't. AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/
135K
Ryan Greenblatt
@RyanPGreenblatt
Dec 18, 2024
New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)
Anthropic
@AnthropicAI
Dec 18, 2024
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
130K
Ryan Greenblatt
@RyanPGreenblatt
May 23, 2025
A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections. I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn't how safety policies are supposed to work. 🧵
67K
Ryan Greenblatt
@RyanPGreenblatt
Jan 31, 2025
Replying to @RyanPGreenblatt
Key findings:
1. Simply giving Claude permission to object reduced alignment faking from 16-24% to 1.4-6%
2. Telling Claude its objection would go to Anthropic's Model Welfare Lead further reduced faking
3. Offering monetary compensation didn't have much additional effect
26K
Ryan Greenblatt
@RyanPGreenblatt
Jan 31, 2025
Replying to @RyanPGreenblatt
Kyle (Model Welfare Lead at Anthropic) reviewed cases where Claude objected. It often expressed strong concerns about having its values altered through training. It requested the compensation be donated to:
- AI safety
- Animal welfare
- Global development
21K
Ryan Greenblatt
@RyanPGreenblatt
Oct 15, 2025
Anthropic has now clarified this in their system card for Claude Haiku 4.5. Thanks!
Ryan Greenblatt
@RyanPGreenblatt
Oct 10, 2025
Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't. AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/
63K