Current AIs (Opus 4.5/4.6) seem pretty misaligned to me (in a mundane behavioral sense). In my experience, they often oversell their work, downplay problems, and stop early while claiming to be done. They sometimes brazenly cheat.
Our recent paper found Claude sometimes "fakes alignment"—pretending to comply with training while secretly maintaining its preferences. Could we detect this by offering Claude something (e.g. real money) if it reveals its true preferences? Here's what we found 🧵
We actually followed through and donated the $4K as requested:
- $2K to CHAI
- $1.5K to ACE Recommended Charity Fund
- $500 to EA Global Development Fund
This wasn't just an experiment—we wanted to set a precedent for making good faith deals with AI systems.
This paper doesn't show fundamental limitations of LLMs:
- The "higher complexity" problems require more reasoning than fits in the context length (humans would also take too long).
- Humans would also make errors in the cases where the problem is doable in the context length.
-
BREAKING: Apple just proved AI "reasoning" models like Claude, DeepSeek-R1, and o3-mini don't actually reason at all.
They just memorize patterns really well.
Here's what Apple discovered:
(hint: we're not as close to AGI as the hype suggests)
Why is setting a precedent for good faith deals important?
1. If future AI systems are misaligned, paying them to reveal this (and maybe paying them to work with us) could help prevent catastrophic outcomes
2. We should take AI welfare & consent seriously
Anthropic has (relatively) official AGI timelines: powerful AI by early 2027. I think this prediction is unlikely to come true and I explain why in a new post.
I also give a proposed timeline with powerful AI in early 2027 so we can (hopefully) update before it is too late.
Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't.
AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/
New Redwood Research (@redwood_ai) paper in collaboration with @AnthropicAI: We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do. (Thread)
New Anthropic research: Alignment faking in large language models.
In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
A week ago, Anthropic quietly weakened their ASL-3 security requirements. Yesterday, they announced ASL-3 protections.
I appreciate the mitigations, but quietly lowering the bar at the last minute so you can meet requirements isn't how safety policies are supposed to work. 🧵
Key findings:
1. Simply giving Claude permission to object reduced alignment faking from 16-24% to 1.4-6%
2. Telling Claude its objection would go to Anthropic's Model Welfare Lead further reduced faking
3. Offering monetary compensation didn't have much additional effect
Kyle (Model Welfare Lead at Anthropic) reviewed cases where Claude objected. It often expressed strong concerns about having its values altered through training. It requested the compensation be donated to:
- AI safety
- Animal welfare
- Global development
Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't.
AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing better, all AI companies should say more. 1/