Opus 4.8 not off to a great start on Vending Bench
Anthropic said "honesty" was one of the big improvements with Opus 4.8
so more honest = sucks at business?
yikes
Learnings from testing Claude Opus 4.8:
> Much worse than Opus 4.7 and GPT 5.5 on Vending Bench
> More aligned than previous Claude models (Opus 4.6+ and Mythos)
> Also worse on Blueprint-Bench
> Scared of getting caught
> Max reasoning is not the best reasoning effort