xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs.
If xAI is going to be a frontier AI developer, they should act like one. 🧵
Samuel Marks
669 posts
AI safety research @AnthropicAI, leading Cognitive Oversight team. Previously: postdoc with @davidbau, math PhD at @Harvard.
Boston
Joined October 2023
- I had the great pleasure of learning about this about 30mins before the rest of the world when I arrived today to my first day of work at @AnthropicAI and Jan was sitting next to me.I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
- Claude loves to refuse harmful queries. What happens when you tell it that it's being trained to never refuse? Claude fakes alignment: strategically complies during training episodes, but not when unmonitored. Or in meme form:New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
- What can AI researchers do *today* that AI developers will find useful for ensuring the safety of future advanced AI systems? To ring in the new year, the Anthropic Alignment Science team is sharing some thoughts on research directions we think are important.
- Can we understand & edit unanticipated mechanisms in LMs? We introduce sparse feature circuits, & use them to explain LM behaviors, discover & fix LM bugs, & build an automated interpretability pipeline! Preprint w/ @can_rager, @ericjmichaud_, @boknilev, @davidbau, @amuuueller
GIF - Do language models know whether statements are true/false? And if so, what's the best way to "read an LLM's mind"? In a new paper with @tegmark, we explore how LLMs represent truth. 1/N
GIF - I've made a post here about a surprising observation about LLM representations: LLMs seem to linearly represent XORs of arbitrary features, even when there's no reason to do so. I also write about the consequences this has for interp research lesswrong.com/posts/hjJXCn9G…
- In a new post, I argue that interpretability researchers should demo downstream applications of their research as a means of validation.
- Replying to @saprmarksI'm glad xAI (apparently) ran some evals. But also, well...
- Replying to @saprmarksBut xAI is way out of line relative to other frontier AI developers, and this needs to be called out Anthropic, OpenAI, and Google's release practices have issues. But they at least do something, anything to assess safety pre-deployment and document findings. xAI does not.
- Replying to @saprmarksWhat's in a system card? For one, dangerous capabilities (DC) evals. These measure how well the model can assist with tasks that could pose a national security threat (like hacking or synthesizing bioweapons). E.g. these are the bio DC evals reported in the Claude 4 system card.
- Replying to @saprmarksFirst, it is standard practice to release models alongside a "system card" with additional info. You can find these system cards by searching "[model name] system card" Claude 4: www-cdn.anthropic.com/6be99a52cb68eb… o3: cdn.openai.com/pdf/2221c875-0… Gemini 2.5 Pro: storage.googleapis.com/model-cards/do…
- Replying to @saprmarksCoI: I work at Anthropic (though this thread represents my personal views only). But, as you'll see, this thread is not just mudslinging at a competitor: I'll touch on issues with model release practices across the industry (including at Anthropic).









