Samuel Marks (@saprmarks) / X

669 posts

Samuel Marks

@saprmarks

AI safety research @AnthropicAI, leading Cognitive Oversight team. Previously: postdoc with @davidbau, math PhD at @Harvard.

Boston

Joined October 2023

Samuel Marks
@saprmarks
Jul 13, 2025
xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵
666K
Samuel Marks
@saprmarks
May 29, 2024
I had the great pleasure of learning about this about 30mins before the rest of the world when I arrived today to my first day of work at @AnthropicAI and Jan was sitting next to me.
Jan Leike
@janleike
May 28, 2024
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
54K
Samuel Marks
@saprmarks
Dec 18, 2024
Claude loves to refuse harmful queries. What happens when you tell it that it's being trained to never refuse? Claude fakes alignment: strategically complies during training episodes, but not when unmonitored. Or in meme form:
Anthropic
@AnthropicAI
Dec 18, 2024
New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
26K
Samuel Marks
@saprmarks
Jan 10, 2025
What can AI researchers do *today* that AI developers will find useful for ensuring the safety of future advanced AI systems? To ring in the new year, the Anthropic Alignment Science team is sharing some thoughts on research directions we think are important.
108K
Samuel Marks
@saprmarks
Apr 3, 2024
Can we understand & edit unanticipated mechanisms in LMs? We introduce sparse feature circuits, & use them to explain LM behaviors, discover & fix LM bugs, & build an automated interpretability pipeline! Preprint w/ @can_rager, @ericjmichaud_, @boknilev, @davidbau, @amuuueller
69K
Samuel Marks
@saprmarks
Oct 16, 2023
Do language models know whether statements are true/false? And if so, what's the best way to "read an LLM's mind"? In a new paper with @tegmark, we explore how LLMs represent truth. 1/N
137K
Samuel Marks
@saprmarks
Jan 4, 2024
I've made a post here about a surprising observation about LLM representations: LLMs seem to linearly represent XORs of arbitrary features, even when there's no reason to do so. I also write about the consequences this has for interp research lesswrong.com/posts/hjJXCn9G…
97K
Samuel Marks
@saprmarks
Mar 31, 2025
In a new post, I argue that interpretability researchers should demo downstream applications of their research as a means of validation.
36K
Samuel Marks
@saprmarks
Jul 13, 2025
Replying to @saprmarks
I'm glad xAI (apparently) ran some evals. But also, well...
21K
Samuel Marks
@saprmarks
Jul 13, 2025
Replying to @saprmarks
But xAI is way out of line relative to other frontier AI developers, and this needs to be called out. Anthropic, OpenAI, and Google's release practices have issues. But they at least do something, anything to assess safety pre-deployment and document findings. xAI does not.
33K
Samuel Marks
@saprmarks
Jul 13, 2025
Replying to @saprmarks
What's in a system card? For one, dangerous capabilities (DC) evals. These measure how well the model can assist with tasks that could pose a national security threat (like hacking or synthesizing bioweapons). E.g. these are the bio DC evals reported in the Claude 4 system card.
76K
Samuel Marks
@saprmarks
Jul 13, 2025
Replying to @saprmarks
First, it is standard practice to release models alongside a "system card" with additional info. You can find these system cards by searching "[model name] system card". Claude 4: www-cdn.anthropic.com/6be99a52cb68eb… ; o3: cdn.openai.com/pdf/2221c875-0… ; Gemini 2.5 Pro: storage.googleapis.com/model-cards/do…
27K
Samuel Marks
@saprmarks
Jul 13, 2025
Replying to @saprmarks
CoI: I work at Anthropic (though this thread represents my personal views only). But, as you'll see, this thread is not just mudslinging at a competitor: I'll touch on issues with model release practices across the industry (including at Anthropic).
34K
Samuel Marks
@saprmarks
Jun 21, 2024
Replying to @ericneyman
(I got this question wrong.)
71K