Samuel Marks
669 posts
Samuel Marks
@saprmarks
AI safety research @AnthropicAI, leading Cognitive Oversight team. Previously: postdoc with @davidbau, math PhD at @Harvard.
Boston
Joined October 2023
148
Following
4,668
Followers
  • Samuel Marks
    @saprmarks
    Jul 13, 2025
    xAI launched Grok 4 without any documentation of their safety testing. This is reckless and breaks with industry best practices followed by other major AI labs. If xAI is going to be a frontier AI developer, they should act like one. 🧵
    666K
  • Samuel Marks
    @saprmarks
    May 29, 2024
    I had the great pleasure of learning about this about 30mins before the rest of the world when I arrived today to my first day of work at @AnthropicAI and Jan was sitting next to me.
    Jan Leike
    @janleike
    May 28, 2024
    I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
    54K
  • Samuel Marks
    @saprmarks
    Dec 18, 2024
    Claude loves to refuse harmful queries. What happens when you tell it that it's being trained to never refuse? Claude fakes alignment: strategically complies during training episodes, but not when unmonitored. Or in meme form:
    Image
    “Alignment faking in large language models” by Greenblatt et al.
    Anthropic
    @AnthropicAI
    Dec 18, 2024
    New Anthropic research: Alignment faking in large language models. In a series of experiments with Redwood Research, we found that Claude often pretends to have different views during training, while actually maintaining its original preferences.
    26K
  • Samuel Marks
    @saprmarks
    Jan 10, 2025
    What can AI researchers do *today* that AI developers will find useful for ensuring the safety of future advanced AI systems? To ring in the new year, the Anthropic Alignment Science team is sharing some thoughts on research directions we think are important.
    Image
    108K
  • Samuel Marks
    @saprmarks
    Apr 3, 2024
    Can we understand & edit unanticipated mechanisms in LMs? We introduce sparse feature circuits, & use them to explain LM behaviors, discover & fix LM bugs, & build an automated interpretability pipeline! Preprint w/ @can_rager, @ericjmichaud_, @boknilev, @davidbau, @amuuueller
    Image
    GIF
    69K
  • Samuel Marks
    @saprmarks
    Oct 16, 2023
    Do language models know whether statements are true/false? And if so, what's the best way to "read an LLM's mind"? In a new paper with @tegmark, we explore how LLMs represent truth. 1/N
    Image
    GIF
    137K
  • Samuel Marks
    @saprmarks
    Jan 4, 2024
    I've made a post here about a surprising observation about LLM representations: LLMs seem to linearly represent XORs of arbitrary features, even when there's no reason to do so. I also write about the consequences this has for interp research lesswrong.com/posts/hjJXCn9G…
    97K
  • Samuel Marks
    @saprmarks
    Mar 31, 2025
    In a new post, I argue that interpretability researchers should demo downstream applications of their research as a means of validation.
    Image
    36K
  • Samuel Marks
    @saprmarks
    Jul 13, 2025
    Replying to @saprmarks
    I'm glad xAI (apparently) ran some evals. But also, well...
    Image
    21K
  • Samuel Marks
    @saprmarks
    Jul 13, 2025
    Replying to @saprmarks
    But xAI is way out of line relative to other frontier AI developers, and this needs to be called out.
    Anthropic, OpenAI, and Google's release practices have issues. But they at least do something, anything to assess safety pre-deployment and document findings. xAI does not.
    33K
  • Samuel Marks
    @saprmarks
    Jul 13, 2025
    Replying to @saprmarks
    What's in a system card? For one, dangerous capabilities (DC) evals. These measure how well the model can assist with tasks that could pose a national security threat (like hacking or synthesizing bioweapons). E.g. these are the bio DC evals reported in the Claude 4 system card.
    Image
    76K
  • Samuel Marks
    @saprmarks
    Jul 13, 2025
    Replying to @saprmarks
    First, it is standard practice to release models alongside a "system card" with additional info. You can find these system cards by searching "[model name] system card".
    Claude 4: www-cdn.anthropic.com/6be99a52cb68eb…
    o3: cdn.openai.com/pdf/2221c875-0…
    Gemini 2.5 Pro: storage.googleapis.com/model-cards/do…
    27K
  • Samuel Marks
    @saprmarks
    Jul 13, 2025
    Replying to @saprmarks
    CoI: I work at Anthropic (though this thread represents my personal views only). But, as you'll see, this thread is not just mudslinging at a competitor: I'll touch on issues with model release practices across the industry (including at Anthropic).
    34K
  • Samuel Marks
    @saprmarks
    Jun 21, 2024
    Replying to @ericneyman
    (I got this question wrong.)
    71K
