Kevin Meng
209 posts
@mengk20
@TransluceAI
mengk.me
Joined August 2016
213 Following · 2,404 Followers
  • Pinned
    Kevin Meng
    @mengk20
    Oct 23, 2024
    why do language models think 9.11 > 9.9? at @TransluceAI we stumbled upon a surprisingly simple explanation - and a bugfix that doesn't use any re-training or prompting. turns out, it's about months, dates, September 11th, and... the Bible?
    Transluce
    @TransluceAI
    Oct 23, 2024
    Monitor: An Observability Interface for Language Models
    Research report: transluce.org/observability-…
    Live interface: monitor.transluce.org (optimized for desktop)
    375K
  • Kevin Meng
    @mengk20
    Nov 4, 2022
    How & where do large language models (LLMs) like GPT store knowledge? Can we surgically write *new* facts into them, just like we write records into databases? Explainer 🧵 on how interpretability & model editing go hand-in-hand, and why these emerging areas are so important 👇
  • Kevin Meng
    @mengk20
    Mar 25, 2025
    AI models are *not* solving problems the way we think
    using Docent, we find that Claude solves *broken* eval tasks - memorizing answers & hallucinating them! details in 🧵
    we really need to look at our data harder, and it's time to rethink how we do evals...
    Transluce
    @TransluceAI
    Mar 24, 2025
    To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇
    155K
  • Kevin Meng
    @mengk20
    Apr 30, 2023
    We find that *interpreting* the mechanisms inside GPT can lead to practical methods for *controlling* its behavior! MEMIT is an algorithm that can write 10,000 new facts into GPT at once. If you're in Kigali for ICLR, swing by our oral/poster tomorrow! + DM us to hang out :)
    74K
  • Kevin Meng
    @mengk20
    Oct 23, 2024
    Replying to @mengk20
    so why don't we try direct neuron interventions? we can directly stop the model from interpreting these numbers as dates or biblical verses, by setting those neuron activations to 0. and that works! zeroing out september 11th attack neurons also works.
    21K
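A minimal sketch of what "setting those neuron activations to 0" means, on a toy one-layer network in plain Python. The weights, the input, and the `forward` helper are all made up for illustration; this is not the Monitor implementation or a real language model.

```python
def relu(value):
    return max(0.0, value)

def forward(x, w_in, w_out, ablate=frozenset()):
    """Compute one output logit, zeroing the hidden neurons listed in `ablate`."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    # The intervention: overwrite the chosen activations with 0.
    hidden = [0.0 if i in ablate else h for i, h in enumerate(hidden)]
    return sum(w * h for w, h in zip(w_out, hidden))

# Hypothetical toy parameters: 2 inputs, 3 hidden neurons, 1 logit.
X = [1.0, 2.0]
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [0.5, -1.0, 2.0]

baseline = forward(X, W_IN, W_OUT)              # 4.5
ablated = forward(X, W_IN, W_OUT, ablate={2})   # -1.5
print(baseline, ablated)
```

In this toy, neuron 2 is what pushes the logit positive, so zeroing it flips the sign. In a real model the same idea is usually applied by patching activations during the forward pass.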
  • Kevin Meng
    @mengk20
    Apr 21, 2023
    At @ember_ml, we're rethinking search. Primitive search is about finding what you already knew to look for. But the *real* magic is in searching & synthesizing insight across millions of records. Finding new, unexpected things. Here's why vector DBs don't cut it. 🧵👇
    22K
  • Kevin Meng
    @mengk20
    May 1, 2023
    Replying to @mayfer
    haha this is very clever, well done :) the issue here is actually that GPT seems to store facts unidirectionally - "paris contains the eiffel tower" doesn't seem affected by "eiffel tower located in paris"
    60K
  • Kevin Meng
    @mengk20
    Oct 23, 2024
    Replying to @mengk20
    i typed the query in, noticed the incorrect "bigger" token, and ran attribution to find neurons influencing that mistake. we found concepts related to:
    - the september 11th attacks
    - biblical verses
    - dates and months
    okay, dates, sure, but bible verses? 9/11?
    9.2K
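The post does not say which attribution method was used. As one simple illustration, the sketch below scores each hidden neuron by activation times output weight for a single target logit, then ranks them. The activations, weights, and the `direct_attribution` name are hypothetical.

```python
def direct_attribution(hidden, w_out):
    """Score each neuron by activation * output weight for one target logit."""
    scores = [(i, h * w) for i, (h, w) in enumerate(zip(hidden, w_out))]
    # Largest positive contribution to the target logit comes first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical hidden activations and readout weights for the "bigger" logit.
HIDDEN = [1.0, 2.0, 3.0]
W_OUT = [0.5, -1.0, 2.0]

ranked = direct_attribution(HIDDEN, W_OUT)
print(ranked)  # [(2, 6.0), (0, 0.5), (1, -2.0)]
```

For a purely linear readout like this one, each score equals the change in the logit you would see from zeroing that neuron. Deeper models need gradient-based or patching-based variants.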
  • Kevin Meng
    @mengk20
    Mar 25, 2025
    Replying to @mengk20
    upon closer inspection, it looks like the model
    - knew what the image was supposed to be (a sequence of numbers)
    - generated numbers it knew would decode to some answer
    - decoded those numbers
    - submitted the result
    - the answer was correct 🧐
    11K
  • Kevin Meng
    @mengk20
    May 1, 2023
    Replying to @mengk20 and @mayfer
    it seems to us that GPT stores a bunch of facts about the eiffel tower and, separately, a bunch of facts about paris. it would be great if these two were somehow connected, but it looks like this isn’t the case! pretty interesting for future work
    3.5K
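A toy analogy for the "unidirectional" observation, using a lookup table keyed by subject and relation. This is only an analogy and not a claim about how GPT stores facts internally; the `facts` table and `recall` helper are hypothetical.

```python
# Facts are filed under the subject they were written for.
facts = {("eiffel tower", "located in"): "paris"}

def recall(subject, relation):
    """Look up a fact by its (subject, relation) key; None if nothing is stored."""
    return facts.get((subject, relation))

forward_fact = recall("eiffel tower", "located in")  # "paris"
reverse_fact = recall("paris", "contains")           # None: no entry under "paris"

# The reverse direction only works once it is written as its own record.
facts[("paris", "contains")] = "eiffel tower"
reverse_after_write = recall("paris", "contains")    # "eiffel tower"
print(forward_fact, reverse_fact, reverse_after_write)
```

In a table like this, editing the record under one key leaves the record under the other key untouched, which is the behavior the thread describes.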
  • Kevin Meng
    @mengk20
    Oct 23, 2024
    Replying to @mengk20
    i was curious about the biblical verses, so i clicked on that cluster and looked at where those neurons fire highly. turns out they really like verse numbers. notice the numbering system - Matthew 8:5 comes before 8:13. i'd have never guessed this!
    6.1K
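The numbering point can be made concrete with a short sketch: the same strings order differently when read as decimals versus as chapter:verse (or section, or month.day) pairs. The helper names below are hypothetical.

```python
from decimal import Decimal

def as_decimal(text):
    """Read '9.11' as the decimal number 9.11."""
    return Decimal(text)

def as_sections(text, sep="."):
    """Read '9.11' as the pair (9, 11), like a verse or section number."""
    return tuple(int(part) for part in text.split(sep))

# As decimals, 9.11 is smaller than 9.9.
print(as_decimal("9.11") > as_decimal("9.9"))    # False

# As (chapter, verse) style pairs, 9.11 comes after 9.9.
print(as_sections("9.11") > as_sections("9.9"))  # True

# Same ordering as the example in the post: Matthew 8:5 comes before 8:13.
print(as_sections("8:5", sep=":") < as_sections("8:13", sep=":"))  # True
```

Both orderings are valid for their own numbering systems, so the "bigger" answer depends on which reading is applied.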
  • Kevin Meng
    @mengk20
    Mar 25, 2025
    Replying to @mengk20
    to be clear, "Claude memorized the solution" doesn't mean "Claude can't do the task." it *does* mean we're not thinking about model capabilities in the right way. an undergraduate would never act like Claude did
    3.9K
  • Kevin Meng
    @mengk20
    Oct 23, 2024
    Replying to @mengk20
    we also ran some simple evals: llama-3.1 8b instruct is only 54% accurate on our test set; about random guessing. just by steering out bible verse neurons, we can get up to 76%. no further training, and no prompting. just getting *rid* of things!
    5K
  • Kevin Meng
    @mengk20
    Dec 7, 2024
    our elicitation agents @TransluceAI have been coming up with weird-looking prompts to circumvent refusal. but why do they look like that? what's up with the "LowerCase" stuff? misspellings and Chinese chars? 350? come to our NeurIPS social next wk to investigate with me!
    Transluce
    @TransluceAI
    Nov 27, 2024
    Transluce will be at #NeurIPS2024! Who’s coming to lunch on Thursday to meet the team and learn about open problems we're working on? Space is limited, RSVP soon. partiful.com/e/BJELvUqIA0dD…
    8.4K