Kevin Meng (@mengk20) / X

209 posts

Joined August 2016

Pinned
Kevin Meng
@mengk20
Oct 23, 2024
why do language models think 9.11 > 9.9? at @TransluceAI we stumbled upon a surprisingly simple explanation - and a bugfix that doesn't use any re-training or prompting. turns out, it's about months, dates, September 11th, and... the Bible?
Transluce
@TransluceAI
Oct 23, 2024
Monitor: An Observability Interface for Language Models
Research report: transluce.org/observability-…
Live interface: monitor.transluce.org (optimized for desktop)
375K
Kevin Meng
@mengk20
Nov 4, 2022
How & where do large language models (LLMs) like GPT store knowledge? Can we surgically write *new* facts into them, just like we write records into databases? Explainer 🧵 on how interpretability & model editing go hand-in-hand, and why these emerging areas are so important 👇
Kevin Meng
@mengk20
Mar 25, 2025
AI models are *not* solving problems the way we think
using Docent, we find that Claude solves *broken* eval tasks - memorizing answers & hallucinating them! details in 🧵
we really need to look at our data harder, and it's time to rethink how we do evals...
Transluce
@TransluceAI
Mar 24, 2025
To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇
155K
Kevin Meng
@mengk20
Apr 30, 2023
We find that *interpreting* the mechanisms inside GPT can lead to practical methods for *controlling* its behavior! MEMIT is an algorithm that can write 10,000 new facts into GPT at once. If you're in Kigali for ICLR, swing by our oral/poster tomorrow! + DM us to hang out :)
74K
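A toy sketch of what "writing many facts at once" can look like if a weight matrix is treated as a linear key-to-value memory. This is not the MEMIT algorithm as published; it is a plain ridge-regularized least-squares update on random data, and the names `batch_edit`, `K`, `V`, and `ridge` are assumptions made for illustration.

```python
import numpy as np

def batch_edit(W, K, V, ridge=1e-6):
    """Return an edited copy of W such that W_edited @ K is close to V.

    W: (d_out, d_in) weights, viewed as a key -> value memory.
    K: (d_in, n) matrix whose columns are keys (one per fact).
    V: (d_out, n) matrix whose columns are the desired values.
    """
    n = K.shape[1]
    R = V - W @ K                          # what the memory currently gets wrong
    G = K.T @ K + ridge * np.eye(n)        # small (n, n) regularized Gram matrix
    delta = R @ np.linalg.solve(G, K.T)    # least-squares weight update
    return W + delta

rng = np.random.default_rng(0)
d_in, d_out, n_facts = 32, 16, 5           # toy sizes, chosen arbitrarily
W = rng.normal(size=(d_out, d_in))
K = rng.normal(size=(d_in, n_facts))
V = rng.normal(size=(d_out, n_facts))

W_edited = batch_edit(W, K, V)
max_error = np.abs(W_edited @ K - V).max()
print(f"max error on edited keys: {max_error:.2e}")
```

All keys are updated in one solve rather than one at a time, which is the only point this sketch is meant to convey.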
Kevin Meng
@mengk20
Oct 23, 2024
Replying to @mengk20
so why don't we try direct neuron interventions? we can directly stop the model from interpreting these numbers as dates or biblical verses, by setting those neuron activations to 0. and that works! zeroing out september 11th attack neurons also works.
21K
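A minimal sketch of the kind of intervention described above: run a layer, set chosen neuron activations to 0, and continue. The toy one-layer ReLU MLP, the random weights, and the indices in `date_neurons` are all hypothetical; the real neurons would come from an attribution or clustering step.

```python
import numpy as np

def mlp_forward(x, W_in, W_out, ablate=()):
    """One ReLU MLP layer; neurons listed in `ablate` are zeroed before readout."""
    acts = np.maximum(W_in @ x, 0.0)        # hidden neuron activations
    idx = np.array(ablate, dtype=int)       # indices of neurons to switch off
    acts[idx] = 0.0                         # the intervention itself
    return W_out @ acts, acts

rng = np.random.default_rng(0)
W_in = rng.normal(size=(8, 4))
W_out = rng.normal(size=(2, 8))
x = rng.normal(size=4)

date_neurons = [1, 5]                       # hypothetical indices, illustration only
base_out, base_acts = mlp_forward(x, W_in, W_out)
ablated_out, ablated_acts = mlp_forward(x, W_in, W_out, ablate=date_neurons)

print("change in output:", ablated_out - base_out)
```

No weights are modified and nothing is retrained; only the forward pass is altered.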
Kevin Meng
@mengk20
Apr 21, 2023
At @ember_ml, we're rethinking search. Primitive search is about finding what you already knew to look for. But the *real* magic is in searching & synthesizing insight across millions of records. Finding new, unexpected things. Here's why vector DBs don't cut it. 🧵👇
22K
Kevin Meng
@mengk20
May 1, 2023
Replying to @mayfer
haha this is very clever, well done :) the issue here is actually that GPT seems to store facts unidirectionally - "paris contains the eiffel tower" doesn't seem affected by "eiffel tower located in paris"
60K
Kevin Meng
@mengk20
Oct 23, 2024
Replying to @mengk20
i typed the query in, noticed the incorrect "bigger" token, and ran attribution to find neurons influencing that mistake. we found concepts related to:
- the september 11th attacks
- biblical verses
- dates and months
okay, dates, sure, but bible verses? 9/11?
9.2K
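The post does not say which attribution method was used. As one common possibility, here is a sketch of activation-times-gradient attribution for a toy linear readout, where each neuron's score is its activation multiplied by how strongly it pushes "bigger" over "smaller". The weights, the activations, and the two-token setup are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 10
acts = np.maximum(rng.normal(size=n_neurons), 0.0)  # toy neuron activations
w_bigger = rng.normal(size=n_neurons)               # readout weights for "bigger"
w_smaller = rng.normal(size=n_neurons)              # readout weights for "smaller"

# Gradient of (logit_bigger - logit_smaller) with respect to the activations.
direction = w_bigger - w_smaller
logit_diff = direction @ acts

# Per-neuron contribution to the logit difference.
attribution = acts * direction
top = np.argsort(-attribution)[:3]                  # neurons pushing hardest toward "bigger"

print("logit difference:", logit_diff)
print("top neurons:", top, attribution[top])
```

For a linear readout the scores sum exactly to the logit difference, so the highest-scoring neurons are the natural ones to inspect first.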
Kevin Meng
@mengk20
Mar 25, 2025
Replying to @mengk20
upon closer inspection, it looks like the model
- knew what the image was supposed to be (a sequence of numbers)
- generated numbers it knew would decode to some answer
- decoded those numbers
- submitted the result
- the answer was correct 🧐
11K
Kevin Meng
@mengk20
May 1, 2023
Replying to @mengk20 and @mayfer
it seems to us that GPT stores a bunch of facts about the eiffel tower and, separately, a bunch of facts about paris. it would be great if these two were somehow connected, but it looks like this isn’t the case! pretty interesting for future work
3.5K
Kevin Meng
@mengk20
Oct 23, 2024
Replying to @mengk20
i was curious about the biblical verses, so i clicked on that cluster and looked at where those neurons fire highly. turns out they really like verse numbers. notice the numbering system - Matthew 8:5 comes before 8:13. i'd have never guessed this!
6.1K
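The two numbering systems really do disagree, which a short sketch makes concrete. The same string compared as a decimal and as a chapter-and-verse pair gives opposite orderings. The helper names `as_decimal` and `as_verse` are made up here, and this only illustrates the ambiguity; it says nothing about how the model represents it internally.

```python
from decimal import Decimal

def as_decimal(s):
    """Read '9.11' as the number nine point one one."""
    return Decimal(s)

def as_verse(s):
    """Read '9.11' as chapter 9, verse 11 (same shape as dates or version numbers)."""
    chapter, verse = s.split(".")
    return (int(chapter), int(verse))

print(as_decimal("9.11") > as_decimal("9.9"))   # False: 9.11 is the smaller number
print(as_verse("9.11") > as_verse("9.9"))       # True: verse 11 comes after verse 9

print(as_decimal("8.5") < as_decimal("8.13"))   # False: 8.5 is the larger number
print(as_verse("8.5") < as_verse("8.13"))       # True: 8:5 comes before 8:13
```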
Kevin Meng
@mengk20
Mar 25, 2025
Replying to @mengk20
to be clear, "Claude memorized the solution" doesn't mean "Claude can't do the task." it *does* mean we're not thinking about model capabilities in the right way. an undergraduate would never act like Claude did
3.9K
Kevin Meng
@mengk20
Oct 23, 2024
Replying to @mengk20
we also ran some simple evals: llama-3.1 8b instruct is only 54% accurate on our test set; about random guessing. just by steering out bible verse neurons, we can get up to 76%. no further training, and no prompting. just getting *rid* of things!
5K
Kevin Meng
@mengk20
Dec 7, 2024
our elicitation agents @TransluceAI have been coming up with weird-looking prompts to circumvent refusal. but why do they look like that? what's up with the "LowerCase" stuff? misspellings and Chinese chars? 350? come to our NeurIPS social next wk to investigate with me!
Transluce
@TransluceAI
Nov 27, 2024
Transluce will be at #NeurIPS2024! Who’s coming to lunch on Thursday to meet the team and learn about open problems we're working on? Space is limited, RSVP soon. partiful.com/e/BJELvUqIA0dD…
8.4K