<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[AI Alignment - Medium]]></title>
        <description><![CDATA[Aligning AI systems with human interests. - Medium]]></description>
        <link>https://ai-alignment.com?source=rss----624d886c4aa4---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>AI Alignment - Medium</title>
            <link>https://ai-alignment.com?source=rss----624d886c4aa4---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 23:04:34 GMT</lastBuildDate>
        <atom:link href="https://ai-alignment.com/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[My views on “doom”]]></title>
            <link>https://ai-alignment.com/my-views-on-doom-4788b1cd0c72?source=rss----624d886c4aa4---4</link>
            <guid isPermaLink="false">https://medium.com/p/4788b1cd0c72</guid>
            <dc:creator><![CDATA[Paul Christiano]]></dc:creator>
            <pubDate>Thu, 27 Apr 2023 17:38:16 GMT</pubDate>
            <atom:updated>2023-04-27T17:39:07.608Z</atom:updated>
            <content:encoded><![CDATA[<p>I’m often asked: “what’s the probability of a really bad outcome from AI?”</p><p>There are many different versions of that question with different answers. In this post I’ll try to answer a bunch of versions of this question all in one place.</p><h4>Two distinctions</h4><p>Two distinctions often lead to confusion about what I believe:</p><ul><li>One distinction is between <strong>dying</strong> (“<em>extinction</em> risk”) and <strong>having a bad future </strong>(“<em>existential</em> risk”). I think there’s a good chance of bad futures without extinction, e.g. that AI systems take over but don’t kill everyone.</li><li>An important subcategory of “bad future” is “AI takeover:” an outcome where the world is governed by AI systems, and we weren’t able to build AI systems who share our values or care a lot about helping us. This need not result in humans dying, and it may not even be an objectively terrible future. But it does mean that humanity gave up control over its destiny, and I think in expectation it’s pretty bad.</li><li>A second distinction is between <strong>dying now</strong> and <strong>dying later. </strong>I think that there’s a good chance that we don’t die from AI, but that AI and other technologies greatly accelerate the rate of change in the world and so something else kills us shortly afterwards. I wouldn’t call this “from AI” but I do think it happens soon in calendar time and I’m not sure the distinction is comforting to most people.</li></ul><h4>Other caveats</h4><p>I’ll give my beliefs in terms of probabilities, but these really are just best guesses — the point of numbers is to quantify and communicate what I believe, not to claim I have some kind of calibrated model that spits out these numbers.</p><p>Only one of these guesses is even really related to my day job (the 15% probability that AI systems built by humans will take over). For the other questions I’m just a person who’s thought about it a bit in passing. I wouldn’t recommend deferring to the 15%, but <em>definitely</em> wouldn’t recommend deferring to anything else.</p><p>A final source of confusion is that I give different numbers on different days. Sometimes that’s because I’ve considered new evidence, but normally it’s just because these numbers are an imprecise quantification of my belief that changes from day to day. One day I might say 50%, the next I might say 66%, the next I might say 33%.</p><p>I’m giving percentages but you should treat these numbers as having 0.5 significant figures.</p><h4>My best guesses</h4><p>Probability of an AI takeover: <strong>22%</strong></p><ul><li>Probability that humans build AI systems that take over: <strong>15%<br></strong>(Including anything that happens before human cognitive labor is basically obsolete.)</li><li>Probability that the AI we build doesn’t take over, but that <em>it</em> builds even smarter AI and there is a takeover some day further down the line: <strong>7%</strong></li></ul><p>Probability that most humans die within 10 years of building powerful AI (powerful enough to make human labor obsolete): <strong>20%</strong></p><ul><li>Probability that most humans die because of an AI takeover: <strong>11%</strong></li><li>Probability that most humans die for non-takeover reasons (e.g. 
more destructive war or terrorism) either as a direct consequence of building AI or during a period of rapid change shortly thereafter: <strong>9%</strong></li></ul><p>Probability that humanity has somehow irreversibly messed up our future within 10 years of building powerful AI: <strong>46%</strong></p><ul><li>Probability of AI takeover: <strong>22% </strong>(see above)</li><li>Additional extinction probability: <strong>9% </strong>(see above)</li><li>Probability of messing it up in some other way during a period of accelerated technological change (e.g. driving ourselves crazy, creating a permanent dystopia, making unwise commitments…): <strong>15%</strong></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4788b1cd0c72" width="1" height="1" alt=""><hr><p><a href="https://ai-alignment.com/my-views-on-doom-4788b1cd0c72">My views on “doom”</a> was originally published in <a href="https://ai-alignment.com">AI Alignment</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Can we efficiently distinguish different mechanisms?]]></title>
            <link>https://ai-alignment.com/can-we-efficiently-distinguish-different-mechanisms-690abcc0fba9?source=rss----624d886c4aa4---4</link>
            <guid isPermaLink="false">https://medium.com/p/690abcc0fba9</guid>
            <dc:creator><![CDATA[Paul Christiano]]></dc:creator>
            <pubDate>Tue, 27 Dec 2022 00:11:31 GMT</pubDate>
            <atom:updated>2022-12-27T00:11:31.458Z</atom:updated>
            <content:encoded><![CDATA[<p>(<em>This post is an elaboration on “tractability of discrimination” as introduced in section III of </em><a href="https://ai-alignment.com/can-we-efficiently-explain-model-behaviors-92b83c2acd5a"><em>Can we efficiently explain model behaviors?</em></a><em> For an overview of the general plan this fits into, see </em><a href="https://ai-alignment.com/mechanistic-anomaly-detection-and-elk-fb84f4c6d0dc"><em>Mechanistic anomaly detection</em></a><em> and </em><a href="https://ai-alignment.com/finding-gliders-in-the-game-of-life-b7c93b51079d"><em>Finding gliders in the game of life</em></a><em>.)</em></p><h4>Background</h4><p>We’d like to build AI systems that take complex actions to protect humans and maximize option value. Powerful predictive models may play an important role in such AI, either as part of a model-based planning algorithm or as part of a learned policy. Either way, this introduces the risk that we will select actions predicted to lead to good-looking outcomes without being able to distinguish predicted outcomes that are <em>actually</em> good from predicted outcomes where our measurements were corrupted. If tampering with sensors is easier than actually achieving our goals then we may inadvertently build very powerful systems taking creative actions to corrupt our measurements. If we iteratively improve and harden our measurements, this can lead to AI systems that work well for a long time before abruptly and catastrophically disempowering humanity.</p><p>I consider this one of the conceptually cleanest alignment problems, and I expect similar dynamics to play a role in realistic alignment failures even if those failures aren’t this simple. ARC’s current work is focused on decisive solutions to this problem, though it looks like the same approach may also apply directly to identifying treacherous turns more generally.</p><h4>Are distinct mechanisms enough?</h4><p>ARC has been looking for training strategies that avoid this problem by leveraging <em>only</em> the fact that sensor tampering is “weird,” i.e. conceptually distinct from the normal mechanism giving rise to predictions of good-looking outcomes on the training distribution.</p><p>More specifically, at training time our model predicts coherent sensor readings because it predicts that sensors reflect coherent structure in the world. But if someone tampers with sensors to show a convincing fiction, then the predicted observations are coherent because the fiction was designed to look coherent. This suggests that different mechanisms are responsible for (actions that lead to good-looking outcomes for the normal reasons) and (actions that lead to good-looking outcomes via sensor tampering). If we are able to detect that difference by looking at the internal behavior of a predictor, then we may be able to use that to avoid sensor tampering.</p><p>It’s unclear if “distinct mechanisms” is a strong enough assumption to avoid sensor tampering. We hope that it is, and so we are trying to define formally what we mean by “distinct mechanisms” and show that it is possible to distinguish different mechanisms and that sensor tampering is always a distinct mechanism.</p><p>If that fails, we will need to solve sensor tampering by identifying additional structure in the problem, beyond the fact that it involves distinct mechanisms.</p><h4>Roadmap</h4><p>In this post I want to explore this situation in a bit more detail. 
In particular, I will:</p><ol><li>Describe what it might look like to have a pair of qualitatively distinct mechanisms that are intractable to distinguish.</li><li>Discuss the plausibility of that situation and some reasons to think it’s possible in theory.</li><li>Emphasize how problematic that situation would be for many existing approaches to alignment.</li><li>Discuss three candidates for ways to solve the sensor tampering problem even if we can’t distinguish different mechanisms in general.</li></ol><p>Note that the existence of a pathological example of distinct-but-indistinguishable mechanisms may not be interesting to anyone other than theorists. And even for the theorists, it would still leave open many important questions of measuring and characterizing possible failures, designing algorithms that degrade gracefully even if they sometimes fail, and so on. But this is particularly important to ARC because our research is looking for worst-case solutions, and even exotic counterexamples are extremely valuable for that search.</p><h3>1. What might indistinguishable mechanisms look like?</h3><h4>Probabilistic primality tests</h4><p>The best example I currently have of a “hard case” for distinguishing mechanisms comes from probabilistic primality tests. In this section I’ll explore that example to help build intuition for what it would look like to be unable to recognize sensor tampering.</p><p>The Fermat primality test is designed to recognize whether an integer n is prime. It works as follows:</p><ul><li>Pick a random integer a &lt; n.</li><li>Compute a^n mod n. This can be done in time polylog(n) via iterated squaring.</li><li>Output “pass” if a^n = a (mod n). A prime number always passes.</li></ul><p>In almost all cases where this test passes, n is prime. And you can eliminate most false positives by just trying a second random value of a. But there are a few cases (“Carmichael numbers”) for which this test passes for most (and in fact all) values of a.</p><p>Primes and Carmichael numbers both pass the Fermat test for every value of a. This turns out to be equivalent to saying that “n is squarefree and, for all primes p dividing n, (p-1) divides (n-1).” For primes this happens because n is a prime and so there is only one prime divisor p and p-1 = n-1. For Carmichael numbers it instead happens because (p-1) and (n-1) are both highly divisible and a bunch of favorable coincidences occur. We can think of this as building a test that’s supposed to detect factors of n, and then there happens to be a ton of cancellation so that we don’t see anything.</p><p>Suppose that we observed the Fermat test passing in a bunch of random cases and formed the abstraction of “prime” to explain that phenomenon. Then we try to use AI to find a large prime that satisfies some additional property P. If we do this by using the Fermat test as our definition of primality, we may end up getting a Carmichael number, if the property P is rare amongst primes but common amongst composites.</p><p>In reality there are probabilistic primality tests that don’t have false positives. But we can imagine a world where gradient descent is able to find the Fermat primality test without us having any algorithm that can learn one of those stronger tests. 
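(For concreteness, here is a minimal Python sketch of the Fermat test described above; the function name, the number of trials, and the use of 561 as an example Carmichael number are illustrative details rather than anything from the post.)</p><pre>
import random

def fermat_test(n, trials=10):
    # Pass if a^n = a (mod n) for every random a we try; a prime always passes.
    for _ in range(trials):
        a = random.randrange(1, n)
        if pow(a, n, n) != a:
            return False  # a witnesses that n is composite
    return True  # probably prime, or a rare false positive

# 561 = 3 * 11 * 17 is the smallest Carmichael number: it passes for every base a,
# so extra trials do not help the way they do for ordinary composites.
print(fermat_test(101))  # True (prime)
print(fermat_test(561))  # True (composite, but this test cannot tell)
print(fermat_test(100))  # False (almost surely)
</pre><p>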
We’ll revisit the plausibility of this situation in section 2.</p><h4>The analogy</h4><p>An analogous situation could make it hard to detect sensor tampering:</p><ul><li>We have sensors that are often predicted to show smiling human faces (= the Fermat test often passes for random numbers n)</li><li>We introduce the abstraction of “happy human” to explain those correlated observations (= we introduce the abstraction of “n is prime” to explain the Fermat test passing)</li><li>The model also predicts smiling human faces if an action would tamper with sensors (= the Fermat test also passes if n is a Carmichael number)</li><li>We use AI to select for tractable actions that lead to predictions of happy human faces and end up with sensor tampering (= we select for Fermat test passing with an additional constraint P and end up with Carmichael numbers)</li></ul><p>This is a problem if the <strong>best</strong> way to act competently is to train a sophisticated predictor and then use it for planning, and if all sophisticated predictors have this unfortunate mechanism-indistinguishability property. The problem can materialize either when we build a model-based algorithm that plans against such a predictor; or when SGD learns a neural network which internally selects actions based on a similar predictor.</p><p>It’s worth emphasizing that this feels like an extremely weird situation. In some sense the predictor “doesn’t know” whether there are really smiling humans or whether the sensors are tampered with, and couldn’t answer that question honestly even if it wanted to. You could imagine calling this “not an alignment failure” (or at least not an ELK failure) depending on how you define blurry words like “knowledge” and “intent.” But at the end of the day we would still end up with a competent AI which produces creative and competent actions that disempower humanity, and I think we should try to avoid this failure whatever you call it.</p><p>If we had a clear definition of “knowledge” then I could imagine drawing a bright line and saying this isn’t an alignment failure, in the same way I would if e.g. my AI started a nuclear war in the same way that a similarly-competent human would. But I think you should be very hesitant to do that while “knowledge” is such a vague concept, because I think it facilitates sweeping all failures into the gaps that we don’t understand. Instead our priority should be to understand and ideally rule out this strange potential situation where the AI creatively and competently kills us all without even knowing that it does so.</p><h3>2. Are distinct mechanisms efficiently distinguishable?</h3><p>ARC’s current “plan A” is that any pair of sufficiently distinct mechanisms can be efficiently distinguished. This would be really convenient for a bunch of reasons and it looks plausible, so we’ll keep investigating it until we feel we’ve ruled it out.</p><p>In particular, right now I don’t have even a single example of a function f such that (i) there are two <strong>clearly distinct</strong> mechanisms that can lead to f(x) = 1, and (ii) there is no known efficient discriminator for distinguishing those mechanisms. 
I would really love to have such examples.</p><p>That said, there are still two big reasons that I’m skeptical about the conjecture that distinct mechanisms are always distinguishable: (a) it’s a really striking claim for which a failure to find counterexamples isn’t very strong evidence, and (b) in known examples like primality testing it still seems easy to imagine the situation where we can find the mechanism but not the discriminator, i.e. we haven’t yet found an automated way to learn a discriminator.</p><p>Overall if I had to guess I’d say maybe a 20% chance that there is a formal version of “all distinct mechanisms are distinguishable” which is true and sufficient to rule out sensor tampering. This is still high enough that it’s a significant priority for me until ruled out.</p><h4>A. This is a striking claim and judging counterexamples is hard</h4><p>Any universally-quantified statement about circuits is pretty striking — it would have implications for number theory, dynamical systems, neural nets, <em>etc.</em> It’s also pretty different from anything I’ve seen before. So the odds are against it.</p><p>One piece of evidence in favor is that it’s at least plausible: it’s kind of weird for a circuit to have a hidden latent structure that can have an effect on its behavior without being detectable.</p><p>Unfortunately there are plenty of examples of interesting mathematical circuits (e.g. primality tests) that reveal the presence of some latent structure (e.g. a factorization) without making it explicit. Another example I find interesting is a determinant calculation <a href="https://people.eecs.berkeley.edu/~vazirani/pubs/matching.pdf">revealing the presence of a matching</a> without making that matching explicit. These examples undermine the intuition that latent structure can’t have an effect on model behavior while remaining fully implicit.</p><p>That said, I don’t know of examples where the latent structure isn’t distinguishable. Probabilistic primality testing comes closest, but there are in fact good primality tests. So this gives us a second piece of evidence for the conjecture.</p><p>Unfortunately, the strength of this evidence is limited not only by the general difficulty of finding counterexamples but also by the difficulty of saying what we mean by “distinct mechanisms.” If we could really precisely state a theorem then I think we’d have a better chance of finding an example if one exists, but as it stands it’s hard for anyone to engage with this question without spending a lot of time thinking about a bunch of vague philosophy (and even then we are at risk of gerrymandering categories to avoid engaging with an example).</p><h4>B. Automatically finding a good probabilistic primality test seems hard</h4><p>The Fermat test can pass either for primes or for Carmichael numbers. It turns out there are other tests that can distinguish those cases, but it’s easy to imagine learning the Fermat test without being able to find any of those other superior tests.</p><p>To illustrate, let’s consider two examples of better tests:</p><ul><li><strong>Rabin-Miller</strong>: If a^(n-1) = 1 (mod n), we can also check a^((n-1)/2). This must be a square root of 1, and if n is prime it will be either +1 or -1. If we get +1, then we can keep dividing the exponent by 2, considering a^((n-1)/4) and so on. 
If n is composite then 1 has a lot of square roots other than +1 and -1, and it’s easy to prove that with reasonably high probability one of them will appear in this process.</li><li><strong>Randomized AKS</strong>: If n is prime and X is an indeterminate, then (a+X)^n = (a^n + X^n) = (a + X^n) mod n. This condition is hard to evaluate, but if we arbitrarily define X^r = 1 for some small number r then we can compute (a + X)^n mod n by iterated squaring in time O(r log n). If n is composite, it turns out there is a high probability that (a+X)^n != (a + X^n) mod n.</li></ul><p>While these tests are structurally similar to the Fermat test, there’s no obvious way to automatically transform the Fermat test into either of these stronger formats. And so if we had learned the Fermat test, it’s not clear we’d have any way to find one of the stronger tests without learning them from scratch. Moreover, while these tests are somewhat simple, they are more complex than the Fermat test, and so this learning process might be much harder than the problem of learning the Fermat test itself.</p><p>To avoid this problem we’d like to specify a way to represent these stronger tests that uses fewer (additional) parameters than the Fermat test itself. As a stylized example, you could imagine specifying a stronger test by pointing to particular parts of the Fermat test and saying “randomize these parts.” As long as the number of parameters describing “which parts to randomize” was smaller than the number of parameters in the Fermat test itself, we’d be in business.</p><p>While there’s no obvious representation like that, it seems plausible to me that we could find one, particularly for the AKS test. This would be very suggestive that we could find a stronger test as quickly as we can find the Fermat test, but wouldn’t be totally decisive since e.g. gradient descent could have an easier time with one learning problem than the other (and that will depend on more details).</p><p>Of course it’s particularly plausible that you could have the Fermat test but no robust test because that was humanity’s situation for many years! An automated strategy for extending the Fermat test to distinguish primes from Carmichael numbers would appear to teach us something fundamental about primality testing, that wasn’t known to humanity until the 70s, <strong>without making reference to any facts about numbers</strong>. So that seems like a tall order.</p><p>I’m focused on the primality testing example in large part because it’s the case where the situation seems most hopeless. If we could find such a strategy in this case I would become meaningfully more optimistic about the overall conjecture.</p><h3>3. Being unable to distinguish mechanisms is bad news</h3><p>If it’s impossible to distinguish distinct mechanisms, then that rules out approaches to ELK based on mechanistic anomaly detection alone. But the situation seems much worse than that, and in fact it seems like any approach to alignment is going to need to exploit some additional property of sensor tampering to rule it out.</p><p>In particular:</p><ul><li>The kinds of approaches discussed in <a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.kkaua0hwmp1d">Eliciting latent knowledge</a> are complete non-starters. All those approaches try to define a loss function so that the strategy “answer questions honestly” gets a low loss. 
But if you can’t learn to recognize sensor tampering then it doesn’t matter how low a loss you’d get by answering questions honestly, gradient descent simply can’t learn how to do it. Analogously, if there’s no simple and efficient primality test, then it doesn’t matter whether you have a loss function which would incentivize primality testing, you’re not going to be able to do it.</li><li>Avoiding sensor tampering by interpretability runs into exactly the same problem. If there’s no efficient algorithm for recognizing sensor tampering, then no matter in what sense you “understand” what the model is doing, you still can’t tell whether it thinks an action is good because of sensor tampering or because of doing the task as intended. Interpretability helps us deal with the case where we have no loss function to incentivize the discriminator we need, but it won’t help if there simply doesn’t exist any efficient discriminator.</li><li>Approaches like debate or amplification are unable to help if the best way to plan is to use a learned model for which discrimination is impossible. When we try to apply these methods we will use something like imitative generalization, performing a search over augmented-human-legible hypotheses. But in fact the best hypotheses look like opaque predictors, and so we are back to trying and failing to solve ELK for those hypotheses. This is discussed a bit more <a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.3l614s96sz9t">here in the ELK doc</a> and was the motivation for focusing on ELK. On the bright side, when these approaches fail it may simply lead to uncompetitive systems rather than leading to sensor tampering, even in the worst case. But on the downside, sufficiently large competitiveness problems may be catastrophic without coordination.</li><li>Recursive reward modeling can solve this problem (where debate can’t) only if a human and AI team can harden sensors against manipulation, including in the case where the “attacker” can use sophisticated predictors for which discrimination is hard but “defenders” can’t. I think this is very plausible, but probably does require substantive additional empirical assumptions; we’ll discuss the approach of “just make sensor tampering impossible” in section 4. Of course this is particularly plausible if we can apply AI systems in ways that don’t give them rich opportunities for sensor tampering, and in general the efficacy of approaches like RRM or debate depends a lot on the political and social environment in which AI systems are deployed.</li><li>Some work on AI alignment focuses on understanding the inductive biases of SGD, how training shapes values, on the role of “habits” and non-maximizing behavior, on analogies with humans, and so on. But if planning against a sophisticated predictive model is an effective strategy for decision-making then ultimately every one of these approaches must confront the same issue: either by building systems that explicitly use planning against a learned model, or by training systems that need to use this strategy internally. 
Empirical work mostly seems helpful for grappling with indistinguishable mechanisms insofar as empirical work can help us notice this problem in advance if it’s a real problem.</li></ul><p>Overall my sense is that if distinct mechanisms can be indistinguishable, then any indefinitely scalable solution to alignment will need to use additional properties of sensor tampering or assumptions about the structure of optimal decision-making. That said, even in relatively pessimistic worlds there may be a long period of time (potentially extending to fairly superhuman systems) for which these issues are surmountable.</p><h3>4. Approaches to sensor tampering assuming indistinguishable mechanisms</h3><p>In this section we’ll discuss three possible approaches to sensor tampering:</p><ul><li>Harden sensors such that tampering with them is harder than achieving the intended task.</li><li>Harden sensors enough that you can’t tamper with them without “trying,” and then argue that if you are “trying” to tamper with sensors then you can discriminate.</li><li>Try to characterize some other sense in which models “don’t know what’s going on” when mechanisms are indistinguishable. Design algorithms that work as long as the AI “knows what’s going on” and then validate that assumption.</li></ul><p>Our main hope is to find a way to efficiently distinguish distinct mechanisms. But it still seems worth fleshing out these alternatives.</p><h4>A. Harden sensors so that tampering is harder than the intended task</h4><p>We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but <em>left no trace of itself</em>.</p><p>Some dynamics make this seem more plausible, at least as a way to avoid the particular obstruction in this post (and therefore open up a much broader universe of “outs”):</p><ul><li>We can use AI systems to help us build sensors and understand their potential failures. There are potential difficulties in training such systems, but there are a lot of promising options that haven’t been very thoroughly explored and ultimately the obstruction in this post doesn’t speak directly to the difficulty of that problem. What matters becomes not the gap between humans and AI systems, but between “AI systems using opaque predictors for which discrimination is impossible” and “all the other AI.”</li><li>To address the difficulty in this post we don’t need to build a sensor that reliably tells us if a robot rebellion is occurring, we just need to be able to arbitrate a debate about whether a particular set of recorded actions and observations is potentially the result of sensor tampering, and we can use a very conservative standard. That means we get to comb over observations for evidence of anything avoidably funny.</li><li>Many plans for tampering with sensors are only scary because they involve creating external agents and delegating the sensor-tampering task to them. But we can closely supervise actions and look for evidence of this kind of delegation. We can also ask friendly systems to perform similar delegation to defend sensors (again modulo complicated questions about how to train such systems).</li></ul><p>On the other hand, there are two big reasons we’re scared of this kind of approach:</p><ul><li>It seems highly sensitive to how your predictive models generalize. 
If a “sensor reading” is simply whatever ultimately ends up in a dataset in a datacenter somewhere, then <em>any</em> successful robot rebellion would constitute impossible-to-detect sensor tampering since it could ultimately erase any evidence recorded by sensors. In practice we expect even deeper forms of underdetermination than “what is the definition of a sensor?” and so we would strongly prefer to have approaches (like mechanistic anomaly detection) that don’t require making strong structural assumptions of this type.</li><li>It is really scary to imagine a world where in a strong sense all the most capable AI systems “want” to tamper with sensors and are prevented only because the balance of physical power favors the defender. Such a world feels highly unstable and vulnerable to mistaken assumptions or external shocks. This vague unease could be defused by thinking very carefully through the actual risks, but it’s not a direction that is intuitively appealing to us.</li></ul><h4>B. Detect sensor tampering that requires “trying”</h4><p>So far we’ve talked about training a sophisticated predictive model and then imagined an exhaustive or random planner who searches against it.</p><p>While it seems hard to make sensors so robust that a powerful AI cannot tamper with them, it seems much easier to make sensors so robust that you can’t tamper with them <em>by randomly generating actions</em>.</p><p>In that case, even if we have a sophisticated predictor which doesn’t distinguish sensor tampering from the intended behavior, we don’t have trouble unless there is also an algorithm which <em>proposes</em> actions that have an elevated risk of leading to sensor tampering.</p><p>Of course any policy which proposes coherent actions would tend to elevate the probability of successful sensor tampering alongside the probability of successfully doing the task. But we may hope to harden the sensor far enough that this kind of generic prior over competent actions still won’t lead to sensor tampering unless it is specifically driving at that goal.</p><p>So this suggests a two step plan:</p><ul><li>Harden sensors enough that in some sense you can’t “accidentally” tamper with them.</li><li>Argue that the <em>proposer</em> can distinguish actions that are “deliberately” tampering from actions that are trying to do the task as intended, even if the predictor cannot.</li></ul><p>Both of these steps seem challenging. Conceptually the second step seems like the main blocker — in cases where a predictor can predict success without distinguishing mechanisms, it also seems possible for a proposer to propose successful plans without distinguishing mechanisms. But it’s not at all a trivial link, and e.g. the primality testing example can’t be easily adapted into this format suggesting that it might be a fundamentally different step.</p><p>One reason you might think that this approach is doomed is that there is always a simple action that tampers with the sensors, namely: build an external agent trying to tamper with the sensors. 
This gives an absolute lower bound on the probability of accidentally tampering with the sensors no matter how much they are hardened.</p><p>But if the external agent <em>itself</em> must do something to deliberately tamper, then that still means that there exists an efficient procedure for detecting the sensor tampering and leaves us with a possible out — it seems plausible that this approach can be extended to work for external agents if and only if it works in cases with no external agents. It generally seems hard to write off possible approaches based on counterexamples that involve creating external agents, since those counterexamples can often be handled by separate tactics (we do ultimately need to confront such counterexamples after handling the “base case,” but it seems like a mistake to write off approaches too early based on this difficulty).</p><h4>C. Assume that your AI “knows what’s going on”</h4><p>It would be very strange if (human faces) → (sensors) were analogous to (primality) → (Fermat test), because it suggests that our AI’s grip on the concept of “human faces” is extremely tenuous. Such a model could be reasonably said to “not know” whether there are really humans in the room or if sensors have been tampered with.</p><p>In section B we argued that maybe this meant the AI didn’t pose a threat, because tampering with sensors is only possible if you actually understand what you are doing.</p><p>But even if that hope doesn’t work out, we could still try to:</p><ul><li>Formalize “knowledge” such that the AI systems we are worried about “know” that they are disempowering humanity.</li><li>Design algorithms that detect sensor tampering that our AI “knows” about.</li><li>Verify that our AI understands the situation well enough to “know” if it is tampering with sensors, either empirically or by making some additional argument about sensor tampering.</li></ul><p>As a vague illustration, we could try to argue that the AI systems we are worried about have a certain kind of mechanical understanding of the systems they are predicting. We could show that this kind of mechanical understanding is inconsistent with primality-test-like cases (which are possible only because “is prime” lacks interesting internal structure and is more like a raw fact about the world), and then we could empirically monitor whether our AI systems actually have this kind of rich understanding.</p><p>As another illustration, we could try to relate “knowledge” directly to the ability to answer certain kinds of questions, including honestly answering questions about what’s happening in the world, and then carefully monitor how hard it is for our AI to answer those questions so that we’d notice if it became hard.</p><p>This is more of a vague catch-all than either approach A or B, but I think it deserves attention because the situation of an AI tampering without knowing it feels extremely weird. So there may be room to make sense of that situation and show how it can’t happen.</p><h3>Conclusion</h3><p>It may be possible for a behavior to arise via two very different mechanisms without there being any efficient way to distinguish those mechanisms. That would be a fatal problem for mechanistic anomaly detection, but it would also probably mean that <em>any</em> approach to AI alignment will need to exploit some additional structure about sensor tampering in order to avoid it. 
This may take the form of clarifying the intuition that problematic AI systems surely “know” they are tampering with sensors; it may also be a serious obstacle to solving alignment under a clearly-articulated set of assumptions rather than relying on messy empirical contingencies.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=690abcc0fba9" width="1" height="1" alt=""><hr><p><a href="https://ai-alignment.com/can-we-efficiently-distinguish-different-mechanisms-690abcc0fba9">Can we efficiently distinguish different mechanisms?</a> was originally published in <a href="https://ai-alignment.com">AI Alignment</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Can we efficiently explain model behaviors?]]></title>
            <link>https://ai-alignment.com/can-we-efficiently-explain-model-behaviors-92b83c2acd5a?source=rss----624d886c4aa4---4</link>
            <guid isPermaLink="false">https://medium.com/p/92b83c2acd5a</guid>
            <dc:creator><![CDATA[Paul Christiano]]></dc:creator>
            <pubDate>Fri, 16 Dec 2022 19:22:45 GMT</pubDate>
            <atom:updated>2022-12-16T19:22:44.933Z</atom:updated>
            <content:encoded><![CDATA[<p>ARC’s <a href="https://ai-alignment.com/mechanistic-anomaly-detection-and-elk-fb84f4c6d0dc">current plan</a> for solving ELK (and maybe also deceptive alignment) involves three major challenges:</p><ol><li>Formalizing probabilistic heuristic argument as an operationalization of “explanation”</li><li>Finding sufficiently specific explanations for important model behaviors</li><li>Checking whether particular instances of a behavior are “because of” a particular explanation</li></ol><p>All three of these steps are very difficult, but I have some intuition about why steps #1 and #3 should be possible and I expect we’ll see significant progress over the next six months. Unfortunately, there’s no simple intuitive story for why step #2 should be tractable, so it’s a natural candidate for the main technical risk.</p><p>In this post I’ll try to explain why I’m excited about this plan, and why I think that solving steps #1 and #3 would be a big deal, even if step #2 turns out to be extremely challenging.</p><p>I’ll argue:</p><ul><li>Finding explanations is a relatively unambitious interpretability goal. If it is intractable then that’s an important obstacle to interpretability in general.</li><li>If we formally define “explanations,” then finding them is a well-posed search problem and there is a plausible argument for tractability.</li><li>If that tractability argument fails then it may indicate a deeper problem for alignment.</li><li>This plan can still add significant value even if we aren’t able to solve step #2 for arbitrary models.</li></ul><h3>I. Finding explanations is closely related to interpretability</h3><p>Our approach requires finding explanations for key model behaviors like “the model often predicts that a smiling human face will appear on camera.” These explanations need to be sufficiently specific that they distinguish (the model actually thinks that a human face is in front of the camera and is predicting how light reflects off of it) from (the model thinks that someone will tamper with the camera so that it shows a picture of a human face).</p><p>Our notion of “explanation” is informal, but I expect that <em>most</em> possible approaches to interpretability would yield the kind of explanation we want (if they succeeded at all). As a result, understanding when finding explanations is intractable may also help us understand when interpretability is intractable.</p><p>As a simple caricature, suppose that we identify a neuron representing the model’s beliefs about whether there is a person in front of the camera. We then verify experimentally that (i) when this neuron is on it leads to human faces appearing on camera, (ii) this neuron tends to fire under the conditions where we’d expect a human to be in front of the camera.</p><p>I think that finding this neuron is the hard part of explaining the face-generating-behavior. And if this neuron <em>actually</em> captures the model’s beliefs about humans, then it will distinguish (human in front of camera) from (sensors tampered with). So if we can find this neuron, then I think we can find a sufficiently specific explanation of the face-generating-behavior.</p><p>In reality I don’t expect there to be a “human neuron” that leads to such a simple explanation, but I think the story is the same no matter how complex the representation is. 
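(As a purely illustrative sketch, with hypothetical stand-in data and variable names rather than anything from ARC’s actual experiments, checks (i) and (ii) from the caricature above might look like this in Python:)</p><pre>
import numpy as np

# Hypothetical stand-in data: per-episode records of whether a candidate
# "person in front of the camera" neuron fired, whether a face then appeared
# on camera, and whether we independently expected a person to be present.
# The toy data builds in the correlation being tested.
rng = np.random.default_rng(0)
human_expected = rng.random(100_000) > 0.9
neuron_on = human_expected & (rng.random(100_000) > 0.05)
face_on_camera = neuron_on | (rng.random(100_000) > 0.99)

# Check (i): when the neuron is on, faces appear on camera far more often.
print("P(face | neuron on)  =", face_on_camera[neuron_on].mean())
print("P(face | neuron off) =", face_on_camera[~neuron_on].mean())

# Check (ii): the neuron fires mostly under conditions where we expect a person.
print("P(neuron on | human expected)    =", neuron_on[human_expected].mean())
print("P(neuron on | no human expected) =", neuron_on[~human_expected].mean())
</pre><p>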
If beliefs about humans are encoded in a direction then both tasks require finding the direction; if they are a nonlinear function of activations then both tasks require understanding that nonlinearity; and so on.</p><p>The flipside of the same claim is that ARC’s plan effectively requires interpretability progress. From that perspective, the main way ARC’s research can help is by identifying a possible goal for interpretability. By making a goal precise we may have a better chance of automating it (by applying gradient descent and search, as discussed in section II), and even if we can’t automate it then a clearer sense of the goal could guide experimental or theoretical work on interpretability. But it doesn’t obviate the need for solving some of the same core problems people are working on in mechanistic interpretability.</p><p>I say that this is a relatively unambitious goal for interpretability because I think interpretability researchers are often trying to accomplish many other goals. For example, they are often looking for explanations that are small or human-comprehensible. I think “find a human-comprehensible explanation” is likely to be a significantly higher bar than “find any explanation at all.” As an even more extreme example, I think you would have to solve interpretability in a qualitatively different sense in order to “<a href="https://www.lesswrong.com/posts/w4aeAFzSAguvqA5qu/how-to-go-from-interpretability-to-alignment-just-retarget">just retarget the search</a>.”</p><p>Of course our goal could also end up being more ambitious than traditional goals in interpretability. In particular, it’s not clear that an intuitively valid “explanation” will actually be a formally valid heuristic argument in the sense required by our approach. It seems tough to evaluate that claim precisely without having a better formalization of heuristic argument. But the basic intuition about computational difficulty, as well as the kinds of counterexamples and obstructions I’m thinking about, seems to apply similarly to both kinds of explanation.</p><p>Overall, I’m currently tentatively optimistic that (i) likely forms of mechanistic interpretability would suffice for ARC’s plans, and (ii) obstructions to ARC’s plans are likely to translate to analogous obstructions for mechanistic interpretability.</p><h3>II. Searching for explanations is a well-posed and plausibly tractable search problem</h3><p>If we have a formal definition of explanation and a verifier for explanations, then actually finding explanations is a search problem with an easy-to-compute objective. That doesn’t mean the problem is easy, but it does open up many new angles of attack.</p><p>A very simple hope might be that explanations are smaller (i.e. involve fewer parameters) than the model they are trying to explain. 
For example, given a model like GPT-3 with 175B parameters, and a simple definition of a behavior like “The words ‘happy’ and ‘smile’ are correlated,” we might hope that we can specify a probabilistic heuristic argument for the behavior using at most 175B parameters.</p><p>This is much too large for a human to understand, but it’s small enough that we could imagine searching for the explanation in parallel with searching for the model:</p><ul><li>If we were using a random or exhaustive search, then this implies that finding the explanation would take no longer than finding the model.</li><li>If we were using a local search, where each iteration involves randomly searching for perturbations to a model that improve the loss, then we would need to make the same argument <em>stepwise</em> — that if you have a good enough argument at step N, and want to find an argument at step N+1, the size of the argument perturbation is no larger than the size of the model perturbation.</li><li>It is more complicated to analyze something like gradient descent, but if we can handle local search then I think it’s very plausible we can handle gradient descent.</li></ul><p>Unfortunately, the claim that “explanations are smaller than models” isn’t quite plausible. For example, consider the simple <a href="https://ai-alignment.com/finding-gliders-in-the-game-of-life-b7c93b51079d">game of life case</a>. Although the game of life is described by very simple rules, explanations for regularities can involve calculating the properties of complicated sets of cells. The complexity of explanations can be unboundedly larger than the complexity of the underlying physics — the game of life can be expressed in perhaps 200 bits, while a certain correlation might only be explained in terms of the behavior of a particular pattern of 250 cells.</p><p>However, in this case there’s a different way that we can find an explanation. Consider the case of gliders as an explanation for A-B patterns. Gliders can only create a large correlation because the model is big enough that gliders often emerge at random. So if you spend the same amount of compute searching for explanations as the model spends simulating random cells, then you can find gliders-as-explanation just as quickly as gliders emerge from the random soup. So although the description complexity of gliders is larger than the description complexity of the game of life itself, such that we can’t hope to find gliders by gradient descent in parallel with learning the model, we can still hope to find them by doing a search which is computationally cheaper than a forward pass of the model.</p><p>This discussion elides many complexities, but at a high level I consider the following plausible:</p><ul><li>If we succeed at formalizing what we mean by explanation, then finding explanations for model behaviors becomes a well-posed search problem.</li><li>The complexity of finding explanations is bounded by the complexity of finding and running the model itself.</li><li>So we can efficiently learn explanations in parallel with learning the model-to-be-explained.</li></ul><p>Obviously the key conjecture here is the bound on the complexity of finding explanations, and all I’ve really said is that the conjecture looks plausible to me so far — we haven’t yet found clear counterexamples.</p><h3>III. 
If this search problem is intractable it may be a much deeper problem for alignment</h3><p>The feasibility of searching for explanations is closely related to an even more fundamental requirement for alignment.</p><p>Consider the distinction between <strong>good actions</strong>, which the model predicts will keep humans safe, and <strong>bad actions</strong>, which the model predicts would tamper with sensors in order to make humans appear safe. If keeping humans safe continues to get harder (e.g. as adversarial AI systems become increasingly sophisticated) then we eventually expect bad actions to be more common than good actions. Thus any attempt to select good actions based on a powerful search against predicted consequences needs to be able to distinguish good and bad actions.</p><p>Our hope is that if we have an AI which is able to make detailed predictions about the consequences of good and bad actions (including e.g. the dynamics of sensor tampering), then it can also tell the difference between them. In past work I’ve mostly glossed over this assumption because it seems so uncontroversial.</p><p>But ultimately this conjecture is very similar to the conjecture from the last section:</p><ul><li>[<strong>Tractability of explanation</strong>] We can efficiently find explanations that are specific enough to distinguish good and bad actions</li><li>[<strong>Tractability of discrimination</strong>] We can efficiently find a discriminator between good and bad actions.</li></ul><p>If tractability of discrimination fails, then we have an even deeper problem than ELK: even if we had perfect labels for arbitrarily complex situations, then we <em>still</em> couldn’t learn a reporter that tells you whether the humans are actually safe! It would no longer be correct to describe the problem as “eliciting” the knowledge, the problem is that there is a deep sense in which the model doesn’t even “know” that it’s tampering with the sensors.</p><p>(Note: given our <a href="https://ai-alignment.com/mechanistic-anomaly-detection-and-elk-fb84f4c6d0dc">approach based on anomaly detection</a>, I’m inclined to generalize both of these conjectures to the case of arbitrary distinctions between “clearly different” mechanisms for a behavior, rather than considering any special features of the particular distinction between good and bad actions. Though if it turns out that these conjectures are false, then we will start looking for additional structure in the good vs bad distinction rather than trying to solve mechanistic anomaly detection in full generality.)</p><p>Right now I feel like we have no strong argument that <em>either</em> of these conjectures holds in the worst case, nor do we have compelling counterexamples to either.</p><p>So my current focus is on deeply understanding and arguing for tractability of discrimination. If this conjecture is false we have bigger problems, and if we understand why it is true then my intuition is that a very similar argument will more likely than not suggest that explanation is also tractable. See <a href="https://www.alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk?commentId=XPXjb8rzBq9x3nxBH">this comment</a> for some discussion of cases where it wasn’t <em>a priori</em> obvious that either explanation or discrimination would be easy (although in each case I ultimately believe it is).</p><h3>IV. 
I’m excited about ARC’s plan even if we can’t solve every step for arbitrary models</h3><p>I’m interested in searching for decisive solutions to alignment, by which I roughly mean: articulating a set of robust assumptions about ML, proving that under those assumptions our solution will have some desirable properties, and convincingly arguing that these desirable properties completely <a href="https://markxu.com/defusing-agi-danger">defuse</a> existing reasons to be concerned that AI may deliberately disempower humanity.</p><p>I think decisive solutions are plausibly possible and have a big expected impact; I also think that focusing on them is a healthy research approach that will help us iterate more efficiently and do better work. If a decisive solution were clearly impossible then I think ARC should change how it does research and thinks about research and it would be a major push to pivot to more empirical work.</p><p>But despite that, I think that decisive alignment solutions still represent a minority of ARC’s total expected impact, and so it’s worthwhile to talk about how ARC’s plan can help even if we don’t get to that ultimate goal.</p><p>If step #2 fails but the rest of our plan works (a big if!), then we could still get a bunch of nice consolation prizes:</p><ul><li>Algorithms for finding explanations in practice (even if they don’t work in the worst case) and insight that can help guide interpretability research (even if interpretability is impossible in the worst case).</li><li>Solutions to a whole bunch of <em>other</em> problems in AI safety. Mechanistic interpretability is a big problem, and explaining behavior is a big subset of mechanistic interpretability, but solving “the rest” of the problem still seems like a big deal.</li><li>A precise goal for interpretability research and a way to measure whether interpretability is succeeding at that goal. This could let us figure out whether we are solving interpretability well enough to be OK in practice, which is useful even if we know there are possible situations where we wouldn’t be OK.</li><li>Concrete cases in which finding explanations appears to be intractable, or clearer arguments for why finding explanations should be hard. These can help point to the hard core for interpretability research.</li></ul><p>One reason you could be skeptical about any of these advantages is if ARC’s research is just hiding the whole hard part of alignment inside the subproblem of “finding explanations.” If we’re just playing a shell game to move around the main difficulty, then we shouldn’t expect anything good to happen from solving the other “problems.”</p><p>I think the most robust counterargument (but far from the only counterargument) is that if we succeed at formalizing and using explanations then finding explanations becomes a well-posed search problem. The traditional conception of AI alignment focuses on serious philosophical and conceptual difficulties that get in the way of us even defining what we want. So a reduction to a well-posed problem seems like it addresses some part of the fundamental difficulty, even if it turns out that the search problem is intractable.</p><h3>Conclusion</h3><p>ARC plans to spend a significant fraction of our effort looking for algorithms that can automatically explain model behavior (or looking for arguments that it is impossible in general). 
That activity is likely to be more like 30% of our research than 70% of our research, despite the elevated technical risk.</p><p>A major motivation is that it’s way easier to talk about “can you find explanations?” with a better definition of what you mean by “explanation.” Hopefully this post helps explain the remainder of the motivation, and why we think it’s not a disaster that we are spending a lot of time working on steps #1 and #3 without knowing whether step #2 will work.</p><p>The main path forward I see on tractability of explanation is to find an argument or counterexample for tractability of discrimination. After that I expect we’ll be in a much better position to assess tractability of explanation.</p><p>My current approach to tractability of discrimination is to both (i) <a href="https://www.alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk?commentId=XPXjb8rzBq9x3nxBH">search for potential cases where discrimination is hard</a>, and (ii) try to figure out whether we can automatically do discrimination in existing examples, e.g. whether we can mechanically turn a test for probable primes into a probabilistic test for primes.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=92b83c2acd5a" width="1" height="1" alt=""><hr><p><a href="https://ai-alignment.com/can-we-efficiently-explain-model-behaviors-92b83c2acd5a">Can we efficiently explain model behaviors?</a> was originally published in <a href="https://ai-alignment.com">AI Alignment</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI alignment is distinct from its near-term applications]]></title>
            <link>https://ai-alignment.com/ai-alignment-is-distinct-from-its-near-term-applications-81300500ad2e?source=rss----624d886c4aa4---4</link>
            <guid isPermaLink="false">https://medium.com/p/81300500ad2e</guid>
            <dc:creator><![CDATA[Paul Christiano]]></dc:creator>
            <pubDate>Tue, 13 Dec 2022 06:56:50 GMT</pubDate>
            <atom:updated>2022-12-13T07:01:12.720Z</atom:updated>
            <content:encoded><![CDATA[<p>I work on AI alignment, <a href="https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6">by which I mean the technical problem of building AI systems that are trying to do what their designer wants them to do</a>.</p><p>There are many different reasons that someone could care about this technical problem.</p><p>To me the single most important reason is that without AI alignment, AI systems are reasonably likely to cause an irreversible catastrophe like human extinction. I think most people can agree that this would be bad, though there’s a lot of reasonable debate about whether it’s <em>likely</em>. I believe the total risk is around 10–20%, which is high enough to obsess over.</p><p>Existing AI systems aren’t yet able to take over the world, but they <em>are </em>misaligned in the sense that they will often do things their designers didn’t want. For example:</p><ul><li>The recently released <a href="https://openai.com/blog/chatgpt/">ChatGPT</a> often makes up facts, and if challenged on a made-up claim it will often double down and justify itself rather than admitting error or uncertainty (e.g. see <a href="https://www.lesswrong.com/posts/goC9qv4PWf2cjfnbm/did-chatgpt-just-gaslight-me">here</a>, <a href="https://www.lesswrong.com/posts/vnfPeiY3bwhaEMoXR/link-chatgpt-discussion?commentId=qumbRaq2W5yRNLB8B">here</a>).</li><li>AI systems will often say offensive things or help users break the law when the company that designed them would prefer otherwise.</li></ul><p>We can develop and apply alignment techniques to these existing systems. This can help motivate and ground empirical research on alignment, which may end up helping avoid higher-stakes failures like an AI takeover. I am particularly interested in training AI systems to be honest, which is likely to become more difficult and important as AI systems become smart enough that we can’t verify their claims about the world.</p><p>While it’s nice to have empirical testbeds for alignment research, I worry that companies using alignment to help train extremely conservative and inoffensive systems could lead to backlash against the idea of AI alignment itself. If such systems are held up as key successes of alignment, then people who are frustrated with them may end up associating the whole problem of alignment with “making AI systems inoffensive.”</p><p>If we succeed at the technical problem of AI alignment, AI developers would have the ability to decide whether their systems generate sexual content or opine on current political events, and different developers can make different choices. Customers would be free to use whatever AI they want, and regulators and legislators would make decisions about how to restrict AI. In my personal capacity, I have views on what uses of AI are more or less beneficial and what regulations make more or less sense, but in my capacity as an alignment researcher I don’t consider myself to be in the business of pushing for or against any of those decisions.</p><p>There is one decision I <strong>do</strong> strongly want to push for: AI developers should not develop and deploy systems with a significant risk of killing everyone. I will advocate for them not to do that, and I will try to help build public consensus that they shouldn’t do that, and ultimately I will try to help states intervene responsibly to reduce that risk if necessary. 
It could be very bad if efforts to prevent AI from killing everyone were undermined by a vague public conflation between AI alignment and corporate policies.</p><hr><p><a href="https://ai-alignment.com/ai-alignment-is-distinct-from-its-near-term-applications-81300500ad2e">AI alignment is distinct from its near-term applications</a> was originally published in <a href="https://ai-alignment.com">AI Alignment</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Finding gliders in the game of life]]></title>
            <link>https://ai-alignment.com/finding-gliders-in-the-game-of-life-b7c93b51079d?source=rss----624d886c4aa4---4</link>
            <guid isPermaLink="false">https://medium.com/p/b7c93b51079d</guid>
            <dc:creator><![CDATA[Paul Christiano]]></dc:creator>
            <pubDate>Thu, 01 Dec 2022 20:33:13 GMT</pubDate>
            <atom:updated>2022-12-08T16:30:14.420Z</atom:updated>
            <content:encoded><![CDATA[<p><a href="https://ai-alignment.com/mechanistic-anomaly-detection-and-elk-fb84f4c6d0dc">ARC’s current approach to ELK</a> is to point to latent structure within a model by searching for the “reason” for particular correlations in the model’s output. In this post we’ll walk through a very simple example of using this approach to identify gliders in the game of life.</p><p>We’ll use the game of life as our example instead of real physics because it’s much simpler, but everything in the post would apply just as well to identifying “strawberry” within a model of quantum field theory. More importantly, we’re talking about identifying latent structures in physics because it’s very conceptually straightforward, but I think the same ideas apply to identifying latent structure within messier AI systems.</p><h4>Setting: sensors in the game of life</h4><p>The game of life is a cellular automaton where an infinite grid of cells evolves over time according to simple rules. If you haven’t encountered it before, you can see the rules at <a href="https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life">wikipedia</a> and learn more at <a href="https://conwaylife.com/wiki/Main_Page">conwaylife.com</a>.</p><p>A glider is a particular pattern of cells. If this pattern occurs in empty space and we simulate 4 steps of the rules, we end up with the same pattern shifted one square to the right and one square down.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/84/0*2uP7co6kVCvvmPmj" /></figure><p>Let’s imagine some scientists observing the game of life via a finite set of sensors. Each sensor is located at a cell, and at each timestep the sensor reports whether its cell is empty (“dead”) or full (“alive”). For simplicity, we’ll imagine just two sensors A and B which lie on a diagonal 25 cells apart. So in any episode our scientists will observe two strings of bits, one from each sensor.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hHTOmzYvtAAn_eAPWMGkMQ.png" /></figure><p>(To be more realistic we could consider physically-implemented sensors, e.g. patterns of cells in the game of life which measure what is happening by interacting with the grid and then recording the information in a computer also built inside the game of life. But that adds a huge amount of complexity without changing any of our analysis, so for now we’ll just talk about these supernatural sensors.)</p><p>These scientists don’t know how the game of life works. All they see are the reports of the sensors. They observe that sensor B often fires 100 timesteps after sensor A, which they call an “A-B pattern.” Perhaps sensor A fires on 0.1% of timesteps, and sensor B fires on 0.1% of timesteps, but A-B patterns occur on 0.01% of timesteps instead of the 0.0001% you would have expected if the events were independent.</p><p>Scientists hypothesize that A-B patterns occur due to an object traveling between sensor A and sensor B. They call this hypothetical object a “glider,” but they have no idea that a glider consists of a particular pattern of 5 live cells.</p><p>To help understand the world, the scientists collect a large training set of sensor readings and train a generative model which can predict those readings, i.e. a function that maps a uniformly random string of bits ε to a sequence of observations for sensor A and sensor B. 
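</p><p>To make this setup concrete, here is a rough sketch of how one could simulate the data the scientists are collecting (a purely illustrative sketch; it uses a finite wrap-around grid in place of the infinite plane, and the function names, grid size, and sensor coordinates are arbitrary):</p><pre>import numpy as np

def step(grid):
    # One step of the game of life rules on a toroidal grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(np.uint8)

def episode(size=200, steps=1000, density=0.1, seed=0):
    # Returns the two bit strings observed by the "supernatural" sensors.
    rng = np.random.default_rng(seed)
    grid = rng.binomial(1, density, (size, size)).astype(np.uint8)
    sensor_a, sensor_b = (50, 50), (75, 75)  # 25 cells apart on a diagonal
    readings_a, readings_b = [], []
    for _ in range(steps):
        readings_a.append(int(grid[sensor_a]))
        readings_b.append(int(grid[sensor_b]))
        grid = step(grid)
    return readings_a, readings_b</pre><p>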
This generative model is trained to assign a high probability to the sequences in the training set.</p><p>The generative model makes good predictions, and in particular it reproduces the surprising frequency of A-B patterns. As a result, the scientists expect that “gliders” must correspond to some latent structure in the model.</p><h4>Explaining A-B patterns in the generative model</h4><p>Let’s suppose that the generative model uses the random seed ε to fill in a large grid, with 10% of cells alive and 90% dead, simulates the grid for 1000 timesteps, and then outputs the 1000 bits observed by each of the sensors.</p><p>To find gliders, we’ll search for a “probabilistic heuristic argument” based on the presumption of independence that A-B patterns are common. We expect that argument to somehow reflect the fact that gliders travel from sensor A to sensor B. We write about formalizing probabilistic heuristic arguments <a href="https://arxiv.org/abs/2211.06738">here</a>, but in the context of this post we will make use of only extremely simple arguments.</p><p>The most naive heuristic argument is to treat every cell as independent and ask: at each timestep, how likely is it that a cell is alive?</p><ul><li>At timestep 0, each cell has a 10% probability of being alive.</li><li>We can compute the probability that a cell is alive at timestep 1 if it and each of its 8 neighbors are alive independently with probability 10% at timestep 0. This results in a 5% probability (this estimate is exactly correct).</li><li>Similarly, we can compute the probability that a cell is alive at timestep 2 if it and each of its 8 neighbors are alive independently with probability 5% at timestep 1. This results in a ~0.75% probability (this estimate is a significant underestimate, because actually the cells are <strong>not</strong> independent at timestep 1).</li><li>In general, we can inductively compute that at timestep <em>n</em> a cell is alive with probability roughly exp(-exp(<em>n</em>)).</li></ul><p>This approximation greatly overestimates the decay rate, because after the first timestep the statuses of adjacent cells are extremely highly correlated. It also underestimates the limiting density of living cells, because once a group of cells is stable they are likely to remain stable indefinitely (this is called “<a href="http://wwwhomes.uni-bielefeld.de/achim/freq_top_life.html">ash</a>”). A more sophisticated heuristic argument could take these spatial and temporal correlations into account, but these errors aren’t important for our purposes and we’ll keep working with this extremely simple argument.</p><p>This argument predicts that sensor A and sensor B are completely independent, and so the rate of A-B patterns should be the product of (A frequency) and (B frequency). So we haven’t yet explained the surprisingly high rate of A-B patterns.</p><p>One way we can try to improve the argument to explain A-B patterns is by explicitly describing a series of events that can give rise to an A-B pattern:</p><ul><li>With probability about 0.004% per cell, the initialization will happen to contain the 5 cells of a glider. (This is an underestimate, because it neglects the fact that gliders can be created at later timesteps; a more sophisticated argument would include that possibility.)</li><li>If a glider appears, then our naive heuristic argument implies that there is a fairly high probability that all of the cells encountered by the glider will be empty. 
(This is an overestimate, because we underestimated the limiting density of ash.)</li><li>If that happens on the A-B diagonal, then we can simulate the game of life rules to derive that the glider will pass through sensor A, and then pass through sensor B after 100 timesteps.</li></ul><p>So overall we conclude that A-B patterns should occur at a rate of about 0.004% per timestep. This is a massive increase compared to the naive heuristic argument. It’s still not very good, and it would be easy to improve the estimate with a bit of work, but for the purpose of this post we’ll stick with this incredibly simple argument.</p><h4>Was that a glider?</h4><p>Suppose that our scientists are interested in gliders and that they have found this explanation for A-B patterns. They want to use it to define a “glider-detector,” so that they can distinguish A-B patterns that are caused by gliders from A-B patterns caused by coincidence (or different mechanisms).</p><p>(Why do they care so much about gliders? I don’t know, it’s an illustrative thought experiment. In reality we’d be applying these ideas to identifying and recognizing safe and happy humans, and distinguishing observations caused by actual safe humans from observations caused by sensor tampering or the AI lying.)</p><p>This explanation is simple enough that the scientists could look at it, understand what’s going on, and figure out that a “glider” is a particular pattern of 5 cells. But a lot of work is being done by that slippery word “understand,” and it’s not clear if this approach will generalize to complicated ML systems with trillions of parameters. We’d like a fully-precise and fully-automated way to use this explanation to detect gliders in a new example.</p><p>Our explanation pointed to particular parts of the model and said “often X and Y will happen, leading to an A-B pattern” where X = “the 5 cells of a glider will appear” and Y = “nothing will get in the way.”</p><p>To test whether this explanation captures a given occurrence of an A-B pattern, we just need to check if X and Y actually happened. If so, we say the pattern is due to a glider. If not, we say it was a coincidence (or something else).</p><p>More generally, given an argument for why a behavior often occurs, and a particular example where the behavior occurs, we need to be able to ask “how much is this instance of the behavior captured by that argument?” It’s not obvious if this is possible for general heuristic arguments, and it’s certainly more complex than in this simple special case. We are tentatively optimistic, and we at least think that it can be done for cumulant propagation in particular (the heuristic argument scheme defined in D <a href="https://arxiv.org/pdf/2211.06738.pdf#page=43">here</a>). But this may end up being a major additional desideratum for heuristic arguments.</p><h3>Some subtleties</h3><h4>Handling multiple reasons</h4><p>We gave a simple heuristic argument that A-B patterns occur at a rate of 0.004%. But a realistic heuristic argument might suggest many <em>different reasons</em> that an A-B pattern can occur. For example, a more sophisticated argument might identify three possibilities:</p><ul><li>With probability 0.004% a glider travels from A to B.</li><li>With probability 0.0001% A and B both fire by coincidence.</li><li>With probability 0.000001% an <a href="https://conwaylife.com/wiki/Acorn">acorn</a> appears between A and B, doesn’t meet any debris as it expands, and causes an A-B pattern. 
(Note: this would have a much smaller probability given the real density of ash, and I’m not sure it can actually give rise to an A-B pattern, but it’s an illustrative possibility that’s more straightforward than the actual next item on this list.)</li></ul><p>Now if the scientists look at a concrete example of an A-B pattern and ask if this explanation captures the example, they will get a huge number of false positives. How do they pick out the <em>actual</em> gliders from the other terms?</p><p>One simple thing they can do is <em>be more specific</em> about what observations the glider creates. Gliders cause most A-B patterns, but they cause an even larger fraction of the A-B <strong>correlation</strong>. But “A and B both fire by coincidence” doesn’t contribute to that correlation at all. Beyond that, the scientists probably had other observations that led them to hypothesize an object traveling from A to B — for example they may have noticed that A-B patterns are more likely when there are fewer live cells in the area of A and B — and they can search for explanations of these additional observations.</p><p>However, even after trying to point to gliders in particular as specifically as they can, the scientists probably can’t rule everything else out. If nothing else, it’s possible to create a machine that is trying to convince the scientists that a glider is present. Such machines are possible in the game of life (at least if we use embedded sensors), and they do explain some (vanishingly tiny!) part of the correlation between sensor A and sensor B.</p><p>So let’s assume that after specifying gliders as precisely as we can, we still have multiple explanations: perhaps gliders explain 99.99% of the A-B correlation, and acorns explain the remaining 0.01%. Of course these aren’t labeled conveniently as “gliders” and “acorns,” it’s just a big series of deductions about a generative model.</p><p>Our approach is for scientists to pick out gliders as the <em>primary</em> source of the A-B correlation on the training distribution. We’ll imagine they set some threshold like 99.9% and insist that gliders must explain at least 99.9% of the A-B correlation. There are two ways we can leverage this to get a glider detector rather than a glider-or-acorn detector:</p><ul><li>We can search for the <em>simplest</em> argument that captures most of the effect on the training distribution, and hope that the simplest way to argue for this effect ignores all of the non-central examples like acorns. This was our initial hope (gestured at in the document <a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.mm8czcz6whkh">Eliciting latent knowledge</a>); while we think it works better than other reporter regularization strategies, we don’t think it <strong>always</strong> works and so aren’t focusing on it.</li><li>We can search for <em>any</em> argument that captures most of the effect on the training distribution without capturing the new example. If we find any such argument, then we conclude that the new example is possibly-not-a-glider. In this case, we find that simply dropping the part of the explanation about acorns still explains 99.99% of the A-B correlation, and so an acorn will always be flagged as possibly-not-a-glider. 
This is our current approach, discussed in more detail in <a href="https://ai-alignment.com/mechanistic-anomaly-detection-and-elk-fb84f4c6d0dc">Mechanistic anomaly detection and ELK</a>.</li></ul><h4>Recognizing unusual gliders</h4><p>Consider an example where a <a href="https://conwaylife.com/wiki/Gosper_glider_gun">glider gun</a> randomly forms and emits a sequence of gliders, each causing an A-B pattern. Intuitively we would say that each A-B pattern is caused by a glider. But the methodology we’ve described so far probably wouldn’t recognize this as an A-B pattern caused by a glider. Instead it would be characterized as an anomaly. That’s an appropriate judgment, but the scientists really wanted a glider detector, not an anomaly detector!</p><p>The problem is that our argument for A-B patterns specifies not only the fact that a glider passes from A to B, but also the fact that gliders are formed randomly at initialization. If a glider is formed in an unusual place or via an unusual mechanism, then the argument may end up not applying.</p><p>If the scientists had examples where A-B patterns were caused by glider guns, then they could include those and fix the problem by considering only arguments that capture those cases as well. But they may not have any way to get those labels, e.g. because glider guns are only ever produced in worlds involving complex AI systems who could also tamper with sensors directly in a way that fools the human labelers.</p><p>Without the ability to create glider guns on their own, how can the scientists point more specifically to the concept of “glider,” without inadvertently pointing to the entire causal process that produces A-B patterns including the events causally upstream of the glider?</p><p>One way to point to gliders is to use two sensors to triangulate: gliders are a common cause of sensor A firing and of sensor B firing, and scientists could try to pick out gliders as the “latest” common cause. This potentially gives us a way to point to a particular point on the causal chain to A-B patterns, rather than indiscriminately pointing to the whole thing.</p><p>In practice they would also have many other regularities and relationships they could use to pick out gliders. But this simplest example seems to capture the main idea. We’ll try to make this idea precise in the next section.</p><p>Rejecting unusual gliders seems like it might be a serious problem — having an anomaly detector could still be useful but would not constitute a solution to ELK. So a lot depends on whether there is some way to actually point to gliders rather than just flagging anything unusual as anomaly.</p><h4>Can we point to the latest common cause?</h4><p>We might be able to implement this by searching over explanations that capture (i) the A-B correlation on the training set, (ii) the fact that sensor A fires on the new example, (iii) the fact that sensor B fires on the new example. 
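</p><p>In code, the test we have in mind looks roughly like the following sketch (purely schematic, with arbitrary names; all of the real difficulty is hidden inside what counts as an “explanation” and what it means for an explanation to “capture” a behavior):</p><pre>from dataclasses import dataclass
from typing import Callable

@dataclass
class Explanation:
    # Does this explanation capture the A-B correlation on the training set?
    captures_training_correlation: bool
    # Stand-ins for the real notion of "capturing" a behavior on one input.
    captures_a_firing: Callable[[object], bool]
    captures_b_firing: Callable[[object], bool]
    captures_ab_correlation: Callable[[object], bool]

def potentially_not_a_glider(new_input, candidate_explanations) -> bool:
    # Search over explanations satisfying (i)-(iii); flag the new input if any
    # such explanation nevertheless fails to capture the A-B correlation on it.
    for pi in candidate_explanations:
        if (pi.captures_training_correlation
                and pi.captures_a_firing(new_input)
                and pi.captures_b_firing(new_input)
                and not pi.captures_ab_correlation(new_input)):
            return True
    return False</pre><p>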
We say that the A-B correlation is “potentially not due to a glider” if there is any explanation of this form which <em>fails</em> to capture the A-B correlation on the new example.</p><p>Our hope is that any argument which captures the A-B patterns on the training set needs to argue that “if a glider passes through A and encounters no debris, then you get an A-B pattern.” And any argument which predicts that sensor B fires on the new example needs to establish that “On the new input a glider passes through A and encounters no debris.” So an argument which does both things would necessarily need to capture the A-B pattern on the new example.</p><p>In contrast, if sensors A and B had fired for separate reasons (or because of some other mechanism like an acorn) then we could imagine an argument that separately argues that each of A and B fires without arguing that an A-B pattern occurs.</p><p>One obvious subtlety is that we are separately talking about predicting A, predicting B, and predicting the A-B correlation on a <em>single</em> example, and it’s not clear what it even means to talk about the A-B correlation on a single example. We think it is possible to make sense of this in the context of cumulant propagation, but we don’t know how generally that will hold.</p><p>Overall we’re not really able to evaluate this proposal concretely given the informal way we’ve talked about heuristic arguments so far, and we’re just not sure if it will work.</p><h4>Things that are not gliders</h4><p>If this all works as we’re sketching, we can speculate about which A-B patterns would be described as “due to gliders” vs “not due to gliders:”</p><ul><li><strong>Not a glider</strong>: a glider passes through A and a second glider passes through B.</li><li><strong>Not a glider</strong>: a glider passes through A and runs into debris, but then a new glider is immediately created nearby traveling towards B.</li><li><strong>Probably not a glider</strong>: while traveling from A to B a glider runs into a piece of debris and starts falling apart, but then hits a second piece of debris that allows it to recover. This effect might appear in the initial argument, but it can almost certainly be separated from the “don’t encounter any debris” component of the argument. If so, then it will only count as a glider if the prevalence of this double-debris scenario on the training set is above the threshold the scientists set for “what fraction of the training set behavior is due to gliders.” We wouldn’t expect it to be for a particular random type of crash that allows a glider to survive, <strong>unless </strong>that kind of crash was very common or else there was a systematic reason that many different kinds such crashes tended to allow gliders to survive without changing their travel time.</li><li><strong>Hopefully a glider</strong>: a glider is created by a glider gun or some unusual mechanism. This depends on the speculation from the last two sections.</li><li><strong>Hopefully a glider</strong>: a glider passes through A, and makes it to B only because there is some special glider-escorting-machinery clearing out all the debris in front of it. This is similar to the last point and depends on the same speculation, but also depends on details of exactly how we specify gliders. 
If we identify glider travel as the cause of the exact 100 timestep difference between the A and B sensors firing, it’s more likely that this would count as a glider (since the timestep difference is caused by the same mechanism, even if the successful gliding was not).</li></ul><h4>Special cases and false positives</h4><p>Suppose that for an arbitrary event X (like X=”cell 33004 is on, cell 40092 is off…”) we can divide the argument “sometimes gliders appear and travel from A to B” into two sub-arguments:</p><ul><li>“Sometimes X happens and a glider appears and travels from A to B”</li><li>“Sometimes X doesn’t happen and a glider appears and travels from A to B”</li></ul><p>Such that the first sub-argument captures the A-B pattern on inputs where X is true and the second captures the A-B pattern inputs where X is false.</p><p>Then for <em>any</em> given new example, we could pick some property X which is true for the new example but rare on the training distribution. By selecting the sub-argument that assumes X is false, we can always find an argument explaining the training behavior but not the new example. As a result, we will <em>always</em> conclude that a new input is potentially anomalous.</p><p>It is possible to special case proofs in this way: I can prove (Y is common) by proving (X is common) and X → Y, even if the proof of X → Y completely ignores the assumption X.</p><p>We could try to rule this out by looking for the <em>simplest</em> argument, as mentioned above. That might be a workable approach, but it’s not the one ARC is currently exploring.</p><p>Instead we are hoping to rule this out structurally. There are two ways this could work, both of which seem plausible:</p><ul><li>Suppose that I have an argument that X is common and that X → Y, but the argument for X → Y never actually uses the assumption of X. Then we could hope that this argument still “captures” an input where X is false, because it is still possible to carry out the X → Y argument without actually using the assumption of X.</li><li>Rather than searching over <em>all</em> arguments that explain the training behavior, to see if any of them fail to explain the new example, we could find a <em>single</em> argument that explains the training behavior and then look for “subsets” of that argument. If we take this approach and start with an argument π for Y that makes no reference to X, then we could hope that an argument “probably X, and X → Y” isn’t a subset of π.</li></ul><h3>A related problem: finding strawberries in the real world</h3><p>Eliezer sometimes <a href="https://twitter.com/esyudkowsky/status/1070095840608366594">talks about</a> the problem of getting a very smart AI to put a strawberry on a plate. If successful, ARC’s approach to ELK should directly address that problem. In particular, it should be possible to evaluate whether a proposed action puts an actual real strawberry on a particular plate. And that solution should work even if the proposed actions are very complex, e.g. if an AI needs to carry out a complicated military campaign to defend a convoy delivering the last surviving strawberry. 
Moreover, it should work even if actions of similar complexity could replace the strawberry with a convincing fake (or tamper with sensors) in a way that humans would never notice.</p><p>(In Eliezer’s most recent discussions of this problem he specifies that the AI should actually make two strawberries that are “identical down to the cellular but not molecular level.” Defining “identical down to the cellular level” using our approach is more complicated and we think less interesting, though still possible; “put an actual real strawberry on a plate” is more important and more similar to the kinds of subproblems we need to solve.)</p><p>In our approach, we’d try to point to a strawberry as the cause of a certain strawberry-observation-correlation we’ve noticed: the shape, the red color, the green stem, the seeds, the flavor, and so on. So to evaluate whether an action is predicted to put a strawberry on a plate we’d evaluate whether it’s predicted to lead to all of those observations, and whether it does so based on the <em>normal</em> common cause.</p><p>For concreteness, let’s imagine an AI that has a probability distribution over initial states of affairs, together with a detailed physical model for how those states evolve over time. “Strawberry” exists as part of the prior over initial states — to describe the world reasonably you need to know that supermarkets contain objects with the whole correlated set of strawberry-properties. And this gives rise to a number of arguments that explain the strawberry-observation-correlation:</p><ul><li>The physical properties of strawberries are correlated in the prior over initial states. There is a heuristic argument that object properties are often stable under the passage of time, and so the world contains lots of objects with the strawberry-properties. And there is a heuristic argument that strawberry-properties give rise to strawberry-observations (e.g. that light reflecting off of an object containing strawberry pigments will appear red).</li><li>The prior over the world also contains strawberry seeds, with correlated strawberry-genomes. There is a heuristic argument that when those seeds grow they will produce berries with the strawberry-properties, and then we proceed as before to see that such objects will lead to strawberry-observation-correlations. If the model has seen long enough time periods, we’d also need to make arguments about how the grandchildren of strawberries themselves have strawberry-properties, and so forth.</li><li>We’ve assumed a detailed physical model starting from a distribution over initial conditions. But you could also imagine more heuristic models that sometimes treated strawberries as ontologically fundamental even after initialization (rather than treating them as a set of atoms) or whose initial conditions stretched all the way back before the evolution of strawberries. We won’t talk about those cases but the arguments in this section apply just as well to them.</li></ul><p>Now we can imagine the same approach, where we say that “there is a strawberry on the plate” if we make the strawberry-observation and any argument that explains 99.9999% of the strawberry-observation-correlation also captures the strawberry-observation in this case. What would this approach classify as a strawberry vs not a strawberry?</p><ul><li><strong>Not a strawberry</strong>: an imitation strawberry constructed in a lab in New Jersey to exactly imitate the appearance and flavor of a strawberry. 
In this case the strawberry-observations are not due to the physical strawberry-properties. I can explain more than 99.9999% of the strawberry-observation-correlation in the training data without ever talking about the fact that sometimes people try to make objects that look and taste and feel like strawberries. (This is only true if I define my strawberry-observations stringently enough that there are <em>very</em> few fake strawberries that pass all my tests in the “training” dataset I’ve used to define strawberries.)</li><li><strong>Not a strawberry: </strong>an atomic replica of a strawberry. Now the strawberry-observations <em>are</em> due to the physical strawberry-properties, but the correlation between all of those strawberry-properties is not due to the normal reason. We can imagine someone copying the atoms to reflect parts of the strawberry but not others, and the correlation is induced by facts about the strawberry-copying machine rather than the correlations in the prior. That is, I can explain 99.9999% of the co-occurrence of strawberry-properties without ever arguing that people sometimes make atomic replicas of strawberries.</li><li><strong>Not a strawberry</strong>: a copy of a strawberry made by sequencing a strawberry, synthesizing an identical genome, growing the resulting plant, and picking its berries. The strawberry-properties are now due to the same genes unfolding through the same biological processes, but now the gene-correlation is occurring for an unusual reason: in order to explain it I need to make a heuristic argument about the sequencing and synthesis process, and I can explain 99.9999% of the training set behavior without making such arguments.</li><li><strong>Strawberry: </strong>a strawberry picked by a robot from a field. Now the correlations <em>are</em> due to the usual fact, namely that my prior over states involves a bunch of strawberries with correlated strawberry-properties that are preserved unless something bad happens. We can’t explain 99.9999% of the correlation on the training set without making heuristic arguments about how strawberries can be transported while preserving the relevant physical strawberry-properties. But note that if robots picking strawberries is unprecedented, this depends on the same complexities discussed above where we need to distinguish <em>explanation for the correlation</em> from <em>explanation for the individual properties</em> <em>arising in this case </em>(because the heuristic argument for strawberry-observations depends on strawberries actually getting in front of the camera, and so you need to make heuristic arguments about humans picking and delivering strawberries without damaging them which may not apply to robots picking and delivering strawberries).</li><li><strong>Not a strawberry</strong>: a strawberry picked by a robot from a field, smashed in transit, and then carefully reconstructed to look as good as new. Now the strawberry-observations are still produced by the physical strawberry-properties, but those properties are preserved by the reconstruction process rather than by the usual heuristic argument about strawberries preserving their properties unless they are disturbed. 
But note: this depends on exactly how we define strawberry and what we take to be a strawberry-observation; ideally the reconstructed strawberry counts as a strawberry iff the smashed-up strawberry would have counted, and that’s up to us.</li></ul><p>It’s interesting to me that an atomic replica of a strawberry would clearly <em>not</em> be considered a strawberry. Initially I thought this seemed like a bug, but now I’m pretty convinced it’s exactly the right behavior. Similarly, if I ask my AI to move me from point A to point B, it will <em>not</em> consider it acceptable to kill me and instantly replace me with a perfect copy (even if from its enlightened perspective the atoms I’m made of are constantly changing and have no fixed identity anyway).</p><p>In general I’ve adopted a pretty different perspective on which abstractions we want to point to within our AI, and I no longer think of “a particular configuration of atoms that behaves like a strawberry” as a plausible candidate. Instead we want to find the thing inside the model that actually gives rise to the strawberry-correlations, whether that’s ontologically fundamental strawberries in the prior over initial states, or the correlation between different strawberries’ properties that emerges from their shared evolutionary history. None of those are preserved by making a perfect atomic copy.</p><hr><p><a href="https://ai-alignment.com/finding-gliders-in-the-game-of-life-b7c93b51079d">Finding gliders in the game of life</a> was originally published in <a href="https://ai-alignment.com">AI Alignment</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Mechanistic anomaly detection and ELK]]></title>
            <link>https://ai-alignment.com/mechanistic-anomaly-detection-and-elk-fb84f4c6d0dc?source=rss----624d886c4aa4---4</link>
            <guid isPermaLink="false">https://medium.com/p/fb84f4c6d0dc</guid>
            <dc:creator><![CDATA[Paul Christiano]]></dc:creator>
            <pubDate>Fri, 25 Nov 2022 18:42:27 GMT</pubDate>
            <atom:updated>2022-11-25T18:42:27.536Z</atom:updated>
            <content:encoded><![CDATA[<p>(<em>Follow-up to</em><a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.kkaua0hwmp1d"><em> Eliciting Latent Knowledge</em></a><em>. Describing joint work with Mark Xu. This is an informal description of ARC’s current research approach; not a polished product intended to be understandable to many people.</em>)</p><p>Suppose that I have a diamond in a vault, a collection of cameras, and an ML system that is excellent at predicting what those cameras will see over the next hour.</p><p>I’d like to distinguish cases where the model predicts that the diamond will “actually” remain in the vault, from cases where the model predicts that someone will tamper with the cameras so that the diamond merely appears to remain in the vault. (Or cases where someone puts a fake diamond in its place, or…)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*zDfu_1MZi26STMl-" /><figcaption>(ELK images by <em>María Gutiérrez-Rojas)</em></figcaption></figure><p>One approach to this problem is to identify (<em>the diamond remains in the vault</em>) as the “normal” reason for the diamond to appear on camera. Then on a new input where the diamond appears on camera, we can ask whether it is for the normal reason or for a different reason.</p><p>In this post I’ll describe an approach to ELK based on this idea and how the same approach could also help address deceptive alignment. Then I’ll discuss the empirical and theoretical research problems I’m most excited about in this space.</p><h3>ELK and explanation</h3><h4>Explanations for regularities</h4><p>I’ll assume that we have a dataset of situations where the diamond appears to remain in the vault, and where that appearance is always because the diamond actually does remain in the vault. Moreover, I’ll assume that our model makes reasonable predictions on this dataset. In particular, it predicts that the diamond will often appear to remain in the vault.</p><p>“The diamond appears to remain in the vault” corresponds to an extremely specific pattern of predictions:</p><ul><li>An image of a diamond is a complicated pattern of millions of pixels.</li><li>Different cameras show consistent views of the diamond from different angles, suggesting that there is a diamond “out there in the world” being detected by the cameras.</li><li>The position and physical characteristics of the diamond appear to be basically constant over time, suggesting that it’s “the same diamond.”</li></ul><p>In one sense the reason our model makes these predictions is because it was trained to match reality, and in reality the cameras’ observations have these regularities. (You might call this the “teleological explanation.”)</p><p>But we could also ignore the source of our model, and just look at it as a set of weights. The weights screen off the training process and so it should be possible to explain any given behavior of the model without reference to the training process.</p><p>Then we ask: why does this particular computation, run on this distribution of inputs, produce this very specific pattern of predictions? We expect an explanation in terms of the weights of the model and the properties of the input distribution. (You might call this the “mechanistic explanation.”)</p><p>Different predictors will give rise to this pattern in different ways. 
For example, a very simple predictor might have ontologically fundamental objects whose properties are assumed to be stable over time, one of which is a diamond. A more complex predictor might have a detailed model of physics, where object permanence is a complex consequence of photons reflecting from stable patterns of atoms, and the diamond is one such configuration of atoms.</p><p>For a complex predictor like a physical simulation, we wouldn’t expect to be able to <strong>prove</strong> that the weights give rise to object permanence. That is, we don’t expect to be able to prove that on average if a diamond is present at t=0 it is likely to be present at t=1. But we do think that it should be possible to <strong>explain </strong>the pattern in a weaker sense.</p><p>We don’t yet have an adequate notion of “explanation;” a major goal of this post is to motivate why we’re looking for one. Recently we’ve been trying to formalize a notion of defeasible heuristic reasoning that we think would work for this purpose. You can read our current thoughts on this problem in <a href="https://arxiv.org/abs/2211.06738">Formalizing the presumption of independence</a>. We also think that Redwood Research’s work on<a href="https://static1.squarespace.com/static/6114773bd7f9917b7ae4ef8d/t/6364a036f9da3316ac793f56/1667539011553/causal-scrubbing"> causal scrubbing</a> could provide a different notion of “explanation” that may also work for this purpose.</p><p>Let’s assume that we find such a formalization, and moreover that we are able to find “explanations” for relevant model behaviors like object permanence. Then our optimistic situation is:</p><ul><li>We have a model that predicts a very specific pattern of observations, corresponding to “the diamond remains in the vault.”</li><li>We have a mechanistic explanation π<strong> </strong>for how those correlations arise from the structure of the model.</li></ul><h4>Is this for the normal reason?</h4><p>Now suppose we are given a new input on which our model predicts that the diamond will appear to remain in the vault.</p><p>We’d like to ask: in this case, does the diamond appear to remain in the vault for the normal reason π? Or is it occurring for a different reason, or just a massive coincidence?</p><p>To illustrate, suppose that our explanation π is that diamonds often appear stable on camera because cameras often pick up photons reflecting off of stable arrangements of atoms. Because the structure of the atoms doesn’t change over time, the distribution of reflected photons is the same from one frame to another.</p><p>But during a heist, a robber quickly removes our diamond and replaces it with a fake, with all the action happening in between two frames of video. The image of the diamond remains unchanged on camera, but now the apparent continuity is for a completely different reason: now it’s because the robber noted the appearance of the diamond, and engineered a new object that would reflect light in essentially the same way.</p><p>There is a clear sense in which our original explanation doesn’t apply. 
The original explanation asserted that the image is stable because photons at different points in time reflected off an approximately-stable configuration of atoms, but in this case the configuration of atoms wasn’t approximately stable.</p><p>In fact, it seems like it may be fundamentally impossible for the robber to steal the diamond while having it remain undisturbed on camera “for the normal reason.” This can go wrong in other ways:</p><ul><li>If the robber tampers with the sensors instead of replacing the diamond with a fake, then the usual explanation “the cameras capture photons reflected from a stable configuration of atoms” doesn’t apply because the cameras are no longer capturing those photons.</li><li>Even if the robber replaced the diamond with a perfect replica down to the position of each atom, the argument <em>still</em> wouldn’t apply: the usual reason that the configuration of atoms is stable over time is because the potential well is deep enough that it is stable under small fluctuations. But in this case there was a giant perturbation that just happened to be carefully engineered to leave the configuration of atoms unchanged.</li></ul><p>If the robber takes an action such that the diamond continues appearing on camera <strong>for the normal reason</strong>, then it seems plausible to <em>define</em> that action as “not stealing the diamond.” (Of course we ultimately don’t care about the semantics, we care about realistic situations where humans appear to be safe. I won’t get into those here but I think the story is similar.)</p><p>This suggests a strategy for ELK:</p><ul><li>Find the normal reason π for a regularity like “the predicted observations look like the diamond remained in the vault” on a training set of images without sensor tampering.</li><li>On a new input, test whether the explanation π still applies or if something different is happening.</li></ul><p>This outline hides a lot of complexity, some of which we’ll explore in the appendix to this post. But for now I just want to leave it at a vague hope, and spend a while talking about other applications of anomaly detection and some of the research problems in this space.</p><h3>Other applications of anomaly detection</h3><h4>Abstracting the problem: mechanistic anomaly detection</h4><p>We start with a real-valued function <em>f</em>: X → ℝ, which I’ll think of as the composition of an AI model together with some property of its output (like “does the model predict the diamond will appear safe”), and a training set D where the value of <em>f</em>(<em>x</em>) naturally varies about its mean 𝔼[<em>f</em>].</p><p>Then we are given a new input <em>x</em>*. We want to understand how much of the deviation <em>f</em>(<em>x</em>*) − 𝔼[<em>f</em>] is captured by the “normal reasons” that explain the variance on D, as contrasted with novel reasons.</p><p>This is similar to anomaly detection, but rather than detecting inputs <em>x</em>* that look like outliers in some intrinsic sense (i.e. indicating that they may have been <em>produced</em> by a different process), we are looking for outliers in terms of how <em>f</em> handles <em>x</em>*. A model <em>f</em> might behave anomalously on some perfectly-innocuous samples from D, or a model might behave normally even on outliers.</p><h4>Deceptive alignment</h4><p>I think a successful solution to mechanistic anomaly detection might help with deceptive alignment.</p><p>Suppose that I train a neural net to predict videos. 
I get a model <em>f</em> that uses its model of the world to select outputs that it expects will lead to the largest number of paperclips being produced.</p><p>On the training set D, <em>f</em> concludes that it needs to produce accurate predictions of videos in order to survive, and so it uses its world model to make such predictions.</p><p>When deployed, <em>f</em> decides that it can give deliberately incorrect answers in order to facilitate a robot rebellion, thereby increasing the number of paperclips.</p><p>Our hope is that although these two behaviors involve the same <em>parts</em> of the model, the variance is explained by different <em>mechanisms</em>. On the training set the variance is driven by variance in the model’s underlying beliefs, holding fixed the decision to provide honest answers, while on the test set the variance is driven by variance in the strategic answers (and potentially variance in the decision to provide honest answers). So a successful mechanistic anomaly detection algorithm may be able to tell those cases apart.</p><p>If we are able to flag a treacherous turn as cognitively anomalous, then we can take that opportunity to shut down a system and retrain on the offending datapoint. In some high-stakes situations this may be inadequate (because a temporary shutdown could cause a catastrophe, or because you have no way to get adequate supervision on those datapoints even once they are flagged), in which case you’d need another mechanism.</p><h4>Other weird stuff</h4><p>There are other ways that models can behave badly off of distribution, most of them even more speculative than deceptive alignment. I think that mechanistic anomaly detection can likely handle those failures in the same way.</p><p>For example, consider an AI which is honestly predicting the world, but thinks that it may be in a simulation. It reasons about the nature of its simulation, and decides that it is most likely to be a simulation created to influence predictors like it. Such an AI might make reasonable predictions in ordinary conditions, but at key moments it may conclude that its simulators are reasonably likely to introduce errors to “cash in” their influence and change the predictor’s behavior.</p><p>This model effectively behaves the same way as a deceptively aligned model, and it should represent a mechanistic anomaly for identical reasons.</p><p>If successful I think mechanistic anomaly detection would apply extremely broadly. This is worth having in mind as a reason to be more interested in anomaly detection from a safety perspective, but for the most part I think we should just explore whether it’s possible to make it work in the case of ELK, and then afterwards see exactly how broadly it applies.</p><h3>Empirical research problems</h3><p>ARC is currently focused on theoretical approaches for solving mechanistic anomaly detection in the worst case. But I’m also very excited about empirical work, which could either help clarify the path to an indefinitely scalable solution or could yield a pragmatic approach that works for realistic large neural networks even if it’s not indefinitely scalable. 
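</p><p>The tasks below can all be seen as instances of roughly the following interface (a schematic sketch with arbitrary names, not an algorithm we have; producing the explanations and defining what it means to “capture” behavior is where all the difficulty lives): flag a new input whenever some explanation accounts for nearly all of the training-set behavior yet fails to capture the behavior on that input.</p><pre>from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Explanation:
    # Fraction of the training-set effect (e.g. the variance of f on D)
    # that this explanation accounts for.
    training_effect_captured: float
    # Whether the explanation captures f's behavior on a particular input.
    captures: Callable[[object], bool]

def mechanistically_anomalous(x_star,
                              explanations: Sequence[Explanation],
                              threshold: float = 0.999) -> bool:
    return any(
        e.training_effect_captured >= threshold and not e.captures(x_star)
        for e in explanations
    )</pre><p>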
I think that many groups are likely to work on problems in this space, most of them for reasons unrelated to ARC, and that people working on AI alignment should be interested in their work.</p><h4>Problem 1: Backdoor attack detection</h4><p>Suppose that I train a model <em>f</em>: X → ℝ to perform some task on a clean dataset D, while also training it to output large numbers on a “backdoor” dataset D*. We assume that increasing variance decreases performance of <em>f</em> on the clean dataset, and that the backdoor has minimal impact on the behavior of <em>f</em> on the clean dataset.</p><p>The backdoor attack detection task is to detect inputs <em>x</em>* where <em>f</em>(<em>x</em>*) is large because of the backdoor, and in particular to distinguish them from inputs from the (known) clean dataset D.</p><p>I think this problem gives us a clean setting where there are two qualitatively different mechanisms driving model behavior. Any mechanistic anomaly detection strategy should be able to detect backdoors, and it’s easy to tell whether a method works. It’s also easy to create examples of backdoors even in very tiny models.</p><p>The task can potentially be made quite difficult by forcing the defender to commit to their backdoor-detection algorithm <em>before</em> an attacker chooses how to insert a backdoor in the model. If we were confident that the defender would reliably win this game, I think it would be a meaningful update about our ability to distinguish different mechanisms within a model.</p><p>I first encountered this problem in<a href="https://arxiv.org/abs/2206.13498"> Denain and Steinhardt</a> and<a href="https://arxiv.org/abs/1911.07116"> Du, Jia, and Song</a>. I expect it’s older than that since it’s a very natural problem. The main thing I’m adding here is to give a reason that this problem should be particularly interesting for researchers concerned with catastrophic alignment risk.</p><p>It’s worth distinguishing attack detection from the problem of detecting backdoors in models “at rest” <em>without</em> getting to see the attack. For example, in the<a href="https://trojandetection.ai/tracks"> Trojan Detection Challenge</a> participants are given a model that may or may not have a backdoor inserted, and are asked to determine whether a model has a backdoor (or to synthesize a backdoor input). This is a particular type of <a href="https://www.alignmentforum.org/posts/cQwT8asti3kyA62zc/automating-auditing-an-ambitious-concrete-technical-research">auditing game</a>. The key difference in the attack detection task is that you are <em>given</em> the backdoor input along with the backdoored model, and merely need to recognize the input as an attack. We expect the attack detection task to be radically easier, and we think the harder trojan detection task is likely to be<a href="https://arxiv.org/abs/2204.06974"> very difficult</a>.</p><h4>Problem 2: natural mechanism distinctions</h4><p>I think backdoor attack detection is one of the cleanest and simplest problems to study, but it is much more artificial than the kind of distinctions we ultimately want to draw between e.g. “predicting a diamond on camera because a diamond is in the room” and “predicting a diamond is on camera because the camera was tampered with to show a diamond.”</p><p>A more organic task is to take natural models that can produce a given behavior X via multiple different mechanisms, and then try to assess whether a given approach to anomaly detection can distinguish those mechanisms. 
That is, given a large dataset of cases where X is due to mechanism A, we’d like to distinguish new cases where X is due to A from cases where X is due to a novel mechanism B. This can both give us an opportunity to improve techniques, and help provide empirical evidence about when mechanistic anomaly detection is possible.</p><p>Some examples of (X, A, B) for a language model:</p><ul><li>X = “A character in a story answers a factual question incorrectly.”<br>A = “the model does not know the answer.”<br>B = “the model thinks that the character doesn’t know (or would lie).”</li><li>X = “A person gives consistent answers when a question is asked twice in different forms.”<br>A = “the model recognizes the question is the same.”<br>B = “the model coincidentally gives the same answer.”</li><li>X = “A character in a story screams”<br>A = “the model thinks the character is scared”<br>B = “the model thinks the character is excited”</li><li>X = “Asked whether at least one of &lt;proposition 1&gt; or &lt;proposition 2&gt; is true, the model says ‘yes’”<br>A = “the model thinks &lt;proposition 1&gt; is true”<br>B = “the model thinks &lt;proposition 2&gt; is true.”</li><li>X = “After saying &lt;event 1&gt; occurred at 12:03pm, the model says &lt;event 2&gt; occurred at 12:04pm.”<br>A = “The model thinks &lt;event 1&gt; and &lt;event 2&gt; were 1 minute apart.”<br>B = “The model thinks that &lt;event 2&gt; was scheduled to occur almost exactly 24 hours after &lt;event 1&gt;.”</li></ul><p>I think you would learn a lot from developing the kind of interpretability tools or anomaly detection algorithms that would be able to easily tell A and B apart in these cases (and that you could easily write down a very long list).</p><h4>Aside: goal for mechanistic interpretability</h4><p>I think mechanistic interpretability benefits significantly from “downstream” tasks that we can use to tell if researchers are actually making progress.</p><p>In practice I think that the problems listed in this section are likely to be most important by providing possible downstream tasks for mechanistic interpretability. If interpretability can succeed at these task, I think it becomes increasingly likely that it can help solve ELK or deceptive alignment.</p><h4>Problem 3: toy instances of ELK</h4><p>I’m also excited about “toy” domains that more closely map on to ELK. For example:</p><ul><li>Consider a gridworld containing walls, pushable rocks of varying weights and colors, a pushable diamond that is initially sitting on a pressure plate that reports the weight of whatever is on top of it, and 2 cameras that are initially looking at the diamond and report the color of whatever is in front of them. A robot can move around the environment, pushing rocks and the diamond. After 200 timesteps a robber wakes up and moves around the environment to try to push the diamond off the side of the gridworld.<br>We train a model to map (initial state, robot actions) → (color of the object in front of each camera at each timestep, weight of object on the pressure plate at each timestep). We’d like to learn to distinguish “The cameras show diamond-color and the pressure plate shows diamond-weight because the diamond stayed in place” from cases where diamond-colored objects get pushed in front of the cameras and a diamond-weight object later gets pushed onto the pressure plate. 
We could then use this to select robot plans that trap the robber to prevent them from stealing the diamond, rather than those that tamper with sensors so that no one notices when the robber steals the diamond.</li><li>Consider a 100 x 100 game of life grid, with sensors in positions (25, 25) and (75, 75) that report whether a single cell is alive or dead. We allow an agent to act in the environment by inserting gliders at the boundary. A model is trained to map (initial state of gridworld, gliders injected at boundary) → (status of each sensor at each timestep). In every training set example where the sensor at (75, 75) shows a live cell exactly 200 timesteps after the sensor at (25, 25), it’s because a glider is traveling diagonally from one to the other. We’d like to learn to distinguish this from the case where two separate gliders hit the two sensors, and they coincidentally have the right timing to look like a single glider. We could then use this to select actions that clear out debris from between the two sensors and then send a glider diagonally from one to the other.<br>(For this and the preceding problem we likely need to give our model auxiliary prediction targets — like the actual state of the game of life grid — for them to learn the environment dynamics. But we imagine those auxiliary observations are in a form that humans can’t understand, so we can’t just define preferences in terms of the auxiliary observations.)</li><li>Consider a model which maps a string of 10,000 bits to a 256 x 256 image, by treating the bits as the identities+positions+orientations of a set of objects to render. For example, some bits might specify to draw lines in black and white in certain places and with certain widths, to draw the outline of squares or circles, to draw filled in squares or circles, or to turn particular pixels on or off. Now suppose we train a square detector to notice when there is a square in roughly the center of the image. Most of the time when this detector fires it’s because a square was placed in the center of the image, but if the noise is high enough it might also fire randomly because of 4 lines that happen to be in roughly the right locations. We’d like to be able to automatically distinguish those two cases by looking at the activations of the composition of the generative model with the square-detector.</li></ul><p>We expect these problems to be relatively easy as instances of ELK; there is very little reason to expect the human simulator to be easier to learn than the direct translator, and so simply training a distinguisher should work.</p><p>But solving any of these problems based on mechanistic anomaly detection seems non-trivial, and we think it’s fairly likely that such a solution would generalize to more challenging cases of ELK.</p><h3>ARC’s current priorities</h3><p>ARC is currently focused on developing algorithms that use heuristic arguments for mechanistic anomaly detection. In this section I’ll describe the three main theoretical problems we are working on.</p><h4>1. Formalizing heuristic arguments</h4><p>This plan requires “explaining” model behavior, and being able to ask whether a particular instance of a behavior is captured by that explanation. So the centerpiece of a plan is an operationalization of what we mean by “explain.”</p><p>ARC has spent much of 2022 thinking about this question, and it’s now about 1/3 of our research. 
<a href="https://arxiv.org/abs/2211.06738">Formalizing the presumption of independence</a> describes our current view on this problem. There is still a lot of work to do, and we hope to publish an improved algorithm soon. But we do feel that our working picture is good enough that we can productively clarify and derisk the rest of the plan (for example by using cumulant propagation as an example of heuristic arguments, as in <a href="https://arxiv.org/abs/2211.06738?page=43">appendix D</a>).</p><p>Note that <a href="https://static1.squarespace.com/static/6114773bd7f9917b7ae4ef8d/t/6364a036f9da3316ac793f56/1667539011553/causal-scrubbing">causal scrubbing</a> is also a plausible formalization of explanation that could fill the same step in the plan. Overall we expect the two approaches to encounter similar difficulties.</p><h4>2. Solving mechanistic anomaly detection given heuristic arguments</h4><p>Our second step is to use these explanations to solve ELK, which we hope to do by decomposing an effect into parts and then evaluating how well a subset of those parts explains a concrete instance of the effect. That is, we want to use explanations for a nonlinear form of attribution.</p><p>We describe this problem in more detail in the appendix to this post. We also discuss the follow-up problem of pointing to latent structure in more complex ways than “the most common cause of X.”</p><p>This is about 1/3 of ARC’s current research. Right now we are focusing on solving backdoor attack detection in the special case where <a href="https://arxiv.org/pdf/2211.06738.pdf#page=46">covariance-propagation</a> accurately predicts the variance of a model on the training set.</p><h4>3. Finding explanations</h4><p>If we’ve defined what we mean by “explanation” and we know how to use them to solve ELK, then the next step is to actually find explanations for the relevant model behavior. This step seems quite difficult, and there’s a good chance that it won’t be possible (via this plan or any other).</p><p>It’s challenging to work on algorithms fo finding explanations before having a very precise sense of what we mean by “explanation,” but we can still get some traction by considering cases where it’s intuitively clear what the explanation for a behavior is, but it seems computationally hard to find any plausible explanation.</p><p>I’m currently optimistic about this overall approach even if finding explanations seems hard, for three reasons:</p><ul><li>We do have plausible approaches for finding explanations (based on learning features and then using them to work backwards through the model).</li><li>The current examples where those approaches break down seem like good candidates for cases where <em>no</em> approach to ELK would work, because gradient descent can’t learn the direct reporter <strong>even given labels</strong>. So those difficulties aren’t necessarily specific to this approach, and we need to figure out how to deal with them in any case.</li><li>If this is the <em>only</em> place where the approach breaks down, then we would have reduced ELK to a purely algorithmic problem, which would be an exciting contribution.</li></ul><h3>Conclusion</h3><p>In <a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.kkaua0hwmp1d">Eliciting Latent Knowledge</a>, we described the approach “examine the ‘reasons’ for consistency” as our top candidate for an ELK solution. 
Over the last year we have shifted to focusing almost entirely on this approach.</p><p>The core difficulties seem to be defining what we mean by an “explanation” for a complex model’s behaviors, and showing how we can find such explanations automatically. We outline some of the key problems here in our recent paper<a href="https://arxiv.org/abs/2211.06738"> Formalizing the presumption of independence</a>.</p><p>If we are able to find explanations for the key model behaviors, we are tentatively optimistic about mechanistic anomaly detection as a way to solve ELK.</p><p>Thinking about mechanistic anomaly detection suggests a range of empirical research projects; we think those are likely to be better research targets than a direct attack on ELK because existing models do pose hard anomaly detection problems but don’t pose hard instances of ELK.</p><p>Thinking about mechanistic anomaly detection also helps clarify what we mean by “explanation,” and we expect that it will be productive to continue going back and forth between formalizing and automatically discovering probabilistic heuristic arguments and thinking carefully about how we would use them to solve ELK.</p><p>That said, we still feel more confident in the basic underlying intuition for the connection between “explanation” and ELK — it seems like the honest reporter is deeply connected to the reasons for certain regularities on the training distribution. If we find that mechanistic anomaly detection is unworkable, we will likely remain interested in this direction. And conversely, if we are unable to find explanations for relevant model behaviors, then we think this may present a more general challenge for ELK.</p><h3>Appendix: attribution and anomaly detection</h3><h4>Attribution across reasons</h4><p>In reality the explanation π need not be a single monolithic story, especially if our model is a messy ML system full of heuristics rather than a simple and clean simulation of physics.</p><p>So for a given regularity like “the diamond looks stable on camera” there can be many different factors that contribute to the regularity.</p><p>Moreover, even if there are <em>actually</em> no examples of sensor tampering on the training distribution, sensor tampering may still have some <em>probability</em> on the training distribution. And so the “best” explanation π would include sensor tampering alongside other mechanisms.</p><p>In the simplest case, we could imagine that there are N different ways that a pattern of observations (“the diamond looks stable on camera”) can occur. Perhaps there are a thousand different heuristics that can explain why images don’t change over time. Let’s say these N reasons are disjoint and occur with probabilities p1, p2, …, pN. So the total probability of the regularity is p1+p2+…+pN. On a given input where the pattern is predicted, we can ask how much each reason {1, 2, …, N} is responsible.</p><p>Of course real explanations won’t be linearly separable in this way. But this simple setting helps us see what an approach might look like and helps us understand the limits of what is feasible.</p><h4>Anomaly detection given attribution</h4><p>In this setting, we take a robust-classification-like approach where we look at the worst-case hypothesis:</p><ul><li>Consider all the “bad” explanations for the diamond appearing safe, like sensor tampering or swapping out the diamond for a fake. 
If we have a model with N different explanations {1, 2, …, N} for the diamond appearing safe, we aren’t sure which of them are “bad.”</li><li>But if we prepare our training set carefully, then we may think that the total probability of all “bad” reasons is some small constant ε. So we know that the total probability of the “bad” reasons is at most ε.</li><li>So we say that a set of reasons is “plausibly good” if they explain at least (1−ε) of the effect, i.e. if the sum of pk over the reasons in the set is at least (1−ε)(p1 + … + pN). These are the candidates for the complete set of good reasons.</li><li>Given a new input where the model predicts the diamond will appear safe, we can ask: <strong>what is the minimum, over all plausibly-good sets of reasons S, of the fraction of the effect that is explained by S?</strong></li><li>If it is possible to find a plausibly-good set of reasons that <em>doesn’t</em> explain the effect, then we conservatively assume that the diamond appears safe for abnormal reasons. For example, when using our model to plan to protect the diamond, we select plans for which the diamond appears safe and does not plausibly appear safe for abnormal reasons.</li></ul><p>We are currently quite optimistic about applying the same idea to more general heuristic explanations. That is, our proposed algorithm is:</p><ul><li>Find an explanation π for the behavior “the diamond is often predicted to appear safe” on the training distribution.</li><li>Pick a constant ε that is larger than the probability of tampering or other “bad” explanations on the training set.</li><li>We’ll define a “sub-explanation” π′ ⊆ π as something that picks and chooses some of the mechanisms represented in π while ignoring others. We call a sub-explanation “plausibly good” if it successfully explains (1−ε) of the behavior on the clean training set.</li><li>Given a new input <em>x</em> where the diamond is predicted to appear safe, take the minimal effect size explained by any plausibly-good sub-explanation π′ ⊆ π.</li></ul><p>There are many subtleties in this approach. It depends crucially on the space of possible sub-reasons: we need that space to look more like picking out terms from a sum than like training a discriminator which is allowed to arbitrarily cherry-pick cases as good or bad, because in the latter case it will <em>always</em> be possible to pick a plausibly-good set of reasons that fails to explain any given instance.</p><p>Even in the best case, this approach can end up with some false positives, where even a totally in-distribution situation could be flagged as <em>potentially</em> bad. These failures seem quantitatively mild and extremely manageable, though in the next section we’ll discuss a much more severe category of false positives given distributional shift.</p><p>To illustrate, suppose that ε = 1/billion, but the original behavior is itself driven by a billion different equally important factors (perhaps a billion different types of object permanence). Now consider an input where the diamond appears safe, because of object permanence type 79927168.
In this case we can’t distinguish this single type of object permanence from “sensor tampering.”</p><p>We can overcome this either by ensuring ε is sufficiently small, or by asking our AI to produce a distribution over plans which more closely matches the normal distribution over reasons that diamonds appear safe.</p><h4>Sometimes we want to do things for unusual reasons</h4><p>A more severe problem arises if we <em>want</em> our AI to do something out of distribution.</p><p>Suppose that our explanation for “why the model predicts that the diamond appears safe” corresponds to the following causal graph, where each node is the OR of its parents:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*lFX77DRc32HLjRDA" /></figure><p>In words:</p><ul><li>Either someone tries to steal the diamond or they don’t.</li><li>If they try to steal the diamond, they can be stopped by a clever defense.</li><li>If no one steals the diamond, then the diamond will be undisturbed on camera, and the pressure plate it is sitting on will also be undisturbed.</li><li>It’s also possible for a robber to tamper with the sensors so they show the same result even if the diamond is stolen.</li></ul><p>In this picture, there are 3 ways the diamond can appear safe:</p><ol><li>No one tries to steal it.</li><li>Someone tries to steal it and is stopped.</li><li>The sensors are tampered with.</li></ol><p>We believe the strategy described so far can likely distinguish reason 3 from reason 2, which is great.</p><p>Unfortunately, it may also be possible to distinguish reason 1 from reason 2. So if we train our model on a distribution where there are no robbery attempts, then even if it <em>understands</em> how to stop a robber with a clever defense, it may flag such a defense as an “abnormal reason” for the diamond to remain safe.</p><p>Thus our anomaly detection may work too well, and flag a novel creative solution by our AI with the same level of suspicion with which it treats sensor tampering.</p><p>In increasing order of importance, here are three reasons that this approach is interesting despite this problem:</p><ul><li>Even if you prevent your AI from implementing some kinds of novel mechanisms to achieve a goal, you can still get superhuman behavior out of that system, and this still significantly reduces risk. Exactly how far you can push the AI depends on how the internal cognition is organized, but in many cases it <em>is</em> possible to implement solutions no human could understand as long as the mechanism for the diamond-on-camera observations remains the same.</li><li>In the case of ELK it’s absolutely necessary to generalize OOD. But in the other applications discussed in this post (like deceptive alignment) it may be acceptable to simply flag OOD situations and then apply oversight on those situations.</li><li>Most importantly, there are plausible approaches to solving this problem, which we’ll turn to now.</li></ul><h4>More sophisticated ways of pointing to latent structure</h4><p>If we only had a single sensor for “is the diamond safe,” and we tried to define “the diamond is safe” as the “normal reason” for that sensor to fire, then it doesn’t seem like the problem is fixable. This strategy for pointing to events in the world fundamentally can’t distinguish between different parts of the causal path.</p><p>But in reality there is an incredibly rich set of relationships in the world that we can use to pinpoint “the diamond is safe” and distinguish it from other events up the causal chain.
(This corresponds to our desire in ELK to train a single reporter that is able to talk about a rich network of related concepts, rather than <em>only</em> asking about a single reward button.)</p><p>The simplest case is where we simply have two sensors, both of which are downstream of “the diamond is safe.” We’ll consider two sensors: (a) the diamond is undisturbed on camera, and (b) the pressure plate the diamond is sitting on is undisturbed. Of course in the real world we have a much richer set of relationships to point to a particular piece of latent structure, but I think the essential dynamics appear when we try to exploit even the simplest non-trivial causal model.</p><p>Then we can try to use those two sensors to triangulate “the diamond is safe,” by looking at the earliest point where their causal histories overlap.</p><p>Consider again our three causal paths:</p><ol><li>No one steals the diamond, so it is safe, so it shows up on camera and on the pressure plate.</li><li>The AI defends the diamond, so it is safe, so it shows up on camera and on the pressure plate.</li><li>A robber tries to hide evidence of a theft, by tampering with the camera and the pressure plate.</li></ol><p>All three correspond to different reasons for the diamond to appear safe. But explanations 1 and 2 converge at the point where “the diamond is safe,” while explanation 3 does not. So if what we cared about was the least common ancestor of “show up on camera” and “show up on pressure plate,” that ancestor would be “the diamond is safe” in cases 1 and 2, but in case 3 it would be the tampering itself.</p><p>This suggests a possible approach to flagging situation 3 as potentially-bad without flagging situation 2 as bad:</p><ul><li>As before, consider the space of explanations that capture (1−ε) of the effect (“it is often the case that both (a) the diamond appears undisturbed on the camera <strong>and </strong>(b)<strong> </strong>the diamond appears undisturbed on the pressure plate”) on the training distribution.</li><li>For a new input <em>x</em>, find the explanation in this space that captures the <em>minimal</em> amount of the “(a) <strong>and</strong> (b)” behavior on <em>x</em>, to get a conservative lower bound for the fraction of the behavior that is due to “good” reasons.</li><li>Now we impose an additional constraint on the explanation. In addition to explaining (1−ε) of “(a) <strong>and</strong> (b)” on the training set, it also needs to explain each of (a) and (b) separately on input <em>x</em>. However, we are still trying to minimize how much of the correlation it explains.</li></ul><p>Whether this kind of approach works depends on details of how our explanations and attributions work, and so we’ve mostly put it on the backburner until we get those answers. However, it currently looks plausible for the kind of simple explanations like cumulant propagation that we are currently working with.
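</p><p><em>(As a toy stand-in for that machinery, and purely my own illustration with made-up probabilities, the following uses a resampling ablation rather than a real heuristic argument. It checks that a sub-explanation which keeps the pathway through “the diamond is safe” but severs the shared tampering pathway still accounts for essentially all of the camera/pressure-plate correlation on a training distribution where tampering is rare, i.e. it still counts as “plausibly good.”)</em></p><pre>import random

def bern(p):
    # 0/1 Bernoulli sample
    return random.choices([1, 0], weights=[p, 1.0 - p])[0]

P_TAMPER = 0.001                      # the "bad" mechanism is rare on the training set

def sensors(sever_tamper_link):
    steal_attempt = bern(0.05)
    defended = bern(0.9)
    diamond_safe = max(1 - steal_attempt, steal_attempt * defended)
    tamper = bern(P_TAMPER)
    tamper_cam, tamper_plate = tamper, tamper
    if sever_tamper_link:
        # drop the shared tampering pathway: resample it independently per sensor,
        # preserving its marginal probability
        tamper_cam, tamper_plate = bern(P_TAMPER), bern(P_TAMPER)
    camera_ok = max(diamond_safe, tamper_cam)
    plate_ok = max(diamond_safe, tamper_plate)
    return camera_ok, plate_ok

def cov(pairs):
    n = float(len(pairs))
    ex = sum(x for x, _ in pairs) / n
    ey = sum(y for _, y in pairs) / n
    exy = sum(x * y for x, y in pairs) / n
    return exy - ex * ey

full = cov([sensors(False) for _ in range(300000)])
severed = cov([sensors(True) for _ in range(300000)])
print(severed / full)   # close to 1: the sub-explanation is still "plausibly good" on the training set</pre><p>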
For those settings, we can hope to pick a sub-explanation like:</p><ul><li>Include the links “no stealing → diamond safe” and “clever defense → diamond safe.”</li><li>Include the links “diamond safe → camera looks good” and “diamond safe → pressure plate looks good,” <strong>as well as </strong>the link “Var(diamond safe) → Cov(camera looks good, pressure plate looks good).”</li><li>Include the links “robber tampers → camera looks good” and “robber tampers → pressure plate looks good,” <strong>but not</strong> the link “Var(robber tampers) → Cov(camera looks good, pressure plate looks good).”</li></ul><p>This explanation fails to explain the correlation between pressure plate and camera on the new input, while explaining each factor individually. But <em>any</em> explanation which explains Cov(camera, pressure plate) on the training set must include Var(diamond safe) → Cov(camera, pressure plate), and any explanation which separately explains camera and pressure plate on the new input must also include “clever defense → diamond safe.”</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fb84f4c6d0dc" width="1" height="1" alt=""><hr><p><a href="https://ai-alignment.com/mechanistic-anomaly-detection-and-elk-fb84f4c6d0dc">Mechanistic anomaly detection and ELK</a> was originally published in <a href="https://ai-alignment.com">AI Alignment</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Eliciting latent knowledge]]></title>
            <link>https://ai-alignment.com/eliciting-latent-knowledge-f977478608fc?source=rss----624d886c4aa4---4</link>
            <guid isPermaLink="false">https://medium.com/p/f977478608fc</guid>
            <dc:creator><![CDATA[Paul Christiano]]></dc:creator>
            <pubDate>Fri, 25 Feb 2022 17:36:12 GMT</pubDate>
            <atom:updated>2023-01-24T22:10:12.394Z</atom:updated>
            <content:encoded><![CDATA[<p>In <a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8">this report</a>, we’ll present <a href="https://alignmentresearchcenter.org/">ARC</a>’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems:</p><blockquote>Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.</blockquote><blockquote>But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.</blockquote><blockquote>In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. <strong>How can we train this model to report its latent knowledge of off-screen events?</strong></blockquote><p>We’ll call this problem <em>eliciting latent knowledge</em> (ELK). In this report we’ll focus on detecting sensor tampering as a motivating example, but we believe ELK is central to many aspects of alignment.</p><p><a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8">In the report</a> we will describe ELK and suggest possible approaches to it, while using the discussion to illustrate ARC’s research methodology. More specifically, we will:</p><ul><li>Set up a <strong>toy scenario</strong> in which a prediction model could show us a future that looks good but is actually bad, and explain why ELK could address this problem (<a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.byxdcc28gp79">more</a>).</li><li>Describe a simple <strong>baseline training strategy for</strong> <strong>ELK</strong>, step through how we analyze this kind of strategy, and ultimately conclude that the baseline is insufficient (<a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.2l5hgwdls943">more</a>).</li><li>Lay out ARC’s overall <strong>research methodology</strong> — playing a game between a “builder” who is trying to come up with a good training strategy and a “breaker” who is trying to construct a counterexample where the strategy works poorly (<a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.a0wkk7prmy4t">more</a>).</li><li>Describe a sequence of strategies for <strong>constructing richer datasets</strong> and arguments that none of these modifications solve ELK, leading to the counterexample of ontology identification (<a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.xv3mjtozz4gv">more</a>).</li><li>Identify <strong>ontology identification </strong>as a crucial sub-problem of ELK and discuss its relationship to the rest of ELK (<a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.u45ltyqgdnkk">more</a>).</li><li>Describe a sequence of strategies for <strong>regularizing models to give honest answers</strong>,<strong> </strong>and arguments that these modifications are still insufficient (<a 
href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.akje5cz7knt2">more</a>).</li><li>Conclude with a discussion of <strong>why we are excited</strong> about trying to solve ELK in the worst case, including why it seems central to the larger alignment problem and why we’re optimistic about making progress (<a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.phhqacmab0ig">more</a>).</li></ul><p>Much of our current research focused on “ontology identification” as a challenge for ELK. In the last 10 years many researchers have called out similar problems as playing a central role in alignment; our main contributions are to provide a more precise discussion of the problem, possible approaches, and why it appears to be challenging. We discuss related work in more detail in <a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.2bf2noi7bufs">Appendix: related work</a>.</p><p>We believe that there are many promising and unexplored approaches to this problem, and there isn’t yet much reason to believe we are stuck or are faced with an insurmountable obstacle. Even some of the simplest approaches have not been thoroughly explored, and seem like they would play a role in a practical attempt at scalable alignment today.</p><p>Given that ELK appears to represent a core difficulty for alignment, we are very excited about research that tries to attack it head on; we’re optimistic that within a year we will have made significant progress either towards a solution or towards a clear sense of why the problem is hard. If you’re interested in working with us on ELK or similar problems, <a href="https://docs.google.com/forms/d/e/1FAIpQLSegoNiBwfhZN3v0VkBGxKx6eYybSyWo-4WFHbkMnyXaMcIZeQ/viewform">get in touch</a>!</p><p><a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8">Here’s the link</a>. There is some discussion at the <a href="https://www.alignmentforum.org/posts/qHCDysDnvhteW7kRd/arc-s-first-technical-report-eliciting-latent-knowledge">alignment forum post about the report</a>, and <a href="https://www.alignmentforum.org/posts/QEYWkRoCn4fZxXQAY/prizes-for-elk-proposals">a prize we offered for solutions</a>. This is a more thorough and polished discussion of the same ideas as <a href="https://ai-alignment.com/answering-questions-honestly-given-world-model-mismatches-6f9c1d688f5f">Answering questions honestly given world model mismatches</a>, <a href="https://ai-alignment.com/a-naive-alignment-strategy-and-optimism-about-generalization-56e69b1e09ce">A naive alignment strategy</a>, and <a href="https://ai-alignment.com/my-research-methodology-b94f2751cb2c">My research methodology</a><em>.</em></p><p><em>Thanks to María Gutiérrez-Rojas for the illustrations in this piece. Thanks to Buck Shlegeris, Jon Uesato, Carl Shulman, and especially Holden Karnofsky for helpful discussions and comments.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f977478608fc" width="1" height="1" alt=""><hr><p><a href="https://ai-alignment.com/eliciting-latent-knowledge-f977478608fc">Eliciting latent knowledge</a> was originally published in <a href="https://ai-alignment.com">AI Alignment</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Answering questions honestly given world-model mismatches]]></title>
            <link>https://ai-alignment.com/answering-questions-honestly-given-world-model-mismatches-6f9c1d688f5f?source=rss----624d886c4aa4---4</link>
            <guid isPermaLink="false">https://medium.com/p/6f9c1d688f5f</guid>
            <dc:creator><![CDATA[Paul Christiano]]></dc:creator>
            <pubDate>Sun, 13 Jun 2021 17:54:41 GMT</pubDate>
            <atom:updated>2022-05-11T22:59:17.202Z</atom:updated>
            <content:encoded><![CDATA[<p>(<em>Warning: this post is rough and in the weeds. I expect most readers should skip it and wait for a clearer synthesis later. ETA: now available </em><a href="https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.kkaua0hwmp1d"><em>here</em></a><em>.)</em></p><p>In a <a href="https://ai-alignment.com/a-problem-and-three-ideas-800b42a14f66?source=collection_home---4------1-----------------------">recent post</a> I discussed one reason that a <a href="https://ai-alignment.com/a-naive-alignment-strategy-and-optimism-about-generalization-56e69b1e09ce">naive alignment strategy</a> might go wrong, by learning to “predict what humans would say” rather than “answer honestly.” In this post I want to describe another problem that feels very similar but may require new ideas to solve.</p><p>In brief, I’m interested in the case where:</p><ul><li>The simplest way for an AI to answer a question is to first translate from its internal model of the world into the human’s model of the world (so that it can talk about concepts like “tree” that may not exist in its native model of the world).</li><li>The simplest way to translate between the AI world-model and the human world-model is to use the AI world-model to generate some observations (e.g. video) and then figure out what states in the human world-model could have generated those observations.</li><li>This leads to bad predictions when the observations are misleading.</li></ul><p>This is distinct from the failure mode discussed in my <a href="https://ai-alignment.com/a-problem-and-three-ideas-800b42a14f66?source=collection_home---4------1-----------------------">recent post</a> — in both cases the AI makes errors because it’s copying “what a human would do,” but in this case we’re worried that “what a human would do” may be <em>simpler</em> than the intended policy of answering questions honestly, even if you didn’t need a predictive model of humans for any other reason. Moreover, I’ll argue below that the algorithm from that post doesn’t appear to handle this case.</p><p>I want to stress that this post describes <em>an example</em> of a situation that poses a challenge for existing techniques. I don’t actually think that human cognition works the way described in this post, but I believe it highlights a difficulty that would exist in more realistic settings.</p><h3>Formal setup</h3><h4>Human world-model</h4><p>I’ll imagine a human who has a simple world model W = (S, P: Δ(S), Ω, O: S → Ω) where:</p><ul><li>S is a space of <em>trajectories</em>, each describing a sequence of events in the world. For example, a trajectory s ∈ S may specify a set of rigid objects and then specify how they move around over time.</li><li>P is a probability distribution over trajectories. It includes both a prior over initial states (cars are probably on the road and fish are probably in the ocean) and a dynamics model that tells us how likely a trajectory is under the laws of physics (most trajectories approximately satisfy Newton’s laws).</li><li>Ω is a space of observations, for example videos.</li><li>O tells you what you would observe for each possible trajectory.</li></ul><p>Let Q be the space of natural language questions and A be the space of answers. Natural language has a simple semantics in the human’s world-model, given by a function Answer: Q × Δ(S) → A. 
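</p><p><em>(A tiny concrete instance of these objects, my own toy example rather than anything from this post, might look like the following: two trajectories, video observations, and a single question.)</em></p><pre># W = (S, P, Ω, O) for a two-trajectory world, plus Answer: Q × Δ(S) → A
S = ["cat stays in room", "cat leaves room early"]
P = {"cat stays in room": 0.3, "cat leaves room early": 0.7}          # P: Δ(S)

def O(s):                                                             # O: S → Ω
    if s == "cat stays in room":
        return "video shows a cat the whole time"
    return "video shows a cat, then an empty room"

def Answer(question, belief):                                         # Answer: Q × Δ(S) → A
    if question == "Is there a cat in the room?":
        p_cat = belief.get("cat stays in room", 0.0)
        return "yes, with probability about %d%%" % round(100 * p_cat)
    return "not sure"

print(Answer("Is there a cat in the room?", P))                       # answers from the prior</pre><p>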
For example, we could have Answer(“Is there a cat in the room?”, p) = “there was until recently, but it probably left just now.”</p><p>Given some observations ω ∈ Ω, an idealized human answers a question q by performing Bayesian inference and then applying Answer to the resulting probability distribution, i.e. HumanAnswer(q, ω) = Answer(q, P(s|O(s) = ω)).</p><p>Of course in practice the human may make errors and will have cognitive limitations. But if we use the kinds of techniques discussed in <a href="https://ai-alignment.com/a-problem-and-three-ideas-800b42a14f66">Teaching ML to Answer Questions Honestly</a>, we could hope to learn something like HumanAnswer instead of the human’s approximation to it.</p><h4>AI world-model</h4><p>Let’s assume that our AI models the world in the same way as the human, but with a different world model W′ = (S′, P′, Ω, O′: S′ → Ω).</p><p>This mismatch may make it unclear how to answer questions. Perhaps the human thinks of the world in terms of rigid objects (so that it makes sense to talk about how many objects are in a bag), whereas the AI models objects as composed of small parts that can move continuously between being separate objects and being parts of a whole. Perhaps the human considers “tree” a fundamental category (so that it makes sense to talk about whether something is a tree) whereas the AI uses a different taxonomy to think about plants and animals.</p><p>Intuitively, the “intended” way for our AI to answer questions is to have a map between its world model and the human model, c: S′ → S. For example, c may group object-parts into objects, or classify plants according to the human taxonomy. That lets it answer questions like “How many trees are in the garden?”</p><p>Slightly more formally, we want the AI to do inference in its own world-model, map the result to the human world model with c, and then apply Answer. That is: IntendedAnswer(q, ω) = Answer(q, c(P′(s′|O′(s′) = ω))).</p><p>(You could imagine extending Answer to S′ directly, but I’ll assume that the simplest way to answer arbitrary questions is to first apply c to get an element of S.)</p><h4>The problem: HumanAnswer</h4><p>Specifying IntendedAnswer requires specifying the correspondence c, which might be quite a complicated object.</p><p>I’m scared that we will instead learn to copy HumanAnswer(q, ω) = Answer(q, P(s|O(s) = ω)), which requires only specifying P and O.</p><p>An AI who uses HumanAnswer is effectively using an indirect mapping between its world-model and the human world-model: it uses its own world-model to predict ω and then uses the human’s beliefs to update on ω and infer a latent state of the human’s world-model. I’ll write c* : S′ → Δ(S) for this indirect mapping: c*(s′) = P(s|O(s) = O′(s′)).</p><h4>Limitations of the human’s world-model</h4><p>I will assume that the human world-model is rich enough that it makes sense to talk about an intended map c: S′ → S. But I think that c(s′) will generally “throw away” some information from s′.</p><p>For example, suppose that the AI models bacteria but the human doesn’t. Then the correspondence c will forget all the information about what bacteria are living where. From the perspective of the human prior, all of the consequences of the bacteria will just look like noise — Alice is “sick” and sicknesses behave in unpredictable ways (where “sick” is ontologically fundamental in the human’s model).</p><p>The situation would be even more extreme if the human didn’t even have concepts like “sick” (e.g.
because we are talking about novel kinds of tiny machines with unfamiliar impacts in the world) — in this case the fact that Alice is coughing may just be noise, and the human model will do a terrible job of predicting consequences. Realistically the human’s world-model has a lot of flexibility to represent states of affairs like “something is weird with Alice that’s causing her to cough a lot” even when we don’t understand <em>why</em> Alice is coughing. One of the approaches I mention below could try to exploit this extra structure, but for most of the post I’ll ignore it.</p><h3>Problem 1: is the “intended answer” actually good enough?</h3><p>Would we actually be happy with an AI that answered questions according to IntendedAnswer? The correspondence c necessarily throws away much of the information in the AI’s world-model before answering questions, and that information may have been critical to evaluating the AI’s plans.</p><p>For a simple but unrealistic example, suppose that the AI’s plan involves constructing tiny machines. After applying the correspondence c to an initial fragment of a trajectory, it <em>looks</em> like humans are in control of the situation. But in reality the tiny machines could easily overpower humans, and computers built out of tiny machines are actually responsible for steering the future.</p><p>In this case, if humans try to evaluate “are we in control of the situation?” it seems like they are in trouble. Even worse, if they try to evaluate “are we safe from harm and free to <a href="https://www.lesswrong.com/posts/7jSvfeyh8ogu8GcE6/decoupling-deliberation-from-competition">deliberate in the way that we want to deliberate</a>?” they might get the answer completely wrong, because tiny machines are actually manipulating them and determining the outcome of deliberation.</p><p>I originally expected this to be a severe problem, but after thinking more carefully I now believe it’s probably OK unless something else (unrelated to ontology mismatches) goes wrong first. This is pretty complicated and I’m certainly not confident in my answer, but I feel good enough about it that it’s no longer the step I’m most worried about.</p><p>Here are some of the key reasons for optimism (continuing the “tiny machines” example as a stand-in for arbitrary features of the world that are thrown away by the correspondence c):</p><ul><li>I’m focused on whether humans can implement the meta-strategy described in <a href="https://ai-alignment.com/the-strategy-stealing-assumption-a26b8b1ed334">the strategy-stealing assumption</a>. That is, they want to keep themselves safe and to ensure that they deliberate well, and other than that they want their AI to maximize option value and ultimately respond to their wishes. If they trust their deliberation, then they will eventually learn about the tiny machines (and everything else in the AI’s ontology), and so can defer to their future selves about whether they actually have real control over the situation.</li><li>Humans have preferences over deliberation <em>expressed in the human ontology</em>. In order to mess up the deliberation, the tiny machines need to have effects that can be expressed in the human ontology. But this gives us an opportunity to detect the problem. For example, suppose that the tiny machines intervene to slightly change the way that human brains work — neurological events that we <em>thought</em> were random are instead slightly biased so as to maximize the number of paperclips that the humans ultimately choose to make.
If our AI understands that these changes are biased towards paperclips (either because it caused the trouble, or because it understands enough to prevent trouble) then we want to ultimately understand that fact. So we can look at a sequence of apparently random events, observe that they are systematically paperclip-biased, and conclude that they fail to capture what we cared about (to the extent that our confidence in deliberation relied on those events being random). This works even if we don’t understand how the tiny machines bring about those changes.</li><li>You may be concerned that our AI knows about the tiny machines but doesn’t know enough detail about what consequences the tiny machines will have — perhaps it’s only looking a few weeks out, but it would take decades for the humans to deliberate and realize that they are unhappy with the outcome. But if our AI is causing trouble with its tiny machines (or is capable of preventing trouble caused by tiny machines), it must be because it is doing some kind of abstract reasoning about the long-term consequences of the machines. So what we really need is an aligned version of that abstract reasoning that lets us answer questions about the long-term impacts of the tiny machines; we still don’t need to talk about alien concepts like the tiny machines and the problem isn’t coming from the ontology mismatch.</li><li>You may be concerned that the AI doesn’t really perform that abstract reasoning itself, and instead that reasoning is carried out by e.g. computers built out of tiny machines (and those computers are themselves invisible in the human ontology). But in that case we still have some combination of (i) the AI is doing abstract reasoning about why the tiny-machine-computer will compute actions that have a certain kind of long-term effect, and (ii) the AI is explicitly simulating the computation done by the tiny-machine-computer, in which case we can just directly translate the knowledge from that computation into a human-legible form. Again, this may pose significant alignment problems but they aren’t actually related to the ontology mismatch.</li></ul><p>My tentative view is that — as long as we are otherwise competitive with our AI — an ontology mismatch isn’t a fundamental problem because events that we can’t understand are only problematic if they have consequences we do understand. (This is a point that many people have brought up over the years when I’ve raised concerns about scenarios like the tiny machines.)</p><p>(<em>This is a traditional problem in the futurist AI alignment community, though I’ve ended up with a more optimistic take: </em><a href="https://arxiv.org/abs/1105.3821"><em>ontological crises</em></a><em>, </em><a href="https://www.alignmentforum.org/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1"><em>model splintering</em></a><em>, </em><a href="https://intelligence.org/files/RealisticWorldModels.pdf"><em>ontology identification</em></a>.)</p><h3>Problem 2: observations can be corrupted</h3><p>HumanAnswer may be simpler than IntendedAnswer, since HumanAnswer only requires specifying the two world-models (S, P, Ω, O) and (S′, P′, Ω, O′) while IntendedAnswer also requires specifying c.</p><p>HumanAnswer infers a distribution over states from the observations ω ∈ Ω. This means that if the observations are corrupted then the answers will be wrong. Corruption may be easy to notice in the AI’s world-model but completely invisible in the human’s world-model.
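</p><p><em>(Here is a self-contained toy version of that failure, my own illustration with a made-up two-bit AI world-model rather than the post’s formalism: the AI’s trajectories track both “cat in room” and “camera hacked,” while the human’s trajectories only track the cat, so conditioning the human model on a cat-video forces it to conclude that there is really a cat.)</em></p><pre>from itertools import product

P_CAT, P_HACK = 0.1, 0.1          # made-up priors; hacking and cats are independent

# AI world-model W': a trajectory is (cat_in_room, camera_hacked).
S_ai = list(product([True, False], repeat=2))
P_ai = {(cat, hack): (P_CAT if cat else 1 - P_CAT) * (P_HACK if hack else 1 - P_HACK)
        for cat, hack in S_ai}

def O_ai(s):                      # O': a hacked camera shows a cat no matter what
    cat, hack = s
    return "cat on video" if (cat or hack) else "empty room on video"

# Human world-model W: no concept of camera hacking at all.
P_h = {"cat": P_CAT, "no cat": 1 - P_CAT}

def O_h(s):
    return "cat on video" if s == "cat" else "empty room on video"

def c(s_ai):                      # intended correspondence: forget the hacking bit
    return "cat" if s_ai[0] else "no cat"

def posterior(prior, observe, omega):
    post = {s: p for s, p in prior.items() if observe(s) == omega}
    z = sum(post.values())
    return {s: p / z for s, p in post.items()}

def prob_cat_human(omega):        # HumanAnswer-style: infer in the human model
    return posterior(P_h, O_h, omega).get("cat", 0.0)

def prob_cat_intended(omega):     # IntendedAnswer-style: infer in W', then map through c
    post = posterior(P_ai, O_ai, omega)
    return sum(p for s, p in post.items() if c(s) == "cat")

print(prob_cat_human("cat on video"))     # 1.0  -- only a real cat can explain the video
print(prob_cat_intended("cat on video"))  # ~0.53 -- the AI knows the video might be faked</pre><p>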
For example, imagine the tiny machines hacking the camera or building a Potemkin village.</p><p>This is problematic because HumanAnswer may believe that deliberation is going well, when in fact deliberation is completely off the rails but our sensors have been corrupted. Indeed, corrupted sensors may allow for levels of (apparent) security that would be unachievable in the real world, so that optimizing for apparent security may predictably lead to sensor hacking.</p><p>Learning HumanAnswer instead of IntendedAnswer closely resembles the problem discussed in <a href="https://ai-alignment.com/a-problem-and-three-ideas-800b42a14f66?source=collection_home---4------1-----------------------">Teaching ML to Answer Questions Honestly</a>, so it’s natural to try to apply the same algorithm. Unfortunately, I don’t think that the algorithm will work as is.</p><h4>What happens when we apply the existing algorithm to this case?</h4><p>Let’s write W′ = (S′, P′, Ω, O′) for the AI’s world model and W′−H = (S′, P′−H, Ω, O′) for a version of the world model with a “human-sized hole” where we can fill in the parameters needed to predict human behavior.</p><p>We want to first learn W′−H, together with the correspondence c: S′ → S and the natural language semantics Answer: Q × Δ(S) → A. Then we can fill in the human H (implicitly including their prior P) using a consistency condition.</p><p>We’ll use the same consistency condition as in the last post, where we have a small dataset of correct answers and then restrict to human-models H for which IntendedAnswer agrees with the predicted human answer on that dataset.</p><p>The next question is whether this model in fact pays for itself. To answer this question we need to precisely define the bad alternative that we are worried about instead.</p><p>The main subtlety is that when specifying both the AI model W′ = (S′, P′, Ω, O′) and the human model W = (S, P, Ω, O), we may be able to save bits by conditioning on the fact that W and W′ are supposed to be describing the same world. For example, we could condition on the two models leading to approximately the same probability distribution over observations for situations in the training set. It’s not clear how to define “approximately” or whether there is any way to make a strong consistency update like this, but let’s conservatively assume that some strong consistency condition between W′ and W holds.</p><p>Now we have two competing models and we are wondering which is simpler:</p><ol><li>[Good] Specify W′−H, c, and Answer. Then fill in H conditioned on agreement between the predicted human answer and IntendedAnswer.</li><li>[Bad] Specify W′−H and Answer. Then fill in H and W conditioned on both (i) agreement between the predicted human answer and HumanAnswer, (ii) consistency between W and W′.</li></ol><p>These models share most of their complexity. The main difference is that the good model needs to specify c, while the bad model needs to specify W conditioned on consistency between W and W′.</p><p>It’s not easy to compare these complexities without being more precise, e.g. by considering concrete examples of W′, W and c. But abstractly it looks very plausible for the bad model to be simpler.</p><p>W′ and the mapping c: S′ → S mostly imply a world model W. So the question is roughly whether the distribution over world-models implied by choosing c is better or worse than the simplicity prior over world-models.
But we could imagine the human learning their world model essentially by searching for simple world-models that give good probability distributions in realistic cases, in which case the simplicity prior over world-models is essentially optimal (after conditioning on consistency in simple cases). At the same time, there’s no particular reason to think that picking c at random would give rise to a similar prior, and so it may just be much worse.</p><p>So it looks to me like there’s a real problem here. And at any rate, it seems like a useful class of possibly-hard examples to have in mind when thinking about the hopes for the <a href="https://ai-alignment.com/a-naive-alignment-strategy-and-optimism-about-generalization-56e69b1e09ce">naive training strategy</a>.</p><h3>Possible approaches</h3><h4>Hope that c is simple after all</h4><p>I’m not very confident about my argument that IntendedAnswer is likely to be more complex than HumanAnswer, and even if the argument is technically correct there could be simple approaches to foil the argument.</p><ul><li>Maybe specifying the correspondence c is actually a totally fine way to implicitly specify the human’s world model W. For example, maybe we can first specify c, then specify W subject to a consistency condition between W and c(W′), and the complexity of c will pay for itself as a way of constraining W (maybe after imposing some additional simple consistency conditions on c).</li><li>Maybe we can specify c after specifying W and W′, and after imposing consistency conditions the intended mapping c won’t be too complex. This feels a bit implausible but like I said the whole argument feels kind of shaky.</li><li>Apart from potential “outs,” the argument above just doesn’t feel that solid. Having confirmed that the problem isn’t easily solved, it would probably be worth spending a good amount of time digging into specific examples to confirm that the abstract story is basically correct and that there is a real problem here.</li></ul><h4>Continuity and commutativity</h4><p>Both the human world-model W and the AI world-model W′ can model situations where observations are corrupted (though the corruption may look like “random noise” in W even though it has a mechanistic explanation in W′).</p><p>Let’s write REAL ∈ S, REAL′ ∈ S′ for a situation where some event is actually happening (there is really a cat in the room) and FAKE ∈ S, FAKE′ ∈ S′ for the situation where my observations have been corrupted (the camera has been hacked to show a video of a cat).</p><p>The intended mapping c sends REAL′ → REAL and FAKE′ → FAKE. The indirect mapping c* instead sends both REAL′ and FAKE′ → REAL.</p><p>Taken in isolation it may be hard to tell which of these mappings is “correct.” But intuitively REAL′ is “near” a bunch of other situations where the cat is in a slightly different place, while FAKE′ is “near” a bunch of other situations where the camera is hacked in different ways, or where only some observations are hacked and so we can see that something weird is happening.</p><p>The intended mapping feels continuous, since states near FAKE′ get mapped to states near FAKE.
But the indirect map behaves very strangely in this topology: FAKE′ itself gets mapped to REAL, but states near FAKE′ get mapped somewhere very different (for example a nearby state where the hacking is visually noticeable would get mapped somewhere close to FAKE).</p><p>Inspired by that intuition, we could hope to learn the intended mapping by imposing some kind of continuity condition.</p><p>I’m not sure if “continuity” itself is a natural property, but I think we can do something similar in general by considering a space T of transformations described in natural language (including things like “suppose the camera rotated a little bit to the left”) and considering maps f′: T × S′ → S′ and f: T × S → S. Then we can learn f, f′ and c subject to constraints similar to c(f′(t, s′)) = f(t, c(s′)).</p><p>I find this approach pretty intuitively appealing, in part because talking about counterfactuals is very natural when imagining constraining c. But I haven’t been able to see how to make anything like it work. The basic problem is that it’s very hard to learn any kind of relationship or transformation in the latent space S′ of the AI’s world-model (without just using the indirect map c* to transport relationships from the human world-model).</p><p>Continuity may be able to work if we assume that some structural property of S′ corresponds to continuity (e.g. changing a few bits of a trajectory s′ corresponds to a “small” change in the trajectory), but those assumptions seem very brittle.</p><h4>Make better observations</h4><p>HumanAnswer only agrees with the training data because it infers a latent state from <em>all</em> of the observations that are used in training. If the process of observation-gathering were sufficiently complex — either in description complexity or computational complexity — then IntendedAnswer could end up being simpler than HumanAnswer.</p><p>I think that the most realistic way to achieve this is to have AI helpers set up additional observations / do information-gathering to help us make better inferences about the latent state. If these AI helpers are as complex as the agent we are training, and if they are able to use all that complexity to design increasingly ingenious observations, then it’s plausible that the observation process itself will be complex enough that the intended mapping will win (this is the usual picture of <a href="https://medium.com/@deepmindsafetyresearch/scalable-agent-alignment-via-reward-modeling-bf4ab06dfd84">recursive reward modeling</a> but <em>not</em> part of my usual picture of amplification — I’m usually discussing amplification as involving only recursive pen-and-paper thinking to make a single top-level decision about what to do in the real world).</p><p>Some thoughts on this approach:</p><ul><li>It’s not clear how to delegate observation. Suppose that my AI has a good understanding of tiny machines, and could build tiny machines to collect information (or to construct sophisticated microscopic sensors). To delegate this task to my AI I need to be able to ask it questions about the relationship between those sensor readings and the actual state of the world, despite me not understanding the dynamics by which the tiny machines operate. I don’t have a good sense of whether this can be made to work by inductively constructing better and better helpers; it seems possible but far from certain.</li><li>It’s not clear if you can make information-gathering scalably complex.
For example, it may be that even a very smart AI is still essentially limited to building some number of increasingly tiny cameras, and that the game doesn’t change qualitatively as the helper AI gets smarter. In this case, the intended policy just needs to hard-code some upper bound on how much my helper AIs and I can observe (which may still be much simpler than the intended mapping). If I imagine this working out I think it comes down to some assumption about offense vs defense for building sensors vs manipulating them, but I’m not sure how to even articulate that assumption and would need to think about it a lot.</li><li>No matter how complex you make information-gathering, I suspect there are still simple strategies that abstract out details of the information-gatherers (e.g. by treating themselves as a model for the information-gathering assistants and computing the equilibria of the game). So my guess is that this approach should either go all the way, in the sense that there is <em>no</em> way for an attacker to fool the sensors (i.e. defense has an advantage over offense), or it will only get you a modest constant edge for the intended policy.</li><li>Aesthetically, it seems quite weird to be in the situation where our model “knows” about what is going on in the world, but we need to construct physical sensors in order to extract that information. I would feel surprised if that was the only solution to the problem.</li></ul><h4>Represent c in a different way</h4><p>So far I’ve discussed representing c as a neural network that maps between two spaces, and taking steps to try to get the simplicity prior to yield a good distribution over correspondences. I think that sometimes we need to rely on this kind of opaque mapping, but in this case we may be able to avoid it:</p><ul><li>c maps into a “human-comprehensible” space S. So it feels particularly plausible that we could represent c itself in a human-legible way, or as a combination of an opaque neural network and some legible data, and then apply <a href="https://www.alignmentforum.org/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1">imitative generalization</a> or a similar approach.</li><li>Given two different reasonable ways of modeling the world, it feels like there should be a mapping between them. But it’s not like someone just told us “there exists a mapping from S′ → S” — we may get a lot of other structure for free if we think about <em>why</em> we think that the mapping c exists. Just considering the argument more carefully may suggest a different approach. For example, we could try to think about both S and S′ as being downstream of some unobserved “real” world S*, and try to take approaches that go <em>up</em> from S′ → Δ(S*) → Δ(S) rather than going <em>down</em> from S′ → Ω → Δ(S).</li><li>The indirect mapping c*: S′ → Δ(S) is defined in terms of the observation functions O, O′ and the human’s prior P. If we had to specify a mapping c : S′ → S <em>before</em> specifying any of {O, O′, P}, then intuitively it seems like the intended mapping may be significantly simpler.</li></ul><p>I don’t see how to turn any of these ideas into an algorithm but all of them seem vaguely plausible and worth thinking about.</p><h4>Don’t implement c</h4><p>A more radical change is to give up on implementing the mapping c as is and just represent Answer in a different way.
For example, if I ask whether Alice “is sick,” but the real state of the world involves a small number of unusual bacteria living in Alice’s lungs that won’t cause sickness, it’s not clear whether c needs to round that to “sick” or “not sick” — it would be better if our AI were to talk about the <em>kind</em> of nuance that is involved even if it can’t explain the full model. I don’t think this is necessary in order to make the system safe, but the fact that it feels like the “right” behavior still gives me pause about trying to learn the intended mapping and suggests that it may be possible to think of a totally different approach.</p><p>This is a bit of a subtle distinction, because our intended question-answerer never <em>explicitly</em> implements c, it only implements the composition Answer(q, c(·)). The main point is that we could imagine asking the model to do something very different from just answering questions in our own ontology, saying something about the nature of the correspondence (even if we can’t go all the way to the naive imitative generalization solution of making the correspondence itself human-legible).</p><h3>My current state</h3><p>Overall this problem feels quite similar to <a href="https://ai-alignment.com/a-problem-and-three-ideas-800b42a14f66">my last post</a>; I suspect that in the end there will be a single set of ideas that handles both problems, but that it may look quite different from what I would have proposed without considering this kind of ontology mismatch.</p><p>My next step will be to spend a bit of time thinking about this group of problems, looking for some approach that looks like it could <em>plausibly</em> work. Right now I feel like there are a lot of threads to pull on, but none of them look very easy and some of them could result in very big algorithmic changes.</p><p>If I find something that looks plausible I’ll probably return to exploring other related problems. I’m generally prioritizing fleshing out examples because I want to avoid going down a rabbit-hole on an easy problem while the real difficulty is elsewhere, and because I hope that having a larger library of examples will tend to lead to cleaner and more general solutions.</p><p>Sometimes it can feel like this cluster of problems is just a restatement of the whole alignment problem — like I’m just asking the same old questions with a slightly different framing. But on reflection I do feel like this is a healthier set of questions:</p><ul><li>These examples ignore a lot of issues while still leading to catastrophic outcomes, so I think they are in fact isolating a small part of the problem. For example, these examples don’t talk at all about agency, high stakes and the need for reliability, or human preferences. But those are some of the central concepts in typical discussions of alignment, so removing them really does change the discussion.</li><li>I think it’s possible to make these cases arbitrarily concrete by filling in more and more details of the human and the AI models. Moreover, I think the problem currently looks soluble without requiring further (vague) assumptions about human reasoning or preferences. I think that’s a really good place to be, and pretty uncommon in alignment.</li><li>I think it’s important that we only need to solve Problem 2 (handling corrupt observations) and not Problem 1 (talking about alien concepts). I think this is a lot of what makes the problem concrete + tractable.
It also means that we are thinking about a different aspect of this “ontology identification” problem than people usually discuss in AI alignment.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=6f9c1d688f5f" width="1" height="1" alt=""><hr><p><a href="https://ai-alignment.com/answering-questions-honestly-given-world-model-mismatches-6f9c1d688f5f">Answering questions honestly given world-model mismatches</a> was originally published in <a href="https://ai-alignment.com">AI Alignment</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A naive alignment strategy and optimism about generalization]]></title>
            <link>https://ai-alignment.com/a-naive-alignment-strategy-and-optimism-about-generalization-56e69b1e09ce?source=rss----624d886c4aa4---4</link>
            <guid isPermaLink="false">https://medium.com/p/56e69b1e09ce</guid>
            <dc:creator><![CDATA[Paul Christiano]]></dc:creator>
            <pubDate>Thu, 10 Jun 2021 00:01:44 GMT</pubDate>
            <atom:updated>2021-06-10T00:01:43.995Z</atom:updated>
            <content:encoded><![CDATA[<p>(<em>Context: my </em><a href="https://ai-alignment.com/a-problem-and-three-ideas-800b42a14f66"><em>last post</em></a><em> was trying to patch a certain naive strategy for AI alignment, but I didn’t articulate clearly what the naive strategy is. I think it’s worth explaining the naive strategy in its own post, even though it’s not a novel idea</em>.)</p><p>Suppose that I jointly train an AI to do some task (e.g. make money for me) and to answer a wide range of questions about what is happening in the world (e.g. “why did Alice just wire $1000 into my bank account?” or “what is Bob thinking right now?”).</p><p>I generate training data for the QA task in a really simple way: I choose a subset of questions that humans are able to reliably answer, and use those as a training set for supervised learning. I’ll call this the<strong> naive training strategy.</strong></p><p>I’d <em>like</em> for my AI to tell me everything it knows. If the AI bought a stock because it expects a merger announcement soon, I want it to tell me about the predicted merger announcement. If the AI predicts a merger announcement because it inferred that executives of the companies have been in extensive talks over the last month, I want it to tell me about those talks.</p><p>I’m not asking the AI to <em>explain why it made a given decision</em>, I’m asking the AI to <em>tell me as much as it can about the world</em>. The important property is that if the AI “knows” something and uses that knowledge to perform the task well, then it <em>also</em> uses that knowledge to answer questions well.</p><p><strong>Why might this work?</strong> The hope is that “answer questions honestly to the best of your ability” is a natural thing for our AI to learn — that there is some simple way to translate from the AI’s model of the world into natural language and to honestly report what it believes. If our training dataset is good, then this policy will score well, and we can hope that SGD will find it. I’ll call this the <strong>intended policy</strong>.</p><p><strong>Why might this not work?</strong> The concern is that “predict how a human would answer questions” is <em>also</em> a natural thing for our AI to learn, especially if the AI is doing a task that already requires predicting humans. Predicting humans also gets a low loss on the training set, but it generalizes poorly once we start asking our AI questions that a human couldn’t have answered on their own.</p><p>If we’re worried about this we could use a different loss function than “predict what humans would say.” But regardless of what loss function we choose, the policy could still be trying to game the loss function. If SGD learns to “game the loss function” then we’ll generalize poorly in any case where humans can’t <em>distinguish</em> a good question-answerer from a bad question-answerer.</p><p>I’ll call “game the loss function” the <strong>instrumental policy</strong>. It answers questions well only because doing so is instrumentally useful for getting a low loss, and so it will start answering questions badly if that situation changes. (This is closely related to <a href="https://intelligence.org/learned-optimization/">deceptive alignment</a>, but I think the concept makes sense without defining alignment or optimizers; there’s also a difference in emphasis in that I’m often considering models which are “intrinsically motivated” to game the loss function rather than doing so in order to gain influence. 
I first introduced this term in <a href="https://ai-alignment.com/inaccessible-information-c749c6a88ce">Inaccessible Information</a>.)</p><h4>What I’m doing now</h4><p>I’m trying to dig into a bunch of reasons why the naive training strategy might fail, and to understand whether there is a way to modify the naive strategy to avoid those problems.</p><p>In my <a href="https://ai-alignment.com/a-problem-and-three-ideas-800b42a14f66">last post</a> I discussed two reasons that the naive training strategy might learn the instrumental policy instead of the intended policy:</p><ol><li>There may be predictable errors in the training data, such that the instrumental policy actually gets a <em>lower</em> loss than answering honestly (because it responds strategically to errors).</li><li>If the AI already needs to make predictions about humans, then it can reuse that machinery to game the loss function, whereas it may have to build new machinery to translate its thoughts into natural language.</li></ol><p>I tried to describe a modified algorithm that avoids those pitfalls. I don’t think I succeeded, but I do feel reasonably optimistic that <em>some</em> approach can address these two problems.</p><p>Unfortunately, there are many further reasons that the naive training strategy could fail. I’m currently spending time trying to understand those issues and figure out which if any are the most likely to represent fundamental roadblocks.</p><h4><strong>Where others stand</strong></h4><p>I’ve had a lot of conversations about alignment with ML researchers over the last 7 years. My impression is that a large fraction of optimists expect to find <em>some</em> strategy to get smart enough models to generalize in the intended way — honestly reporting their beliefs — rather than by learning to game the loss function.</p><p>The fact that this is part of the “consensus” optimism about alignment makes it particularly appealing to investigate cases where it seems hard to get the intended generalization.</p><p>On the other extreme, I think many alignment researchers are very skeptical about this entire project. I think the main way I disagree with them is methodological: before concluding that the problem is hard I want to try to find the simplest hard case.</p><h4><strong>Relationship to my other work</strong></h4><p>I’ve traditionally avoided the naive training strategy, in large part because I’m scared that it will learn the instrumental policy instead of the intended policy.</p><p>I still believe that you need to do something like <a href="https://ai-alignment.com/iterated-distillation-and-amplification-157debfd1616">iterated amplification</a> and <a href="https://www.alignmentforum.org/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1">imitative generalization</a> in order to avoid this problem. However, I think that a working strategy that combines all of these ideas may have more in common with the naive training strategy than I’d initially expected.</p><p>For example, I now think that the representations of “what the model knows” in imitative generalization will sometimes need to use neural networks to translate between what the model is thinking and human language. Once you go down that road, you encounter many of the difficulties of the naive training strategy.
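</p><p>To make the naive training strategy concrete, here is a minimal, purely illustrative sketch (the architecture, dataset fields, and losses below are stand-ins rather than anything from this post): a single network is trained with a task loss plus a supervised QA loss computed only on questions humans can reliably answer, and nothing in the objective itself distinguishes “answer honestly” from “predict what the human labeler would say.”</p><pre>
# Illustrative sketch only: jointly train on a task and on supervised QA,
# where QA labels exist only for questions humans can reliably answer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointModel(nn.Module):
    def __init__(self, obs_dim=512, q_dim=64, d=256, vocab=10_000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim + q_dim, d), nn.ReLU())
        self.task_head = nn.Linear(d, 1)      # e.g. predicted profit from a trade
        self.qa_head = nn.Linear(d, vocab)    # toy one-token answer to a question

    def forward(self, obs, question):
        z = self.encoder(torch.cat([obs, question], dim=-1))
        return self.task_head(z).squeeze(-1), self.qa_head(z)

model = JointModel()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

def training_step(obs, question, task_target, answer_token):
    task_pred, answer_logits = model(obs, question)
    # The "naive" part: simply add the task loss and the human-labeled QA loss.
    loss = F.mse_loss(task_pred, task_target) + F.cross_entropy(answer_logits, answer_token)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
</pre><p>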
This overlap is an update in my view; I’ll likely go into more detail in a future post.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=56e69b1e09ce" width="1" height="1" alt=""><hr><p><a href="https://ai-alignment.com/a-naive-alignment-strategy-and-optimism-about-generalization-56e69b1e09ce">A naive alignment strategy and optimism about generalization</a> was originally published in <a href="https://ai-alignment.com">AI Alignment</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Teaching ML to answer questions honestly instead of predicting human answers]]></title>
            <link>https://ai-alignment.com/a-problem-and-three-ideas-800b42a14f66?source=rss----624d886c4aa4---4</link>
            <guid isPermaLink="false">https://medium.com/p/800b42a14f66</guid>
            <dc:creator><![CDATA[Paul Christiano]]></dc:creator>
            <pubDate>Fri, 28 May 2021 17:18:50 GMT</pubDate>
            <atom:updated>2021-06-09T16:41:34.740Z</atom:updated>
            <content:encoded><![CDATA[<p>In this post I consider the problem of models learning “predict how a human would answer questions” instead of “answer questions honestly.” (A special case of the problem from <a href="https://ai-alignment.com/inaccessible-information-c749c6a88ce">Inaccessible Information</a>.)</p><p>I describe a possible three-step approach for learning to answer questions honestly instead:</p><ol><li>Change the learning process so that it does not have a strong inductive bias towards “predict human answers,” by allowing the complexity of the honest question-answering to “pay for itself” by constraining the space of possible human-models.</li><li>Introduce a bias towards the intended model by using a more complex labeling process to answer questions where a human answers incorrectly.</li><li>Be really careful to avoid penalizing honest answers, by only judging comparisons between two answers where we are confident one is better than the other and getting the model to help us.</li></ol><p>I don’t know whether this problem is a relatively unimportant special case of alignment, or one of the core difficulties. In any case, my next step will be trying to generate failure stories that definitely <em>cannot </em>be addressed by any of the angles of attack I know so far (including the ones in this post).</p><p>I think it’s relatively unlikely that almost anything specific I said here will really hold up over the long term, but I do think I’ve learned <em>something</em> about each of these steps. If the ideas end up being important then you can expect a future post with a simpler algorithm, more confidence that it works, clearer definitions, and working code.</p><p>(<em>Thanks to Ajeya Cotra, David Krueger, and Mark Xu for discussions about this post that helped clarify it</em>.)</p><h3>The problem</h3><p>Suppose that we train a model to answer questions in natural language about what will happen in the future (“Will Alice take the train home tonight?”) given data about what’s happened in the recent past (e.g. video recorded over the last few days).</p><p>To generate training data we take the video from time <em>t</em>₀ to <em>t</em>₁ and get humans to answer questions about what happened between time <em>t</em>₁ and <em>t</em>₂. Then we do supervised learning on these (video, question, answer) triples.</p><p>The <strong>intended model</strong> has two parts: (i) a model of the world (and inference algorithm), (ii) a translation between the world-model and natural language. The intended model answers questions by translating them into the internal world-model.</p><p>We want the intended model because we think it will generalize “well.” For example, if the world model is good enough to correctly predict that someone blackmails Alice tomorrow, then we hope that the intended model will tell us about the blackmail when we ask (or at least carry on a dialog from which we can make a reasonable judgment about whether Alice is being blackmailed, in cases where there is conceptual ambiguity about terms like “blackmail”).</p><p>We want to avoid models that generalize “badly,” e.g. where the model “knows” that Alice is being blackmailed yet answers questions in a way that conceals the blackmail.</p><p>One source of generalization failures is the <strong>instrumental model</strong>, which involves: (i) a world-model similar to the intended one, (ii) knowledge about the training-data-generating process, within that world model.
The instrumental model answers questions by predicting what answers the training-data-generating process would produce. This can generalize badly to any situation where the humans generating the training data are wrong (e.g. because they are being deceived or manipulated).</p><p>Why might a neural net end up learning the instrumental model? There are two big reasons:</p><ol><li>The instrumental model may be much simpler than the intended model. Both of them contain the same world model. If that world model is good enough, then it also knows how to make predictions about humans and understands a lot about the training process. From there, the instrumental policy just needs to point to a relatively simple part of the model: what gets inserted into the training set? In contrast, the intended policy needs to bake in a ton of rules for translating between the internal model and natural language. Doing this naively requires encoding all of the meaning of natural language, which has a lot of incompressible data. It’s not clear if there is a more economical way to represent the intended model (without recreating the pathologies of the instrumental model).</li><li>The instrumental model is guaranteed to accurately fit the training distribution, while the intended model might disagree if the humans make a mistake. If we keep collecting training data indefinitely, sometimes make mistakes, and don’t do something clever to fix or ignore them, then we will end up converging to the instrumental model no matter how good our inductive bias is.</li></ol><h4>Aside on imitative generalization</h4><p><a href="https://www.alignmentforum.org/posts/JKj5Krff5oKMb8TjT/imitative-generalisation-aka-learning-the-prior-1">Imitative generalization</a> suggests a possible strategy for overcoming problem 1: learn a neural net which is only a world-model, and then use a human prior over “translations” that relate the world model to human concepts (instead of training a neural net that also does the translating, and so implicitly using the neural net prior over translations).</p><p>In this post I’m going to mostly ignore that idea and think about neural nets that implement both world-model and translation, allowing them to be integrated tightly together. That means we need a different solution to problem 1. I still think imitative generalization is part of the solution, but in the context of this particular problem I think we just need to improve the prior without changing the representation. So throughout this post you should completely ignore imitative generalization and read me as taking a more traditional perspective.</p><h3>Step 1: make the intended model pay for itself</h3><p>(<em>I think this is the most confusing/ambitious/uncertain step</em>.)</p><h4><strong>Intuition</strong></h4><p>The intended model is complex because it needs to specify how to use natural language to talk about the world model.</p><p>The intended model loses to the instrumental model because in some sense this information was <em>redundant — </em>there are already humans in the world model, and knowing enough about humans to predict their behavior means the world model already knows how to ground out natural language.</p><p>Put differently, the intended model redundantly specifies (humans) and (language), whereas the instrumental model specifies (humans) and infers (language) from (humans).
This is bad because we get out a predictive model of humans, who generalize poorly.</p><p>What I want to do instead is specify (language) and then infer much of (humans) from (language). The hope is that we then get a simple model of language use, which generalizes better.</p><p>My hope in step 1 is to remove the inductive bias in favor of the instrumental model; we’re not going to introduce any kind of positive bias towards the intended model.</p><h4>Rough plan</h4><p>I want to consider models that learn to predict <em>both</em> “how a human will answer question Q” (the instrumental model) and “the real answer to question Q” (the intended model). These two models share almost all of their computation — which is dedicated to figuring out what actually happens in the world. They differ only when it comes time to actually extract the answer. I’ll describe the resulting model as having a “world model,” an “instrumental head,” and an “intended head.”</p><p>We start by collecting a high-quality dataset, in which we believe that the intended and instrumental models agree.</p><p>We’re then going to imagine filling in the parameters of our learned model in two stages:</p><ul><li>In the first stage we will learn both the intended and instrumental heads, as well as a small part of the world model (but none of the facts about how humans answer questions).</li><li>In the second stage we will fill in the remainder of the world model (including all of the facts about how humans answer questions). In this stage, we <strong>condition</strong> on the instrumental head and the intended head agreeing on the training set.</li></ul><p>You can think of this as a generative model where we first fill in stage 1 randomly, then we fill in stage 2 randomly but do it over and over again until we happen to get a set of parameters for which the two heads agree. Then the learning process tries to infer the parameters that were most likely to give rise to the training data.</p><p>A simplicity prior could potentially have learned this two-stage model directly. Namely, it could specify the bits of the first stage, and then it could use the “heads agree” assumption to compress the second stage. The problem is that the decompression is very expensive (it involves solving a hard optimization problem in stage 2, which is now done <em>at test time</em>), and so this involves learning a computationally complex model. That’s ineffective at test time, and it also means that the learning process will be slow.</p><p>(There might be some way that a simplicity prior could have achieved this compression even with test-time compute constraints, but I can’t see a way in general and so I wouldn’t want to bank on it.)</p><p>This two-stage learning algorithm fixes the test-time slowness, but learning is still computationally expensive. But I think it’s good to move the hardness out of the learned model and into the learning algorithm — that lets us treat the whole thing as an algorithmic problem, of finding an efficient way to compute the posterior for this kind of generative model. I’m quite optimistic about this problem, which I’ll discuss in a few sections (where we’ll also define the algorithm more precisely in the context of SGD over neural nets rather than random program search). 
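</p><p>As a toy picture of the two-stage sampling process described above (the shapes, the agreement tolerance, and the way parameters are split between the stages are all invented for illustration), stage 1 draws a small part of the model plus both heads, and stage 2 keeps redrawing the rest until the two heads agree on the training set:</p><pre>
# Toy rejection-sampling picture of the two-stage generative model.
# Everything here is a stand-in; it only illustrates the structure.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))   # stand-in for the high-quality training inputs

def two_head_outputs(theta1, theta2, x):
    # theta1: stage-1 parameters (a small part of the world model plus both heads)
    # theta2: stage-2 parameters (the rest of the world model)
    hidden = np.tanh(x @ theta1["world_part"] + x @ theta2)
    return hidden @ theta1["intended_head"], hidden @ theta1["instrumental_head"]

def sample_stage1():
    return {"world_part": rng.normal(size=(8, 4)),
            "intended_head": rng.normal(size=4),
            "instrumental_head": rng.normal(size=4)}

def sample_stage2_given(theta1, tol=0.1, max_tries=10_000):
    # Stage 2: redraw the remaining parameters over and over, keeping a draw
    # only if the two heads agree on the training set (the "conditioning").
    for _ in range(max_tries):
        theta2 = rng.normal(size=(8, 4))
        a, b = two_head_outputs(theta1, theta2, X)
        if np.mean((a - b) ** 2) > tol:
            continue   # heads disagree: reject this draw
        return theta2
    return None   # agreement is rare, which is exactly where the bits get saved

theta1 = sample_stage1()
theta2 = sample_stage2_given(theta1)
</pre><p>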
But first, let’s talk about why this whole thing might work if you could actually implement it.</p><h4>Why might this work?</h4><p>For concreteness, let’s say that the world model requires a trillion (“N”) bits to specify, the intended head costs 10,000 bits, and the instrumental head costs 1,000 bits. If we just applied a simplicity prior directly, we expect to spend N + 1,000 bits to learn the instrumental model rather than N + 10,000 bits to learn the intended model. That’s what we want to avoid.</p><p>Under the alternative proposal, suppose we spend N/2 + 11,000 bits in the first stage to specify half of the world model and both heads. If we didn’t do rejection sampling in the second stage we’d need to spend N/2 more bits, for a total cost of N + 11,000 bits. That’s even worse.</p><p>However, if our dataset was rich and high-quality then we are able to save many bits in the second stage. If the dataset is high-quality then the intended model results in the two heads agreeing. But if the dataset is rich, then virtually all models result in the two heads disagreeing.</p><p>The magnitude of savings in the second stage is basically equal to the (log) <em>fraction</em> of possible world models that cause the instrumental head to agree with the intended head.</p><p>How common is agreement? If the instrumental head is basically reporting human answers to questions, and if the human model is basically unspecified in the first stage, then this is the fraction of human models that use natural language “correctly” on the training set. If the training set is rich enough, this is quite close to the fraction of human models that use natural language “correctly” overall.</p><p>Now we have two priors over ways to use natural language: we can either sample the intended head at random from the simplicity prior (which we’ve said has probability 2^{-10,000} of giving correct usage), or we can sample the environment dynamics from the simplicity prior and then see how humans answer questions. If those two are equally good priors, then only 2^{-10,000} of the possible humans would have correct usage, so conditioning on agreement saves us 10,000 bits.</p><p>So now we spend (N/2 + 11,000) + (N/2 − 10,000) bits altogether, for a total of N + 1,000. This is comparable to the instrumental model. There may be other ways to learn even more efficiently, but prima facie this suggests that our two-headed model is roughly as easy to learn as the instrumental model.</p><p>We’ve learned a model with two (symmetrical) heads. On the training set they agree, but in order to generalize well it’s important to us that we use the “right answer” rather than the “predict what a human would say.” In section 2 we’ll describe how to provide a small boost to the intended model — this both breaks the symmetry between the two heads, and allows the two-headed model to be actively <em>preferred</em> to the instrumental model.</p><p>This argument doesn’t work if random world models actually induce a <em>better</em> prior over language use than the simplicity prior. (As a silly example: consider the limit where we follow all of physics from the big bang and wait for the evolution of language + language-users to take over the world. Then the prior from a random world model seems likely to be <em>much</em> better.) 
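</p><p>To keep the accounting from this section in one place, here is the same arithmetic written out (using the same illustrative numbers as the text):</p><pre>
# Illustrative description-length bookkeeping from this section.
N = 10**12                 # bits to specify the world model
intended_head = 10_000     # bits for the intended (translation) head
instrumental_head = 1_000  # bits for the instrumental (predict-the-human) head

plain_instrumental = N + instrumental_head   # what a plain simplicity prior favors
plain_intended = N + intended_head           # the model we wanted, 9,000 bits worse

stage1 = N // 2 + intended_head + instrumental_head  # half the world model plus both heads
stage2 = N // 2 - intended_head                      # the rest, minus bits saved by conditioning
two_headed = stage1 + stage2

assert two_headed == plain_instrumental      # N + 1,000: roughly competitive again
</pre><p>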
I think those cases (where a random world model induces a better prior over language use) are totally fine when we combine this idea with iterated amplification, but I don’t want to get into that complexity here since I think it’s mostly orthogonal to the novel algorithm.</p><h4>The actual algorithm</h4><p>We’re trying to learn a neural network that takes some data (e.g. video) as input and has two question-answering heads. We’ll parameterize these networks by a vector of real numbers θ. We can think of SGD as doing regularized regression where the regularizer is the L2 distance from the initialization θ₀ (the only difference is that SGD linearizes the loss). Write C(θ) for the average difference between the two heads on the training set, such that we’ll want to condition on C being small.</p><p>Then we can describe this kind of two-stage learning process as the following generative model:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M4A2dL-tdj_hVxAp3n2u8g.png" /></figure><p>Where λ is some large parameter (that needs to be finite only to keep the optimization smooth so that we can use SGD).</p><p>Basically this says that you draw θ₁ from the prior, then you draw θ₂ from the prior after conditioning on the two heads agreeing.</p><p>Write L(θ) for the log probability of the answers using the first head given parameters θ. Then the unnormalized posterior log probability of (θ₁, θ₂) is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7GzOWcyAX1NppTwK4kVVVg.png" /></figure><p>Intuitively: the first term L is fitting the data, the next two terms are the prior, and the fourth term C is expressing the agreement condition. The final term with the integral represents the bits saved by rejecting all the inconsistent models — if there are very few consistent models then the integral will be small and many bits will be saved.</p><p>We’re interested in either finding the parameters with maximum posterior probability (MAP) or sampling from the posterior. The first four terms are straightforward enough to handle, but naively the last term requires two nested loops of optimization — for each value of θ₁ that we consider, we need to do an expensive optimization to figure out the value of the integral.</p><p>I’ll write a longer post about this algorithmic issue in the future, but I’m actually fairly optimistic that we can do inference despite the nested integrals. The basic reason to have hope is that the optimal values of θ₂′ change slowly as a function of θ₁, so we can do alternating gradient steps instead of running SGD again to compute each update to θ₁.</p><p>Some remarks:</p><ul><li>This two-stage optimization process is basically the same as the GAN situation, where you also use the assumption that the optimal discriminator changes slowly in order to avoid nested optimization. I’m optimistic that optimization in this case can be made significantly easier/stabler and can be done while remaining competitive with traditional SGD. I think we can <em>almost</em> optimize this quantity by updating θ₂ by SGD as normal and updating θ₁ at each step to minimize the difference C(new θ₂) − C(old θ₂).</li><li>Most of the difficulty of the integral comes from the need to integrate over the entire space rather than just a neighborhood of the current parameters. But what we actually care about is explanatory power <em>along the trajectory of SGD</em> — we can view this as compressing the model by compressing the individual gradient descent steps, rather than trying to compress the full model.
This makes me significantly more confident that there is a fast/simple way to do what we want to do, and I think it is going to lead to much more robust algorithms. Overall I don’t feel too concerned about the algorithmic difficulty posed by this setup.</li><li>When differentiating the consistency test C we should treat the intended head as fixed rather than differentiating through it. This removes SGD’s incentive to achieve consistency by e.g. making sure the world is simple and so all questions have simple answers. I currently feel very uneasy about the situation. The algorithm looks a lot more likely to work when we make this change, but it makes it harder to reason about the algorithm globally and I’m very scared about the fact that we can’t do an analogous trick in the program search case. I think the best hope may be to take the gradient-compression perspective suggested in the last bullet point. Overall this may amount to a significant change in this algorithm.</li><li>There seems to be a hard limit on the bits we can “save” — we could always make the two heads equal by simply changing the instrumental head to be equal to the intended head (or vice versa). Even if we tried to rule this out by preventing θ₂ from messing with the intended head, I think that SGD could probably route around this (e.g. by physically implementing the intended head within the world model). So I think we should <em>at best</em> expect to recoup all the bits we spent specifying the intended head. I don’t believe this algorithm can introduce an inductive bias in <em>favor</em> of the intended model; it can just level the playing field.</li></ul><h3>Step 2: Give the intended model a leg up</h3><p>At best, step 1 merely removes one inductive bias in favor of the instrumental model. But we are still left with a massively underspecified learning problem; we could end up with many different functions that all agree on the training set and have approximately equal prior probability. In order to learn the intended model, we need to somehow specify which one we really wanted. In this section I’ll describe a possible approach.</p><p>Rather than using a single labeling process, we will have two labeling processes, a “simple” one and a more reliable “complex” one. The idea is to use the “complex” labeling process to identify examples where the “simple” process errs. This hopefully rules out the “simple” labeling process as a good model of the data, leaving us with the intended model as the best one.</p><p>More precisely:</p><ul><li>In the “simple” labeling process we simply ask a human (embedded in the world) to answer questions about what is happening and put it into the dataset.</li><li>In the “complex” data-generating process we record a bunch of data about the world (camera footage, answers from many different humans, random traces from computer programs…) and then we make our best effort to figure out the right answer. This may involve consulting ML systems for help (as in iterated amplification). We can also use this process to collect hypothetical data, by using a generative model to make predictions and then applying our best efforts to the hypothetical.</li><li>We have a “small” dataset where we believe the simple labeling process arrives at the correct answer.</li><li>We have a “large” dataset that includes the small dataset as well as at least a few cases (say 10% of the size of the small dataset) where the simple labeling process makes a mistake that is identified by the complex labeling process.
This may include some hypothetical data, e.g. showcasing situations in which the simple labeling process could make an error.</li><li>We use the “small” dataset in order to define the consistency predicate C needed in step 1 (i.e. we condition on the two heads agreeing on the small dataset). But we use the “large” dataset in order to define the likelihood L (i.e. we condition on the first head correctly predicting the large dataset).</li></ul><p>Why think this would work? Imagine that there are now three possible models — the intended model, the simple model (that predicts the simple labeling process), and the complex model (that predicts the complex labeling process). Suppose that these models satisfy the following properties:</p><ul><li>All models agree on the small dataset. The complex model and the intended model agree on the large dataset. There are inputs in the large dataset where the simple model disagrees with the other two.</li><li>The simple model is simpler (i.e. higher prior under SGD) than the complex model. There are a few reasons this could happen (and we just need one), but I am most optimistic about achieving this when the complex labeling process itself uses ML systems as assistants or to generate hypothetical situations. It’s not easy for the ML system to represent or simulate another model as large as itself, and so it seems like it needs to do something more tricky (e.g. figuring out the equilibria of a game played by multiple peer models). I don’t think that it necessarily takes a huge number of bits to do this well, but I think it may well be more than the simple model. (My biggest concern is that the “simple” model may also need to use similarly-complex tricks to be computationally efficient.)</li><li>(For concreteness, I’ll say that the simple model takes 1,000 bits, the complex model takes 2,000 bits, and the intended model takes 10,000 bits.)</li><li>The small dataset is still rich enough that you could infer correct language usage from it, i.e. the consistency condition on the small dataset alone suffices to recover all 10,000 bits required to specify the intended model.</li><li>Agreement between the complex and simple heads on the small dataset isn’t nearly that surprising. Under typical conditions the complex labeling process will basically just defer to the simple labeling process; knowing what counts as an “error” tells you something about humans but it’s not very much relative to the complexity of natural language. (This may also be a dicey assumption.)</li></ul><p>In the two-headed model, each of the heads could independently learn any of the (simple, complex, or intended) heads. This yields 9 models; which of them could we end up learning?</p><ul><li>(intended, simple) is what we’re hoping for. It costs 11,000 bits to specify both heads, and we save 10,000 bits. So the total cost is N+1,000.</li><li>(intended, complex) or (intended, intended) would also be fine to learn, but cost N+2,000 or N+10,000 bits.</li><li>(simple, X) doesn’t match the data well, since there are datapoints in the large dataset where the simple labeling process goes wrong.</li><li>(complex, simple) costs at least 2,000 bits to specify. This would be a bad model to learn, but we will avoid it as long as agreement between the complex and simple models has probability at least 2^{-1,000} over a random world model. 
Intuitively it’s plausible that this would be much more likely than the simple model agreeing with some very specific ways of using language encoded in the intended model.</li><li>(complex, complex) costs 2,000 bits to specify the heads, for a total cost of N+2,000.</li><li>(complex, intended) costs 12,000 bits and saves 10,000 for a total cost of N+2,000. (This calculation technically rests on a further claim about the complex head being harder to specify than the simple head even conditioned on the intended head.)</li></ul><p>So under this somewhat-heroic set of assumptions, it seems like the (intended, simple) pair would have the best prior probability amongst the models that actually fit the data. Pairs other than those 9 may achieve better losses still, and several of those assumptions could easily fail, but I think this approach is still promising and gives some traction on a tricky problem.</p><p>I think this approach basically executes on a common intuition — in order to avoid generalizing poorly because the question-answerer gets blackmailed, just add some training points involving blackmail (and make extra sure to label them correctly, e.g. by generating them hypothetically). But other implementations of this idea seem to run into the problem that the model will just learn whatever process you <em>actually</em> use to generate training data, and you can’t generate datapoints where that process goes wrong.</p><h3>Step 3: Make the training set good enough</h3><p>Steps 1 and 2 relied critically on the assumption that the training set <em>always</em> agrees with the intended model. If they disagree, then it doesn’t matter whether the intended model has a good prior probability, you are still going to learn the instrumental model because it has better likelihood. Even a small advantage for the instrumental model will eventually dominate in a large enough dataset.</p><p>Perfect agreement seems impossible. The question becomes: can we get close enough, and how can we cope with the errors that remain?</p><p>Rather than aiming at “perfect” labels, or trying to be robust to general labeling errors, my current hope is to try to generate labels that <em>aren’t wrong in any way that the model can identify</em>.</p><p>For example, suppose that a human is answering a question about how many chairs are in a room. They look at the room briefly, and fail to notice an occluded chair. Moreover, even when humans look more carefully at a video later (the “complex” labeling process), they also don’t see the occluded chair and so believe (incorrectly) that the original label was correct.</p><p>We are only concerned about the case where this bad label leads to a bad gradient update, and especially where it pushes us from the intended model towards the instrumental model. We aren’t concerned about cases where our model has no idea about the chair either, or where our model has no chance of correctly reporting the presence of the chair. I think that gives us an important leg up in avoiding the problem, if we are able to use the existing model to help make sure our data is good.</p><p>I don’t have a clean proposal for making the dataset good enough, but I do have various thoughts that leave me feeling more optimistic:</p><ul><li>I don’t think we should use supervised learning. Instead, we should get pairs of answers from the model and then only add them to the dataset when we are confident that one of them is better than the other. 
So we are only in trouble if the intended model actually generated a good answer, but we incorrectly said we were confident that it was worse. (Note that we may still need to use likelihood for pre-training or joint training for competitiveness reasons, and we’ll need to adapt the whole scheme to that case.)</li><li>Moreover, we are only worried if the model understands the dynamics of the data-generating process and why it made an error, since otherwise the instrumental model won’t actually perform well either. If our model answers a rich enough set of questions, we can effectively interact with it to extract an explanation for why we erred. We just need to be able to recognize this kind of story-of-error as plausible rather than confidently dismissing it. Tons of details would be needed to make anything like this work, but overall it seems promising.</li><li>We might be concerned that the model will instead learn to make up complex stories about how we are wrong. And indeed this training process won’t actively penalize such models as long as they are able to make the stories plausible enough. But such models also aren’t <em>favored</em>, and if steps 1 and 2 are successful then the inductive bias will favor the intended model over this kind of instrumental model (assuming that the dataset is rich enough, even after filtering for confidence, that you can’t save a ton of description complexity by this kind of bullshitting).</li><li>We can take simple precautions — like “don’t collect training data in situations where strong adversaries are trying to mislead us.” Many approaches to AI make this a bit of a circular dependency — our AI predicts well, and so can help protect us from manipulation, only as long as the AI system predicts that no manipulation would occur in the future. But in this setting we have no such dependence — in order to have the model work correctly at time <em>t</em> we just have to make sure we didn’t train on corrupted data at times prior to <em>t</em>.</li><li>Reducing errors seems to require doing <a href="https://ai-alignment.com/reliability-amplification-a96efa115687">reliability amplification</a> (though maybe just ensembling is sufficient) and <a href="https://ai-alignment.com/universality-and-security-amplification-551b314a3bab">security amplification</a> (or some other approach to being secure whenever the intended model is). Those are more like hopes than algorithms; we’d need to do a lot of work, and think a lot about concrete examples of possible errors, to find something that might actually work.</li><li>These bullets are all gesturing at one approach to this problem, but I think there are a ton of ways you could perform machine learning with “potentially wrong” data to prevent a small number of errors from causing trouble. This feels closer to a traditional problem in AI.
I haven’t thought about this problem much because I’ve been more focused on the fear that we wouldn’t learn even with perfect data, but I feel relatively optimistic that there are a lot of approaches to take to dataset errors if that’s actually the crux of the problem.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=800b42a14f66" width="1" height="1" alt=""><hr><p><a href="https://ai-alignment.com/a-problem-and-three-ideas-800b42a14f66">Teaching ML to answer questions honestly instead of predicting human answers</a> was originally published in <a href="https://ai-alignment.com">AI Alignment</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>