<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <title>cfallin.org</title>
    <link rel="self" type="application/atom+xml" href="https://cfallin.org/feed.xml"/>
    <link rel="alternate" type="text/html" href="https:&#x2F;&#x2F;cfallin.org"/>
    <generator uri="https://www.getzola.org/">Zola</generator>
    <updated>2026-04-10T08:54:50.104619253-07:00</updated>
    <id>https://cfallin.org/feed.xml</id>
    <entry xml:lang="en">
        <title>The acyclic e-graph: Cranelift&#x27;s mid-end optimizer</title>
        <published>2026-04-09T00:00:00+00:00</published>
        <updated>2026-04-09T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2026/04/09/aegraph/"/>
        <id>https://cfallin.org/blog/2026/04/09/aegraph/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2026/04/09/aegraph/">&lt;p&gt;Today, I&#x27;ll be writing about the &lt;em&gt;aegraph&lt;&#x2F;em&gt;, or &lt;em&gt;acyclic egraph&lt;&#x2F;em&gt;, the
data structure at the heart of Cranelift&#x27;s mid-end optimizer. I
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;cranelift-egraph.md&quot;&gt;introduced this approach in
2022&lt;&#x2F;a&gt;
and, after a somewhat circuitous path involving one full rewrite, a
number of interesting realizations and &quot;patches&quot; to the initial idea,
various discussions with the wider e-graph community (including a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;vimeo.com&#x2F;843540328&quot;&gt;talk&lt;&#x2F;a&gt;
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;pubs&#x2F;egraphs2023_aegraphs_slides.pdf&quot;&gt;slides&lt;&#x2F;a&gt;)
at the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;pldi23.sigplan.org&#x2F;home&#x2F;egraphs-2023&quot;&gt;EGRAPHS workshop at PLDI
2023&lt;&#x2F;a&gt; and a recent talk
and discussions at the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.dagstuhl.de&#x2F;en&#x2F;seminars&#x2F;seminar-calendar&#x2F;seminar-details&#x2F;26022&quot;&gt;e-graphs Dagstuhl
seminar&lt;&#x2F;a&gt;),
and a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;commits&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;opts&quot;&gt;whole bunch of contributed rewrite rules over the past three
years&lt;&#x2F;a&gt;,
it is time that I describe the &lt;em&gt;why&lt;&#x2F;em&gt; (why an e-graph? what benefits
does it bring?), the &lt;em&gt;how&lt;&#x2F;em&gt; (how did we escape the pitfalls of full
equality saturation? how did we make this efficient enough to
productionize in Cranelift?), and the &lt;em&gt;how much&lt;&#x2F;em&gt; (does it help? how
can we evaluate it against alternatives?).&lt;&#x2F;p&gt;
&lt;p&gt;For those who are already familiar with Cranelift&#x27;s mid-end and its
aegraph, note that I&#x27;m taking a slightly different approach in this
post. I&#x27;ve come to the viewpoint that the &quot;sea-of-nodes&quot; aspect of our
aegraph, and the translation passes we&#x27;ve designed to translate into
and out of it (with optimizations fused in along the way), are
actually more fundamental than the &quot;multi-representation&quot; part of the
aegraph, or in other words, the &quot;equivalence class&quot; part itself. I&#x27;m
choosing to introduce the ideas sea-of-nodes-first in this post,
so we will see a &quot;trivial eclass of one enode&quot; version of the aegraph
first (no union nodes), then motivate unions later. In actuality, when
I was experimenting with, and then building, this functionality in Cranelift in
2022, the desire to integrate e-graphs came first, and aegraphs were
created to make them practical; the pedagogy and design taxonomy have
only become clear to me over time. With that, let&#x27;s jump in!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;initial-context-fixpoint-loops-and-the-pass-ordering-problem&quot;&gt;Initial context: Fixpoint Loops and the Pass-Ordering Problem&lt;&#x2F;h2&gt;
&lt;p&gt;Around May of 2022, I had introduced a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;4163&quot;&gt;simple alias analysis and
related
optimizations&lt;&#x2F;a&gt;
(removing redundant loads, and doing store-to-load forwarding). It
worked fine on all of the expected test cases, and we saw real speedup
on a few benchmarks (e.g. 5% on &lt;code&gt;meshoptimizer&lt;&#x2F;code&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;4163#issuecomment-1130829170&quot;&gt;here&lt;&#x2F;a&gt;)
but led to a new question as well: how should we integrate this pass
with our other optimization passes, which at the time included GVN
(global value numbering), LICM (loop-invariant code motion), constant
propagation and some algebraic rewrites?&lt;&#x2F;p&gt;
&lt;p&gt;To see why this is an interesting question, consider how GVN, which
canonicalizes values, and redundant load elimination interact, on the
following IR snippet:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v2 = load.i64 v0+8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v3 = iadd v2, v1   ;; e.g., array indexing&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v4 = load.i8 v3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; ... (no stores or other side effects here) ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v10 = load.i64 v0+8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v11 = iadd v10, v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v12 = load.i8 v11&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Redundant load elimination (RLE) will be able to see that the load
defining &lt;code&gt;v10&lt;&#x2F;code&gt; can be removed, and &lt;code&gt;v10&lt;&#x2F;code&gt; can be made an alias of
&lt;code&gt;v2&lt;&#x2F;code&gt;, in a single pass. In a perfect world, we should then be able to
see that &lt;code&gt;v11&lt;&#x2F;code&gt; becomes the same as &lt;code&gt;v3&lt;&#x2F;code&gt; by means of GVN&#x27;s
canonicalization, and subsequently, &lt;code&gt;v12&lt;&#x2F;code&gt; becomes an alias of
&lt;code&gt;v4&lt;&#x2F;code&gt;. But those last two steps imply a tight cooperation between two
different optimization passes: we need to run one full pass of RLE
(result: &lt;code&gt;v10&lt;&#x2F;code&gt; rewritten), then one full pass of GVN (result: &lt;code&gt;v11&lt;&#x2F;code&gt;
rewritten), then one additional full pass of RLE (result: &lt;code&gt;v12&lt;&#x2F;code&gt;
rewritten). One can see that an arbitrarily long chain of such
reasoning steps, bouncing through different passes, might require an
arbitrarily long sequence of pass invocations to fully simplify. Not
good!&lt;&#x2F;p&gt;
&lt;p&gt;This is known as the &lt;em&gt;pass-ordering problem&lt;&#x2F;em&gt; in the study of compilers
and is a classical heuristic question with no easy answers as long as
the passes remain separate, coarse-grained algorithms (i.e., not
interwoven). To permit some interesting cases to work in the initial
Cranelift integration of alias analysis-based rewrites, I made a
somewhat &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;4163&#x2F;changes#diff-c7a66b91ac03843c5aafe984938022ccba235c80c3fad786772964dc7b9da152R166-R170&quot;&gt;ad-hoc
choice&lt;&#x2F;a&gt;
to invoke GVN once after the alias-analysis rewrite pass.&lt;&#x2F;p&gt;
&lt;p&gt;But this is clearly arbitrary and wastes compilation effort in the
common case; we should be able to do better. In general, the solution
should reason about all passes&#x27; possible rewrites in a unified
framework, and interleave them in a fine-grained way: so, for example,
if we can apply RLE then GVN five times in a row just for one
localized expression, we should be able to do that, without running
each of these passes on the whole function body. In other words, we
want a &quot;single fixpoint loop&quot; that iterates until optimization is done
at a fine granularity.&lt;&#x2F;p&gt;
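&lt;p&gt;To preview what &quot;fine granularity&quot; means here, consider a toy fixpoint loop (illustrative Python, not Cranelift code; the two rule functions are hypothetical stand-ins for cprop and an algebraic rewrite) that keeps applying local rules to one expression until none fire:&lt;&#x2F;p&gt;

```python
# A toy "single fixpoint loop" (illustrative; not Cranelift code).
# Expressions are nested tuples: ("const", 3), ("var", "x"),
# ("add", lhs, rhs). The point is that rules interleave at the level
# of a single expression rather than as separate whole-function passes.

def cprop(node):
    # constant-fold: a + b where both operands are constants
    if node[0] == "add" and node[1][0] == "const" and node[2][0] == "const":
        return ("const", node[1][1] + node[2][1])
    return None

def add_zero(node):
    # algebraic identity: x + 0 -> x
    if node[0] == "add" and node[2] == ("const", 0):
        return node[1]
    return None

RULES = [cprop, add_zero]

def simplify(node):
    # Simplify operands first, then apply rules here until none fire:
    # a chain like "cprop enables add_zero" resolves in one traversal.
    if node[0] == "add":
        node = ("add", simplify(node[1]), simplify(node[2]))
    changed = True
    while changed:
        changed = False
        for rule in RULES:
            rewritten = rule(node)
            if rewritten is not None:
                node = rewritten
                changed = True
    return node
```

&lt;p&gt;For example, &lt;code&gt;simplify&lt;&#x2F;code&gt; reduces &lt;code&gt;x + (1 + -1)&lt;&#x2F;code&gt; to &lt;code&gt;x&lt;&#x2F;code&gt;: constant folding of the inner add enables the algebraic rule, with no pass-level iteration in between.&lt;&#x2F;p&gt;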
&lt;h2 id=&quot;three-building-blocks-rewrites-code-motion-and-canonicalization&quot;&gt;Three Building Blocks: Rewrites, Code Motion, and Canonicalization&lt;&#x2F;h2&gt;
&lt;p&gt;Let&#x27;s review the optimizations we had at this point:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;GVN (global value numbering), which is a &lt;em&gt;canonicalization&lt;&#x2F;em&gt;
operation: within a given scope where a value is defined (for SSA
IRs, the subtree of the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Dominator_(graph_theory)&quot;&gt;dominance
tree&lt;&#x2F;a&gt; below
a given definition), any identical computations of that value should
be canonicalized to the original one.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;LICM (loop-invariant code motion), which is a &lt;em&gt;code-motion&lt;&#x2F;em&gt;
operation: computations that occur within a loop, but whose value is
guaranteed to be the same on each iteration, should be moved
out. Loop invariance can be defined recursively: a value is
loop-invariant if it is defined outside the loop, or if it is a pure
operator inside the loop whose arguments are all loop-invariant. The
transform doesn&#x27;t change any operators; it only moves where they
occur.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Constant propagation (cprop) and algebraic rewrites: these are
transforms like rewriting &lt;code&gt;1 + 2&lt;&#x2F;code&gt; to &lt;code&gt;3&lt;&#x2F;code&gt; (cprop) or &lt;code&gt;x + 0&lt;&#x2F;code&gt; to &lt;code&gt;x&lt;&#x2F;code&gt;
(algebraic). They can all be expressed as substitutions for
expressions that match a given pattern.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Redundant load elimination and store-to-load forwarding: these both
replace &lt;code&gt;load&lt;&#x2F;code&gt; operators with the SSA value that operator is known
to load.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;And one that we wanted to implement: &lt;em&gt;rematerialization&lt;&#x2F;em&gt;, which
reduces register pressure for values that are easier to recompute on
demand (e.g., integer constants) by re-defining them with a new
computation. This can be seen as a kind of code motion as well.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;As a start to thinking about frameworks, we can categorize the above
into &lt;em&gt;code motion&lt;&#x2F;em&gt;, &lt;em&gt;canonicalization&lt;&#x2F;em&gt;, and &lt;em&gt;rewrites&lt;&#x2F;em&gt;. Code motion is
what it sounds like: it involves moving where a computation occurs,
but not changing it otherwise. Canonicalization is the unifying of
more than one instance of a computation into one (&quot;canonical&quot;)
instance. And rewrites are any optimization that replaces one
expression with another that should compute the same value. Said more
intuitively (and colloquially), these three categories attempt to
cover the whole space of possibilities for &quot;simple&quot; optimizations: one
can move code, merge identical code, or replace code with equivalent
code. (The notable missing possibility here is the ability to change
control flow and&#x2F;or make use of control-flow-related reasoning; more
on that in a later section.) Thus, if we can build a framework that
handles these kinds of transforms, we should have a good
infrastructure for the next steps in Cranelift&#x27;s evolution.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;ir-design-sea-of-nodes-and-intermediate-points&quot;&gt;IR Design, Sea-of-Nodes, and Intermediate Points&lt;&#x2F;h2&gt;
&lt;p&gt;From first principles, one might ask: how &lt;em&gt;should&lt;&#x2F;em&gt; a unifying
framework for these concerns look? Code motion and canonicalization
together imply that perhaps computations (operator nodes) should &lt;em&gt;not&lt;&#x2F;em&gt;
have a &quot;location&quot; in the program, whenever that can be avoided. In
other words, perhaps we should find a way to represent &lt;code&gt;add v1, v2&lt;&#x2F;code&gt; in
our IR without putting it somewhere concrete in the control flow. Then
all instances of that same computation would be merged (because
duplicates would differ only by their location, which we removed), and
code motion is... inapplicable, because code does not have a location?&lt;&#x2F;p&gt;
&lt;p&gt;Well, not quite: the idea is that one &lt;em&gt;starts&lt;&#x2F;em&gt; with a conventional IR
(with control flow), and &lt;em&gt;ends&lt;&#x2F;em&gt; with it too, but &lt;em&gt;in the middle&lt;&#x2F;em&gt; one
can eliminate locations where possible. So in the transition &lt;em&gt;to&lt;&#x2F;em&gt; this
representation, we erase locations, and canonicalize; and in the
transition &lt;em&gt;from&lt;&#x2F;em&gt; this representation, we re-assign locations, and
code-motion can be a side-effect of how we do that.&lt;&#x2F;p&gt;
&lt;p&gt;What we just described above is called a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sea_of_nodes&quot;&gt;sea-of-nodes&lt;&#x2F;a&gt; IR.  A
sea-of-nodes IR is one that dispenses with a classical &quot;sequential
order&quot; for all instructions or operators in the program, and instead
builds a graph (the &quot;sea&quot;) of operators (the &quot;nodes&quot;) with edges to
denote the actual dependencies, either for dataflow or control flow.&lt;&#x2F;p&gt;
&lt;p&gt;In the purest form of this design, one can represent &lt;em&gt;every IR
transform&lt;&#x2F;em&gt; as a graph rewrite, because a graph is all there is. For
example, LICM, a kind of code motion that hoists a computation out of
a loop, is a purely algebraic rewrite on the subgraph representing the
loop body. This is because the loop itself is a kind of node in the
sea of nodes, with control-flow edges like any other edge; code motion
is not a &quot;special&quot; action outside the scope of the expression
language (nodes and their operands).&lt;&#x2F;p&gt;
&lt;p&gt;While that kind of flexibility is tempting, it comes with a
significant complexity tax as well: it means that reasoning through
and implementing classical compiler analyses and transforms is more
difficult, at least for existing compiler engineers with their
experience, because the IR is so different from the classical data
structure (CFG of basic blocks). The V8 team &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;v8.dev&#x2F;blog&#x2F;leaving-the-sea-of-nodes&quot;&gt;wrote about this
difficulty&lt;&#x2F;a&gt; recently as
support for their decision to migrate away from a pure Sea-of-Nodes
representation.&lt;&#x2F;p&gt;
&lt;p&gt;However, we might achieve some progress toward our goal -- providing a
general framework for rewrites, code motion and canonicalization -- if
we take inspiration from sea-of-nodes&#x27; handling of &lt;em&gt;pure&lt;&#x2F;em&gt;
(side-effect-free) operators, and the way that they can &quot;float&quot; in the
sea, unmoored by any anchor other than actual inputs and outputs
(dataflow edges). Stated succinctly: what if we kept the CFG for the
side-effectful instructions (call it the &quot;side-effect skeleton&quot;) and
used a sea-of-nodes for the rest?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2026-03-29-cfg-skeleton-web.svg&quot; alt=&quot;Figure: sea-of-nodes-with-CFG&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This would allow us to unify code motion, canonicalization and
rewrites, as described above: canonicalization works on pure
operators, because we remove distinctions based on location;
code-motion can occur when we put pure operators back in the CFG; and
rewrites can occur on pure operators. In fact rewrites are now both
(i) simpler to reason about, because we don&#x27;t have to place expression
nodes at locations in an IR, only create them &quot;floating in the air&quot;,
and (ii) more efficient, because they occur once on a canonicalized
instance of an expression, rather than all instances separately.&lt;&#x2F;p&gt;
&lt;p&gt;We&#x27;ll call this representation a &quot;sea-of-nodes with CFG&quot;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;implementing-sea-of-nodes-with-cfg&quot;&gt;Implementing Sea-of-Nodes-with-CFG&lt;&#x2F;h3&gt;
&lt;p&gt;Now, to practical implementation: architecting the entire compiler
around sea-of-nodes for pure operators might make sense from first
principles, but as a modification of the existing Cranelift compiler
pipeline, we would not want to (or be able to) make such a radical
change in one step. Rather, I wanted to build this as a replacement
for the mid-end, taking CLIF (our conventional CFG-based SSA IR) as
input and producing CLIF as output. So we need a three-stage
optimizer:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Lift all pure operators out of the CFG, leaving behind the
skeleton. Put these operators into the &quot;sea&quot; of pure computation
nodes, deduplicating
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Hash_consing&quot;&gt;hash-consing&lt;&#x2F;a&gt;) as we
go.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Perform rewrites on these operators, replacing some values with
others according to whatever rules we have that preserve value
equivalence.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Convert this sea-of-pure-nodes back to sequential IR by scheduling
nodes into the CFG. We&#x27;ll call this process &quot;elaboration&quot; of the
computations.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;This is in fact how the heart of Cranelift&#x27;s mid-end now works; we&#x27;ll
go through each part above in turn.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;into-sea-of-nodes-with-cfg-canonicalization&quot;&gt;Into Sea-of-Nodes-with-CFG: Canonicalization&lt;&#x2F;h3&gt;
&lt;p&gt;Let&#x27;s talk about how we get &lt;em&gt;into&lt;&#x2F;em&gt; the sea-of-nodes representation
first. The most straightforward answer, of course, would be to simply
&quot;remove the nodes from the CFG&quot; and let them free-float, referenced by
their uses that remain in the skeleton -- and that&#x27;s it. But that
gives up on the obvious opportunity offered by the fact that these
operators are &lt;em&gt;pure&lt;&#x2F;em&gt; (have no side-effects, or implicit dependencies
on the rest of the world): an operator &lt;code&gt;op v1, v2&lt;&#x2F;code&gt; &lt;em&gt;always&lt;&#x2F;em&gt; produces
the same value given the same inputs, and two separate instances of
this node have no distinguishing features or other properties that
should lead to different results. Hence, we should canonicalize, or
&lt;em&gt;hash-cons&lt;&#x2F;em&gt;, nodes.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Hash_consing&quot;&gt;Hash-consing&lt;&#x2F;a&gt; is a
standard technique in systems that have value- or operator-nodes: the
idea is to keep a lookup table indexed by the contents of each value
or operator, perform lookups in this table when creating a new node,
and reuse existing nodes when a match occurs.&lt;&#x2F;p&gt;
&lt;p&gt;What is the &lt;em&gt;equivalence class&lt;&#x2F;em&gt; by which we deduplicate? (In other
words, more concretely, how do we define &lt;code&gt;Eq&lt;&#x2F;code&gt; and &lt;code&gt;Hash&lt;&#x2F;code&gt; on
sea-of-nodes values?) We adopt a very simple answer (and deal with
subtleties later, as is often the case!): the (shallow) content of a
given node is its identity. In other words, if we have &lt;code&gt;iadd v1, v2&lt;&#x2F;code&gt;,
then that is &quot;equal to&quot; (deduplicates with) any other such operator.&lt;&#x2F;p&gt;
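&lt;p&gt;Concretely, a minimal hash-consing table fits in a few lines of Python (illustrative only; Cranelift&#x27;s actual node representation differs). A tuple of opcode plus operand value numbers gives us shallow structural equality and hashing for free, so a plain dict serves as the table:&lt;&#x2F;p&gt;

```python
# Minimal hash-consing sketch (illustrative; not Cranelift's real data
# structure). A node's identity is its shallow content: the opcode plus
# the value numbers of its operands.

hashcons = {}    # (opcode, *operand value numbers) -> value number
next_value = 0

def intern(opcode, *args):
    # Return the canonical value number for this shallow node,
    # allocating a fresh one only if no equal node exists yet.
    global next_value
    key = (opcode, *args)
    if key not in hashcons:
        hashcons[key] = next_value
        next_value += 1
    return hashcons[key]
```

&lt;p&gt;Interning &lt;code&gt;iadd v0, v1&lt;&#x2F;code&gt; a second time returns the same value number as the first, while &lt;code&gt;iadd v1, v0&lt;&#x2F;code&gt; (different shallow content) gets its own.&lt;&#x2F;p&gt;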
&lt;p&gt;Now, this shallow notion of equality may not seem like enough to
canonicalize all instances of the same expression tree. Consider if we
had&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v0 = ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v1 = ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v2 = iadd v0, v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v3 = iconst 42&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v4 = imul v2, v3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v5 = iadd v0, v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v6 = iconst 42&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v7 = imul v5, v6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Clearly any reasonable canonicalization algorithm should consider &lt;code&gt;v4&lt;&#x2F;code&gt;
and &lt;code&gt;v7&lt;&#x2F;code&gt; to be the same, and condense uses of them into uses of one
canonical node. But the nodes are not &lt;em&gt;shallowly&lt;&#x2F;em&gt; equal. How do we
get from here to there?&lt;&#x2F;p&gt;
&lt;p&gt;One possible answer is induction: we could canonicalize a node only
after all of its operands have been canonicalized (and rewritten), so
we know that if subtrees are identical, we will have identical value
numbers. Thus, inductively, all values would be canonicalized deeply.&lt;&#x2F;p&gt;
&lt;p&gt;This requires processing each node&#x27;s definition before its uses,
however. Fortunately, the SSA CFG from which we are constructing the
sea-of-nodes-with-CFG provides us this property already if we traverse
it in a particular order: we need to visit blocks in the control-flow
graph in some &lt;em&gt;preorder&lt;&#x2F;em&gt; of the dominance tree (domtree), which we
usually have available already.&lt;&#x2F;p&gt;
&lt;p&gt;So we have an algorithm something like the following pseudo-code to
canonicalize the SSA CFG into a sea-of-nodes-with-CFG:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def canonicalize(basic_block):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for inst in basic_block:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if is_pure(inst):                 # only dedup and move to sea-of-nodes for &amp;quot;pure&amp;quot; insts;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                      # leave the &amp;quot;skeleton&amp;quot; in place&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      basic_block.remove(inst)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      inst.rename_values(rename_map)  # rewrite uses according to a value-&amp;gt;value map&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      if inst in hashcons_map:        # equality defined by shallow content&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        rename_map[inst.value] = hashcons_map[inst]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        nodes.push(inst)              # add to the sea-of-nodes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        hashcons_map[inst] = inst.value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      # we still need to rename the CFG skeleton&amp;#39;s uses to refer to sea-of-nodes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      inst.rename_values(rename_map)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  # recursive domtree-preorder traversal.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for child in domtree.children(basic_block):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    canonicalize(child)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This will handle not only the above example, where we have &quot;deep
equality&quot; (because we will canonicalize and rename e.g. &lt;code&gt;v5&lt;&#x2F;code&gt; into &lt;code&gt;v2&lt;&#x2F;code&gt;
before visiting &lt;code&gt;v5&lt;&#x2F;code&gt;&#x27;s use), but also more complex examples with the
redundancies spread across basic blocks.&lt;&#x2F;p&gt;
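&lt;p&gt;A runnable (if heavily simplified) rendering of the pass above, restricted to a single straight-line block of pure, single-result instructions, might look like the following; the tuple encoding of instructions is invented for illustration, and immediates are folded into the opcode string for brevity.&lt;&#x2F;p&gt;

```python
# Simplified runnable version of the canonicalization pass for one
# straight-line block of pure, single-result instructions (illustrative;
# real CLIF has a CFG, impure instructions, and multi-result ops).
# Each instruction is (defined_value, opcode, [argument_values]).

def canonicalize_block(insts):
    rename_map = {}    # old value -> canonical value
    hashcons_map = {}  # (opcode, *canonical args) -> canonical value
    sea = []           # deduplicated nodes lifted into the "sea"
    for value, opcode, args in insts:
        # Rewrite uses first: operands are already canonical, so the
        # shallow equality below is deep equality by induction.
        args = [rename_map.get(a, a) for a in args]
        key = (opcode, *args)
        if key in hashcons_map:
            rename_map[value] = hashcons_map[key]
        else:
            sea.append((value, opcode, args))
            hashcons_map[key] = value
    return sea, rename_map
```

&lt;p&gt;On the &lt;code&gt;v2&lt;&#x2F;code&gt;..&lt;code&gt;v7&lt;&#x2F;code&gt; example above, this renames &lt;code&gt;v5&lt;&#x2F;code&gt; to &lt;code&gt;v2&lt;&#x2F;code&gt; and &lt;code&gt;v6&lt;&#x2F;code&gt; to &lt;code&gt;v3&lt;&#x2F;code&gt;, so &lt;code&gt;v7&lt;&#x2F;code&gt; becomes shallowly equal to &lt;code&gt;v4&lt;&#x2F;code&gt; and deduplicates with it.&lt;&#x2F;p&gt;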
&lt;p&gt;Finally: how does the &quot;-with-CFG&quot; aspect of all of this work? So far, we
have very much glossed over any values that are defined in the CFG
skeleton, other than to imply above that they are never renamed (because
we never take the &lt;code&gt;is_pure&lt;&#x2F;code&gt; branch). But is this OK?&lt;&#x2F;p&gt;
&lt;p&gt;Yes, in a sense, by construction: we have defined all impure values to
have their own &quot;identity&quot;, distinct from any other such value, even if
shallowly equal at a syntactic level. This aligns with the notion that
impure computations have implicit inputs: for example, &lt;code&gt;load v0&lt;&#x2F;code&gt;
appearing twice in the program may produce different values at those
two different times, so we cannot deduplicate it. This can be relaxed
if we have a dedicated analysis that can reason about such implicit
dependencies, and in fact for loads we do have one (alias analysis,
feeding into redundant-load elimination and store-to-load
forwarding). But in general, we cannot do anything with these
&quot;roots&quot;. Rather, they stay in the skeleton, feed values into the sea
of nodes, and consume values back out of that sea of nodes.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;out-of-sea-of-nodes-with-cfg-scoped-elaboration&quot;&gt;Out of Sea-of-Nodes-with-CFG: Scoped Elaboration&lt;&#x2F;h3&gt;
&lt;p&gt;Given a sea-of-nodes + skeleton representation of a program, how do we
go back to a conventional CFG, with fully linearized operators (i.e.,
each of which has a concrete program-point where it is computed), to
feed to the compiler backend and lower to machine code?&lt;&#x2F;p&gt;
&lt;p&gt;The basic task is to decide a location at which to put each
operator. Since nodes in the sea-of-nodes are &quot;rooted&quot; (referenced and
ultimately computed&#x2F;used) by side-effectful operators in the CFG
skeleton, the first idea one might have is to copy pure nodes back
into the CFG where they are referenced. One could do this recursively:
if e.g. we have a side-effecting instruction &lt;code&gt;store v1, v2&lt;&#x2F;code&gt;, we can
place the (pure operator) definitions of &lt;code&gt;v1&lt;&#x2F;code&gt; and &lt;code&gt;v2&lt;&#x2F;code&gt; just before
this instruction; if those definitions require other values, likewise
compute them first. We could call this &quot;elaboration&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s consider the single-basic-block case first and then define
something like the following pseudocode:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def demand_based_elaboration(bb):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for inst in bb:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    elaborate_inst(inst, bb, before=inst)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate_inst(inst, bb, before):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for value in inst.args:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    inst.rewrite_arg(value, elaborate_value(value, bb, before=before))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if is_pure(inst):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    bb.insert_before(before, inst)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return inst.def&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate_value(value, bb, before):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if defined_by_inst(value):   # some values are blockparam roots, not inst defs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return elaborate_inst(value.inst, bb, before)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This would certainly work, but is far too simple: it &lt;em&gt;duplicates&lt;&#x2F;em&gt;
computation every time a value is used, and no value (other than
blockparam roots) is ever used more than once. This will almost
certainly result in extreme blowup in program size!&lt;&#x2F;p&gt;
&lt;p&gt;So if we use a value multiple times, it seems that we should compute
it &lt;em&gt;once&lt;&#x2F;em&gt;, some place in the program before any of the uses. For
example, perhaps we could augment the above algorithm with a map that
records the resulting value number the first time we elaborate a node,
and reuses it (i.e., memoizes the elaboration):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate_value(value, bb, before):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if value in elaborated:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return elaborated[value]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  elif defined_by_inst(value):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    result = elaborate_inst(value.inst, bb, before)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    elaborated[value] = result&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return result&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This modified algorithm will handle the case of a single block with
reuse efficiently, computing a value the first time it is used (&quot;on
demand&quot;) as expected.&lt;&#x2F;p&gt;
&lt;p&gt;Now let&#x27;s consider &lt;em&gt;multiple&lt;&#x2F;em&gt; basic blocks. One might be tempted to
wrap the above with a traversal, as we did for the translation into
sea-of-nodes:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate_domtree(bb):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  demand_based_elaboration(bb)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for child in domtree.children(bb):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    elaborate_domtree(child)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate(func):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  elaborate_domtree(func.entry)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But this, too, has an issue. Consider a program that began as a CFG
with many paths, two of which compute the same value:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2026-03-29-cfg-web.svg&quot; alt=&quot;Figure: CFG with some redundancy between code paths&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;If we define some traversal over all basic blocks to perform an
elaboration as above, with a single map &lt;code&gt;elaborated&lt;&#x2F;code&gt;, we will&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Elaborate a computation of &lt;code&gt;v2&lt;&#x2F;code&gt; in &lt;code&gt;bb2&lt;&#x2F;code&gt; and use it there;&lt;&#x2F;li&gt;
&lt;li&gt;Use it in &lt;code&gt;bb3&lt;&#x2F;code&gt; as well in place of &lt;code&gt;v3&lt;&#x2F;code&gt;, since it has already been
computed and is thus memoized;&lt;&#x2F;li&gt;
&lt;li&gt;And thus generate &lt;em&gt;invalid SSA&lt;&#x2F;em&gt;, where a value is used on a path
where it is never computed!&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Perhaps we could hoist the computation to a &quot;common ancestor&quot; of all
of its uses instead. Here that would be &lt;code&gt;bb1&lt;&#x2F;code&gt;. But that creates yet
another problem: if control flows from &lt;code&gt;bb1&lt;&#x2F;code&gt; to &lt;code&gt;bb4&lt;&#x2F;code&gt;, then we will
have computed the value and never used it -- in supposedly optimized
code! This is sometimes called a &quot;partial redundancy&quot;: a computation
that is sometimes unused, depending on control flow. We would like to
avoid this if possible.&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that this problem exactly corresponds to &lt;em&gt;common
subexpression elimination&lt;&#x2F;em&gt; (CSE), which aims to find one place to
compute a value possibly used multiple times. The usual approach in
SSA code, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Value_numbering&quot;&gt;global value
numbering&lt;&#x2F;a&gt; (GVN),
solves the problem by reasoning about &lt;em&gt;scopes&lt;&#x2F;em&gt;, where a &quot;scope&quot; is the
region in which a value has already been computed. The intuition is
that at any given use, we can cast a &quot;shadow&quot; downward and remove
redundant uses but only in that shadow. So in our example program, if
&lt;code&gt;bb1&lt;&#x2F;code&gt; computed &lt;code&gt;v2&lt;&#x2F;code&gt; then we could reuse it in &lt;code&gt;bb2&lt;&#x2F;code&gt; and &lt;code&gt;bb3&lt;&#x2F;code&gt;; but
because it occurs independently in two subtrees with no common
ancestor that computes it, we do nothing; we &lt;em&gt;duplicate it&lt;&#x2F;em&gt; (re-elaborate it).&lt;&#x2F;p&gt;
&lt;p&gt;SSA &quot;scopes&quot; -- regions in which a value can be used -- are defined by
the dominance relation, and so we can work with a domtree traversal to
implement the needed behavior. Concretely, we can do a domtree
preorder traversal; we can keep the &lt;code&gt;elaborated&lt;&#x2F;code&gt; map but separate it
into scope &quot;overlays&quot;, and push a new overlay for each subtree. This
formalizes the &quot;shadow&quot; intuition above. We call this &lt;em&gt;scoped
elaboration&lt;&#x2F;em&gt;. Pseudo-code follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def find_in_scope(value, scope):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if value in scope.map:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return scope.map[value]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  elif scope.parent:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return find_in_scope(value, scope.parent)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return None&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate_value(value, bb, before, scope):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if find_in_scope(value, scope):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate_domtree(bb, scope):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  demand_based_elaboration(bb, scope)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for child in domtree.children(bb):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    subscope = { map = {}, parent = scope }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    elaborate_domtree(child, subscope)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate(func):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  root_scope = { map = {}, parent = None }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  elaborate_domtree(func.entry, root_scope)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;scoped_hash_map.rs&quot;&gt;real
implementation&lt;&#x2F;a&gt;
of our scoped hashmap takes advantage of the fact that keys will not
overlap between overlay layers (because once defined, a value will not
be re-defined in a lower layer), and this enables us to have true O(1)
rather than O(depth) lookup using some tricks with a &lt;code&gt;layer&lt;&#x2F;code&gt; number
and &lt;code&gt;generation&lt;&#x2F;code&gt;-per-layer (see implementation for
details!). Nevertheless, the semantics are the same as above.&lt;&#x2F;p&gt;
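&lt;p&gt;To make that trick concrete, here is a sketch (my simplified model, not
the exact real implementation): each entry remembers the depth and
generation at which it was inserted, and popping a level bumps that
level&#x27;s generation, so stale entries are ignored without walking a
parent chain:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def insert(map, key, value):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  map.table[key] = (map.depth, map.generation[map.depth], value)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def lookup(map, key):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if key in map.table:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (depth, gen, value) = map.table[key]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if gen == map.generation[depth]:  # entry from a still-live layer?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      return value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return None&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def push_level(map):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  map.depth = map.depth + 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def pop_level(map):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  map.generation[map.depth] = map.generation[map.depth] + 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  map.depth = map.depth - 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;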
&lt;p&gt;As we foreshadowed above, just as the problem is closely related to
CSE and GVN, scoped elaboration is as well. In fact, the approach of
tracking a definition-within-scope for scopes that correspond to
subtrees in the domtree, given a preorder traversal on the domtree, is
exactly how &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;465913eb2c91998c99ae9222e47f8e9f9a88a546&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;simple_gvn.rs&quot;&gt;Cranelift&#x27;s old
implementation&lt;&#x2F;a&gt;
works as well. We even borrowed the scoped hashmap implementation!&lt;&#x2F;p&gt;
&lt;p&gt;A few more observations are in order. First, it&#x27;s fairly interesting
that we sometimes &lt;em&gt;re-elaborate&lt;&#x2F;em&gt; a node into multiple dom subtrees;
why is this? Does this introduce inefficiency (e.g. in code size) or
is it the best we can do?&lt;&#x2F;p&gt;
&lt;p&gt;The duplication is, in my opinion, best seen as a &lt;em&gt;dual&lt;&#x2F;em&gt; of the
canonicalization. The original code may have multiple copies of a pure
computation in multiple paths, with no common ancestor that computes
that value. When translating to sea-of-nodes, we will canonicalize
that computation, so we can optimize it once. But then when returning
to the original linearized IR, we may need to restore the &lt;em&gt;original&lt;&#x2F;em&gt;
duplication if there truly was no (non-redundancy-producing)
optimization opportunity. Additionally, and very importantly: we
should never elaborate a value in more than one place unless it &lt;em&gt;also&lt;&#x2F;em&gt;
appeared more than once in the original program. So we should not
grow the program size beyond the original.&lt;&#x2F;p&gt;
&lt;p&gt;Another interesting observation is that by driving elaboration by
demand (from the roots in the side-effecting CFG skeleton), we do
dead-code elimination (DCE) of the pure operations &lt;em&gt;for free&lt;&#x2F;em&gt;. Their
existence in the sea of nodes may cost us some compile time if we
spend effort to optimize them (only to throw them away later); but
anything that becomes dead &lt;em&gt;because of&lt;&#x2F;em&gt; rewrites in sea-of-nodes will
then naturally disappear from the final result.&lt;&#x2F;p&gt;
&lt;p&gt;A third observation is that elaboration gives us a central location to
control when and where code is placed in the final program. In other
words, there is room for us to add &lt;em&gt;heuristics&lt;&#x2F;em&gt; beyond the simplest
version of the algorithm described above. For example: we stated that
we did not want to introduce any partial redundancies. But for
correctness, we don&#x27;t &lt;em&gt;need&lt;&#x2F;em&gt; to adhere to this: our only real
restriction is that a pure computation cannot happen before its
arguments are computed (i.e., we have to obey dataflow dependencies).
So, for example, if we have the &lt;em&gt;loop nest&lt;&#x2F;em&gt; (structure of loops in the
program) available, and a pure computation within a loop does not use
any values that are computed within that loop, we know it is
&lt;em&gt;loop-invariant&lt;&#x2F;em&gt; and we may choose to elaborate it before the loop
begins (into the &quot;preheader&quot;), in a transform known as &lt;em&gt;loop-invariant
code motion&lt;&#x2F;em&gt; (LICM). This is redundant if the loop executes zero
iterations, but most loops execute at least once; and performing a
loop-invariant computation only once can be a huge efficiency
improvement.&lt;&#x2F;p&gt;
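&lt;p&gt;A minimal sketch of such a heuristic, assuming a precomputed loop nest
with preheaders (the names here are illustrative, not Cranelift&#x27;s actual
API): hoist a pure node to the preheader of the outermost enclosing loop
that defines none of its arguments:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def licm_target(node, bb):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  target = bb&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  loop = innermost_loop(bb)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  while loop is not None:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if any(defined_in_loop(arg, loop) for arg in node.args):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      break  # an argument is computed inside this loop: stop hoisting&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    target = loop.preheader&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    loop = parent_loop(loop)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return target&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;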
&lt;p&gt;In the other direction -- pushing computation downward rather than
upward -- we could choose to implement &lt;em&gt;rematerialization&lt;&#x2F;em&gt; by
strategically &lt;em&gt;forgetting&lt;&#x2F;em&gt; a value in the already-elaborated scope and
recomputing it at a new use. Why would we do this? Perhaps it is
cheaper to recompute than to &lt;em&gt;thread the original value through the
program&lt;&#x2F;em&gt;. For example, constant values are very cheap to &quot;compute&quot;
(typically 1 or 2 instructions) but burning a machine register to keep
a constant across a long function can be expensive.&lt;&#x2F;p&gt;
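&lt;p&gt;As a sketch (a hypothetical heuristic, not the exact one Cranelift
uses), elaboration could consult a predicate before reusing an
already-in-scope value:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def use_or_remat(value, use_bb, scope):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  existing = find_in_scope(value, scope)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if existing is None:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return None  # fall back to normal elaboration&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if is_cheap_to_recompute(value) and far_from(existing, use_bb):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # &quot;forget&quot; the prior copy and recompute locally&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return elaborate_fresh_copy(value, use_bb)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return existing&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;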
&lt;p&gt;There is a lot of room for heuristic &lt;em&gt;code scheduling&lt;&#x2F;em&gt; within
elaboration as well (LICM and rematerialization can be seen as
scheduling too, but here I mean the order that operations are
linearized within the block they are otherwise elaborated into). For a
modern
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Out-of-order_execution&quot;&gt;out-of-order&lt;&#x2F;a&gt;
CPU, this may not matter too much to the hardware -- but it &lt;em&gt;may&lt;&#x2F;em&gt;
matter to the register allocator, because reordering instructions
changes the &quot;interference graph&quot;, or the way that different live
register values compete for finite resources (hardware
registers). E.g., pushing an instruction that uses many values for the
last time &quot;earlier&quot; (to eliminate the need to store those values) is
great; but this minimization is not always straightforward.  In fact,
ordering instructions that define and use values to minimize the
coloring count for the resulting live-range interference graph is an
NP-complete problem. So it goes, too often, in compiler engineering!&lt;&#x2F;p&gt;
&lt;p&gt;Despite the complexities that may arise in combining many heuristics,
these three dimensions -- LICM, rematerialization, and code scheduling
for register pressure -- are an interesting high-dimensional cost
optimization problem and one that we still haven&#x27;t fully solved (see
e.g. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;6159&quot;&gt;#6159&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;6260&quot;&gt;#6260&lt;&#x2F;a&gt; and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;8959&quot;&gt;#8959&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;optimizing-pure-expression-nodes-rewrite-framework&quot;&gt;Optimizing Pure Expression Nodes: Rewrite Framework&lt;&#x2F;h3&gt;
&lt;p&gt;We&#x27;ve covered the transitions into and out of the
sea-of-nodes-with-CFG program representation. We&#x27;ve seen how merely
this translation gives us GVN (deduplication), DCE, LICM, and
rematerialization &quot;for free&quot; (not really free, but falling out as a
natural consequence of the algorithms). But we still haven&#x27;t covered
one of the most classical sets of optimizations: algebraic (and other)
rewrites from one expression to another equivalent one (e.g., &lt;code&gt;x+0&lt;&#x2F;code&gt; to
&lt;code&gt;x&lt;&#x2F;code&gt;). How can we do this on the sea-of-nodes?&lt;&#x2F;p&gt;
&lt;p&gt;In principle, the answer is as &quot;simple&quot; as: build the logic that
pattern-matches the &quot;left-hand side&quot; of a rewrite (the part that we
have a &quot;better&quot; equivalent expression for), and then replaces it with
the &quot;right-hand side&quot;. That is, in &lt;code&gt;x + 0 -&amp;gt; x&lt;&#x2F;code&gt;, the left-hand side is
&lt;code&gt;x + 0&lt;&#x2F;code&gt; and the right-hand side is &lt;code&gt;x&lt;&#x2F;code&gt;. Such a framework is highly
amenable to a &lt;em&gt;domain-specific language&lt;&#x2F;em&gt; to express these rewrites:
ideally one doesn&#x27;t want to write code that manually iterates through
nodes to find these patterns. Fortunately for us, in the Cranelift
project we have the &lt;em&gt;ISLE&lt;&#x2F;em&gt; (instruction-selection and
lowering-expressions) DSL
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;27&quot;&gt;RFC&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;isle&#x2F;docs&#x2F;language-reference.md&quot;&gt;language
reference&lt;&#x2F;a&gt;,
&lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;01&#x2F;20&#x2F;cranelift-isle&#x2F;&quot;&gt;blog post&lt;&#x2F;a&gt;). I originally designed
ISLE in the context of instruction lowering, as the name implies, but
I was careful to keep a separation between the core language and its
&quot;prelude&quot; binding it to a particular environment. Hence we could adapt
it fairly easily to rewrite a graph of Cranelift IR operators as
well. The idea is that, as in instruction lowering, for mid-end
optimizations we invoke an ISLE constructor (entry point) on a
&lt;em&gt;particular&lt;&#x2F;em&gt; node and the ruleset produces a possibly better node.&lt;&#x2F;p&gt;
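&lt;p&gt;For a flavor of what such a rule looks like -- the syntax and extractor
names below are approximate, not copied verbatim from Cranelift&#x27;s actual
ruleset -- an &lt;code&gt;x + 0 -&amp;gt; x&lt;&#x2F;code&gt; rewrite might be written as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; Illustrative ISLE-style rule (approximate syntax): rewrite an&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; integer add of x and the constant 0 to just x.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(rule (simplify (iadd ty x (iconst ty (u64_from_imm64 0))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      x)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;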
&lt;p&gt;That gives us the logic for one expression, but there is still an open
question of how to apply these rewrites: to which nodes, in what order,
and how to manage or update any uses of a node when that node is
rewritten.&lt;&#x2F;p&gt;
&lt;p&gt;The two general design axes one might consider are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Eager or deferred: do we apply rewrites to a node as soon as it
exists, or apply them later (perhaps as some sort of batch-rewrite)?&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Single-rewrite or fixpoint loop: do we rewrite a node only once, or
apply rewrite rules again to the result of a rewrite? Also, if the
operand of a node is rewritten, do we (and how do we) rewrite users
of that node as well, since more tree-matching patterns may now
apply to the new subtree?&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;It is clear that different answers to these questions could lead to
different efficiency-quality tradeoffs: most obviously, applying
rewrites in a fixpoint should produce better code at the cost of
longer compile time. But also, it seems possible that either eager or
deferred rewrite processing could win, depending on the workload and
particular rules: batching (hence, deferred until one bulk pass) often
leads to efficiency advantages (see the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;3434304&quot;&gt;egg
paper&lt;&#x2F;a&gt; and discussion below!),
but also, deferral may require additional bookkeeping vs. eagerly
rewriting before making use of the (soon to be stale) original value.&lt;&#x2F;p&gt;
&lt;p&gt;For the overall design that we have described so far, there turns out
to be a fairly clear optimal answer, surprisingly: because we build an
acyclic sea-of-nodes, as long as we keep it acyclic during rewrites,
we should be able to do a single rewrite pass rather than a
fixpoint. And, to make that single pass work, we rewrite eagerly, as
soon as we create a node; then use the final rewritten version of that
node for any uses of the original value. Because we visit defs before
uses and do rewrites immediately at the def, we never need to update
(and re-canonicalize!) nodes after creation.&lt;&#x2F;p&gt;
&lt;p&gt;An aside is in order: while it is fairly clear why the
sea-of-nodes-with-CFG is initially acyclic -- because SSA permits
dataflow cycles only through block-parameters &#x2F; phi-nodes, and those
remain in the CFG, which we don&#x27;t &quot;look through&quot; when applying
rewrites -- it is less clear why rewrites should &lt;em&gt;maintain&lt;&#x2F;em&gt;
acyclicity, especially in the face of hashconsing, which may &quot;tie the
knot&quot; of a cycle if we&#x27;re not careful. The answer lies in the previous
paragraph: once we create a node, we never update it. That&#x27;s it! We&#x27;ve
now maintained acyclicity, by construction.&lt;&#x2F;p&gt;
&lt;p&gt;Perhaps surprisingly as well, this rewrite process can be &lt;em&gt;fused&lt;&#x2F;em&gt; with
the translation pass into the sea-of-nodes itself. So we can amend the
above &lt;code&gt;canonicalize&lt;&#x2F;code&gt; to&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def canonicalize(basic_block):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for inst in basic_block:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if is_pure(inst):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      if inst in hashcons_map:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        inst = rewrite(inst)          # NEW&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        nodes.push(inst)              # add to the sea-of-nodes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        hashcons_map[inst] = inst.value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;i.e., simply add the rewrite rule application at the place we create
nodes, and hashcons based on the final version of the instruction.&lt;&#x2F;p&gt;
&lt;p&gt;Now, note that this is not quite complete yet: &lt;code&gt;inst = rewrite(inst)&lt;&#x2F;code&gt;
is doing some heavy lifting, and is actually a bit too simplistic, in
the sense that it implies that a rewrite rule can only ever rewrite
to &lt;em&gt;one&lt;&#x2F;em&gt; instruction on the right-hand side. This isn&#x27;t quite right:
for example, one may want a DeMorgan rewrite rule &lt;code&gt;~(x &amp;amp; y) -&amp;gt; ~x | ~y&lt;&#x2F;code&gt;. The right-hand side includes three operator nodes (instructions):
two bitwise-NOTs and the OR that uses them. What if &lt;code&gt;x&lt;&#x2F;code&gt; or &lt;code&gt;y&lt;&#x2F;code&gt; in this
pattern also match a subexpression that can be simplified with some
logic rule?&lt;&#x2F;p&gt;
&lt;p&gt;There seem to be two general answers: create the original right-hand
side nodes un-rewritten and later apply rewrites, or immediately and
&lt;em&gt;recursively&lt;&#x2F;em&gt; rewrite. As we observed above, deferral requires
additional bookkeeping and re-canonicalization as a node&#x27;s inputs
change, so we choose the recursive approach. So, concretely, given
&lt;code&gt;~((a &amp;amp; b) &amp;amp; (c &amp;amp; d))&lt;&#x2F;code&gt; and the one rewrite rule above, we would:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Encounter the top-level &lt;code&gt;~&lt;&#x2F;code&gt;, and try to match the rewrite rule&#x27;s
left-hand side. It would match with bindings &lt;code&gt;x = (a &amp;amp; b)&lt;&#x2F;code&gt; and &lt;code&gt;y = (c &amp;amp; d)&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Apply the right-hand side &lt;code&gt;~x | ~y&lt;&#x2F;code&gt; bottom-up, building nodes and rewriting
them as we go:
&lt;ul&gt;
&lt;li&gt;First, &lt;code&gt;~x&lt;&#x2F;code&gt;. This creates &lt;code&gt;~(a &amp;amp; b)&lt;&#x2F;code&gt;, which recursively fires the
rule, which results in &lt;code&gt;(~a | ~b)&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Then, &lt;code&gt;~y&lt;&#x2F;code&gt;. This creates &lt;code&gt;~(c &amp;amp; d)&lt;&#x2F;code&gt;, again recursively firing the
rule, which results in &lt;code&gt;(~c | ~d)&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;We then create the top-level node on the right-hand side,
resulting in &lt;code&gt;(~a | ~b) | (~c | ~d)&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;One needs to limit the recursion if there is any concern that rule
chain depths may not be statically bounded or easily analyzable, but
otherwise this yields the correct answer in a single pass without the
need to track users of a node to later rewrite and recanonicalize it.&lt;&#x2F;p&gt;
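&lt;p&gt;Concretely, the node-creation path might look like the following sketch
(names are illustrative; the fuel check bounds chained rule applications):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;MAX_REWRITE_DEPTH = 5  # fuel: bound on recursive rule chaining&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def create_node(op, args, depth):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  node = make_node(op, args)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if node in hashcons_map:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return hashcons_map[node]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if depth != MAX_REWRITE_DEPTH:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    node = rewrite(node, depth + 1)  # may itself call create_node&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  nodes.push(node)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  hashcons_map[node] = node.value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return node.value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;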
&lt;p&gt;And that&#x27;s the whole pipeline: we now have a way to optimize code by
translating to sea-of-nodes-with-CFG, applying rewrites as we go, then
translating back to classical SSA CFG. In the process we&#x27;ve achieved
all the goals we set out with: GVN, LICM, DCE, rematerialization, and
algebraic rewrites.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;e-graphs-representing-many-possible-rewrites&quot;&gt;E-graphs: Representing Many Possible Rewrites&lt;&#x2F;h2&gt;
&lt;p&gt;So far, we&#x27;ve described a system that has &lt;em&gt;zero or one&lt;&#x2F;em&gt; deterministic
rewrite for any given node; this is analogous to a classical compiler
pipeline that destructively updates instructions&#x2F;operators. This is
great for rewrite rules like &lt;code&gt;x+0 -&amp;gt; x&lt;&#x2F;code&gt;: the right-hand side is
unambiguously better if it is &quot;smaller&quot; (rewrites a whole expression
into only one of its parts). This is also fine when instructions have
clear and very distinct costs, such as integer divide (typically tens
of cycles or more even on modern CPUs) by a constant &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;opts&#x2F;div_const.rs&quot;&gt;converted into
magic wrapping
multiplies&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;But what about cases where the benefit of a rewrite is less clear, or
depends on context, or depends on how it may or may not be able to
compose with or enable other rewrites in a given program?&lt;&#x2F;p&gt;
&lt;p&gt;For example, consider the classical example from the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;pdf&#x2F;10.1145&#x2F;3434304&quot;&gt;2021 paper on
egg, an e-graph
framework&lt;&#x2F;a&gt;: if we have the
expression &lt;code&gt;(x * 2) &#x2F; 2&lt;&#x2F;code&gt; in our program, we would expect that to
simplify to &lt;code&gt;x&lt;&#x2F;code&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;. To implement this simplification, we might have a
general rewrite rule &lt;code&gt;(x * k) &#x2F; k -&amp;gt; x&lt;&#x2F;code&gt;. But we might also,
separately, have a rewrite rule that &lt;code&gt;(x * 2^k) -&amp;gt; (x &amp;lt;&amp;lt; k)&lt;&#x2F;code&gt;, i.e.,
convert a multiplication into a left-shift operation. If we performed
this latter rewrite eagerly, the former rewrite rule might never match.&lt;&#x2F;p&gt;
&lt;p&gt;(Now, you might complain that we could &lt;em&gt;also&lt;&#x2F;em&gt; convert the divide into
a right-shift, then we have another rewrite rule that simplifies &lt;code&gt;(x &amp;lt;&amp;lt; k) &amp;gt;&amp;gt; k -&amp;gt; x&lt;&#x2F;code&gt;. In this particular example, that might be
reasonable. But (i) that required careful thinking about canonical
forms, where multiplies&#x2F;divides by powers-of-2 are always
canonicalized down to shifts, and (ii) this same fortunate behavior
might not exist for all rulesets.)&lt;&#x2F;p&gt;
&lt;p&gt;In general, we also have a question at the rule-application level: if
multiple rules apply, which do we take? In the above example, we would
need some prioritization scheme to (say) apply
strength-reduction rules to convert to shifts before we examine
divide-of-multiply. That&#x27;s an extra layer of heuristic engineering
that must be considered when designing the optimizer.&lt;&#x2F;p&gt;
&lt;p&gt;Onto the scene, then, comes a new data structure: the &lt;em&gt;e-graph&lt;&#x2F;em&gt;, or
&lt;em&gt;equivalence graph&lt;&#x2F;em&gt;, which is a kind of sea-of-nodes
program&#x2F;expression representation that can represent &lt;em&gt;many different
equivalent forms of a program at once&lt;&#x2F;em&gt;. The key idea is that, rather
than have a single expression node as a referent for any value, we
have an e-class (equivalence class) that contains many e-nodes, and we
can pick any of these e-nodes to compute the value.&lt;&#x2F;p&gt;
&lt;p&gt;The idea is a sort of &lt;em&gt;principled&lt;&#x2F;em&gt; approach to the optimization
problem: let&#x27;s model the state space explicitly, and then pick the
best result objectively. Typically one uses the result of an e-graph
by &quot;extracting&quot; one possible representation of the program according
to a cost metric. (More on this below, but a simple cost metric could
be a static number per operator kind, plus cost of inputs.)&lt;&#x2F;p&gt;
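&lt;p&gt;For instance, extraction with that simple cost metric can be sketched
as a memoized minimum over each e-class (assuming an acyclic e-graph so
the recursion terminates):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def best_cost(eclass):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  # memoized: pick the cheapest e-node in each e-class&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if eclass not in best:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    best[eclass] = min(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        op_cost(n.op) + sum(best_cost(arg) for arg in n.args)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        for n in eclass.nodes)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return best[eclass]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;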
&lt;p&gt;The magic of e-graphs is how they can &lt;em&gt;compress&lt;&#x2F;em&gt; a very large
combinatorial space of equivalent programs into a small data
structure. A detailed exploration of how this works is beyond the
scope of this blog post (please read the egg paper: it&#x27;s very good!)
but a very short intuitive summary might be something like:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Ensuring that all value uses point to an &lt;em&gt;e-class&lt;&#x2F;em&gt; rather than a
particular node will propagate knowledge of equivalences to
maximally many places. That is, if we know that &lt;code&gt;op1 v1, v2&lt;&#x2F;code&gt; is
equivalent to &lt;code&gt;op2 v3, v4&lt;&#x2F;code&gt;, all users of the &lt;code&gt;op1 v1, v2&lt;&#x2F;code&gt; expression
should automatically get the knowledge propagated that they can use
any form. This knowledge propagation is the essence of &quot;equality
saturation&quot; that e-graphs enable.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;A strong regime of canonicalization and &quot;re-interning&quot;
(re-hashconsing), which the egg paper calls &quot;rebuilding&quot;, ensures
that such information is maximally propagated. Basically, when we
discover that the &lt;code&gt;op1&lt;&#x2F;code&gt; and &lt;code&gt;op2&lt;&#x2F;code&gt; expressions above are equivalent,
we re-process all users of both &lt;code&gt;op1&lt;&#x2F;code&gt; and &lt;code&gt;op2&lt;&#x2F;code&gt;, looking for more
follow-on consequences. Merging those two might in turn cause other
expressions to be equivalent or other rewrite rules to fire.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;practical-efficiency-of-classical-e-graphs&quot;&gt;Practical Efficiency of Classical E-graphs&lt;&#x2F;h3&gt;
&lt;p&gt;The two problems that arise with a &quot;classical e-graph&quot; (by which I
include the 2021 egg paper&#x27;s batched-rebuilding formulation) are
&lt;em&gt;blowup&lt;&#x2F;em&gt; -- that is, too many rewrite rules apply and the e-graph
becomes too large -- and &lt;em&gt;data-structure inefficiency&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The blowup problem is easier to understand: if we allow for
representing many different forms of the program, maybe we will
represent too many, and run out of memory and processing time. It is
often hard to control how rules will compose and lead to blowup, as
well: each rewrite rule may seem reasonable in isolation, but the
transitive closure of all possible programs under a well-developed set
of equivalences can be massive. So practical applications of e-graphs
usually need some kind of meta&#x2F;strategy driver layer that uses &quot;fuel&quot;
to bound effort, and&#x2F;or selectively applies rewrites where they are
likely to lead to better outcomes. Even then, this operating regime
often has compile-times measured in seconds or worse. This may be
appropriate for certain kinds of optimization problems where
compilation happens once or rarely and the quality of the outcome is
extremely important (e.g., hardware design), but not for a fast
compiler like Cranelift.&lt;&#x2F;p&gt;
&lt;p&gt;We can protect against such outcomes with careful heuristics, though,
and the possibility of allowing for objective choice of the best
possible expression is still very tempting. So in my initial
experiments, I applied the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;egg&quot;&gt;egg crate&lt;&#x2F;a&gt;
to the problem and eventually, with custom tweaks, managed to get
e-graph roundtripping to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;27#issuecomment-1176689988&quot;&gt;23%
overhead&lt;&#x2F;a&gt;
-- with no rewrites applied. That&#x27;s not bad at first glance but it
proposes to replace an optimization pipeline that itself takes only
10% of compile-time, and we haven&#x27;t yet added the rewrites to the
23%. (And the 23% came after a good amount of data-structure
engineering to reduce storage; the initial overhead was over 2x.)&lt;&#x2F;p&gt;
&lt;p&gt;In profiling the optimizer&#x27;s execution, the overheads were occurring
more or less in &lt;em&gt;building the e-graph itself&lt;&#x2F;em&gt; (that is, cache misses
throughout the code transcribing IR to the e-graph). And what does the
e-graph contain? Per e-class, it contains a &quot;parent pointer&quot; list: we
need to track users of every e-class so that we can re-canonicalize
them during the &quot;rebuild&quot; step when e-classes are merged (a new
equivalence is discovered). And, even more fundamentally, it stores
e-nodes separately from e-classes, which is an essential element of
the idea but means that we have (at least) two different entities for
each value, even when most e-classes have only one e-node.&lt;&#x2F;p&gt;
&lt;p&gt;Is there any way to simplify the data structures so that we don&#x27;t have
to store so many different bits for one value?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;insight-1-implicit-e-graph-in-the-ssa-ir&quot;&gt;Insight #1: Implicit E-graph in the SSA IR&lt;&#x2F;h3&gt;
&lt;p&gt;The first major insight that enabled efficient implementation of an
e-graph in Cranelift was that we could &lt;em&gt;redefine the existing IR into
an implicit e-graph&lt;&#x2F;em&gt;, without copying over the whole function body
into an e-graph and back, thus avoiding the compile-time penalty of
this data movement. (Data movement can be very expensive when the main
loops of a program are otherwise fairly optimized! It is best to keep
and operate on data in-place whenever possible.)&lt;&#x2F;p&gt;
&lt;p&gt;We start with a sea-of-nodes-with-CFG, where we have an IR with SSA
values not placed in basic blocks. We can already build this
&quot;in-place&quot; in Cranelift&#x27;s IR, CLIF, by removing existing SSA
definitions from the CFG but keeping their data in the data-flow graph
(DFG) data structures.&lt;&#x2F;p&gt;
&lt;p&gt;Then, to allow for multi-representation in an e-graph, the idea is to
discard the separation between e-classes and e-nodes, and instead
define a new kind of IR node that is a &lt;em&gt;union&lt;&#x2F;em&gt; node. Rather than two
index spaces, for e-nodes and e-classes, we have only one index space,
the SSA value space. An SSA value is either an ordinary operator
result or a block parameter (as before), or a &lt;em&gt;union&lt;&#x2F;em&gt; of two other SSA
values. Any arbitrary e-class can then be represented via a binary
tree of union nodes. We don&#x27;t need to change anything about operator
arguments to make use of this representation: operators already refer
to value numbers, and an e-class of multiple e-nodes (defined by the
&quot;top&quot; union node in its union tree) already has a value number.&lt;&#x2F;p&gt;
&lt;p&gt;The coolest thing about this representation is: once we have a
sea-of-nodes, it is &lt;em&gt;already implicitly an e-graph&lt;&#x2F;em&gt;, with &quot;trivial&quot;
(one-member) e-classes for each e-node. Thus, the lift from
sea-of-nodes to e-graph is a no-op -- the best (and cheapest) kind of
compile-time pass. We only pay for multi-representation when we use
that capability, creating union nodes as needed.&lt;&#x2F;p&gt;
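&lt;p&gt;As a sketch (with hypothetical constructor names; the actual layout of
Cranelift&#x27;s DFG differs in detail), the single value space might look
like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# One index space: every Value is an SSA value number, defined by&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# exactly one of:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#   Result(inst, n)      -- the n&#x27;th result of an instruction&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#   BlockParam(block, n) -- as before&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#   Union(a, b)          -- &quot;equivalent to both a and b&quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# An eclass with members {x, y, z} is just a tree of unions:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v_xy  = union x, y&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v_xyz = union v_xy, z&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# and all subsequent uses refer to v_xyz, the &quot;top&quot; of the tree.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;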
&lt;h3 id=&quot;insight-2-acyclicity-with-eager-rewrites&quot;&gt;Insight #2: Acyclicity with Eager Rewrites&lt;&#x2F;h3&gt;
&lt;p&gt;The other aspect of the classical e-graph data structure&#x27;s cost has to
do with its need to &lt;em&gt;rebuild&lt;&#x2F;em&gt;, and in order to do so, to track all
uses of an e-class (its &quot;parents&quot; in egg&#x27;s terminology). Cranelift
does not keep bidirectional use-def links, and the binary tree of
union nodes would make them still more complex to track.&lt;&#x2F;p&gt;
&lt;p&gt;In trying to address this cost, I started with a somewhat radical
question: what would happen if we &lt;em&gt;never&lt;&#x2F;em&gt; rebuilt (to propagate
equalities)? How much &quot;optimization goodness&quot; would that give up?&lt;&#x2F;p&gt;
&lt;p&gt;If one (i) builds an e-graph then (ii) applies rewrite rules to find
better versions of nodes, adding to e-classes, then the answer is that
this would hardly work at all: this would mean that all users of a
value would see only its initial form and never its rewrites. The
rewritten forms would float in the sea-of-nodes, and union-nodes
joining them to the original forms would exist, but no users would
actually refer to those union nodes.&lt;&#x2F;p&gt;
&lt;p&gt;Instead, what is needed is to apply rewrites &lt;em&gt;eagerly&lt;&#x2F;em&gt;. When we create
a new node in the sea-of-nodes, we apply all rewrites immediately,
then join those rewrites with the original form with union nodes. The
&quot;top&quot; of that union tree is then the value number used as the
&quot;optimized form&quot; of that original value, referenced by all subsequent
uses.&lt;&#x2F;p&gt;
&lt;p&gt;The union-node representation plays a key part in this story: it acts
as an &lt;em&gt;immutable data structure&lt;&#x2F;em&gt; in a sense, where we always append
new knowledge and union it with existing values, and refer to that
&quot;newer version&quot; of an e-class; but we never go back and update
existing references.&lt;&#x2F;p&gt;
&lt;p&gt;This has a very nice implication for the graph structure of the sea of
nodes as well: it preserves acyclicity! Classical e-graphs, in their
rebuild step, can create cycles even when the input is acyclic because
they can condense nodes arbitrarily. But when we eagerly rewrite, then
freeze, we can never &quot;tie the knot&quot; and create a cycle.&lt;&#x2F;p&gt;
&lt;p&gt;This acyclicity is important because it permits a &lt;em&gt;single pass&lt;&#x2F;em&gt; for
the rewrites. In fact, taking our sea-of-nodes build algorithm above
as a baseline, we can add eager rewriting as a very small change: when
we apply rewrites, we build a &quot;union-node spine&quot; to join all rewritten
forms, rather than destructively taking only the rewritten form.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def canonicalize_and_rewrite(basic_block):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for inst in basic_block:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if is_pure(inst):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      if inst in hashcons_map:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        optimized = rewrites(inst)                     # NEW&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        union = join_with_union_nodes(inst, optimized) # NEW&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        optimized_form[inst.def] = union&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;All of these aspects work together and cannot really be separated:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Union nodes allow for a cheap, pay-as-you-go representation for
e-classes, without a two-level data structure (nodes and classes)
and without parent pointers.&lt;&#x2F;li&gt;
&lt;li&gt;Eager rewriting, applied as we build the e-graph (sea of nodes),
allows for a single-pass algorithm and ensures all members of the
e-class are present before it is &quot;sealed&quot; by union nodes and
referenced by uses.&lt;&#x2F;li&gt;
&lt;li&gt;Acyclicity, present in the input (because of SSA), is preserved by
the append-only, immutable nature of union nodes, and permits eager
rewriting to work in a single pass.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Note that here we are glossing over &lt;em&gt;recursive&lt;&#x2F;em&gt; rewrites. Due to space
constraints I will only outline the problem and solution briefly: the
right-hand side of a rewrite rule application (&lt;code&gt;rewrites&lt;&#x2F;code&gt; above) will
produce nodes that themselves may be able to trigger further
rewrites. Rather than leave this to another iteration of a rewrite
loop, as a classical e-graph driver might do, we want to eagerly
rewrite this right-hand side as well before establishing any uses of
it. So we recursively re-invoke &lt;code&gt;rewrites&lt;&#x2F;code&gt;; this also occurs within the
right-hand side of rules as &lt;em&gt;pieces&lt;&#x2F;em&gt; of the final expression are
created. This recursion is tightly bounded (in levels and in
total rewrite invocations per top-level loop iteration) to prevent
blowup.&lt;&#x2F;p&gt;
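&lt;p&gt;A sketch of the bounded recursion (the limit names here are
illustrative, not Cranelift&#x27;s actual constants):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def rewrites(inst, depth=0, fuel=[MAX_REWRITES]):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if depth &amp;gt;= MAX_DEPTH or fuel[0] == 0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return []                  # out of fuel: keep only the original form&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  results = []&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for rhs in matching_rules(inst):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fuel[0] -= 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    results.append(rhs)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # eagerly rewrite the right-hand side itself, before any&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # use of it is established&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    results.extend(rewrites(rhs, depth + 1, fuel))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if fuel[0] == 0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      break&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return results&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;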
&lt;p&gt;Finally, we are also glossing over details of how we apply our
pattern-matching&#x2F;rewrite DSL, ISLE, to the rewrite problem when
&lt;em&gt;multiple&lt;&#x2F;em&gt; rewrites are now permitted. In brief, we extended the
language to permit &quot;multi-extractors&quot; and &quot;multi-constructors&quot;: rather
than matching only one rule, and disambiguating by priority, we take
all applicable rules. The
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;27&quot;&gt;RFC&lt;&#x2F;a&gt; has more
details.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-extraction-problem&quot;&gt;The Extraction Problem&lt;&#x2F;h3&gt;
&lt;p&gt;So we now have a way to represent &lt;em&gt;multiple&lt;&#x2F;em&gt; expressions as
alternatives to compute the same value. How do we compile this
program? It surely wouldn&#x27;t make sense to compile &lt;em&gt;all&lt;&#x2F;em&gt; of these
expressions: they produce the same bits, so we only need one. Which
one do we pick?&lt;&#x2F;p&gt;
&lt;p&gt;This is the &lt;em&gt;extraction problem&lt;&#x2F;em&gt;, and it is both easy to state and
deceptively hard (in fact, NP-hard): choose the &lt;em&gt;easiest&lt;&#x2F;em&gt; (cheapest)
expression to compute any given value.&lt;&#x2F;p&gt;
&lt;p&gt;Why is this &lt;em&gt;hard&lt;&#x2F;em&gt;? First, let&#x27;s construct the case where it&#x27;s easy.
Let&#x27;s say that we have one root expression (say, returned from a
function) with all pure operators. This forms a tree of choices: each
eclass lets us choose one enode to compute it, and that enode has
arguments that themselves refer to eclasses with choices.&lt;&#x2F;p&gt;
&lt;p&gt;Given this &lt;em&gt;tree&lt;&#x2F;em&gt; of choices, with every choice independent, we can
pick the best choice for each subtree, and compute the cost of any
given expression node as best-cost-of-args plus that own node&#x27;s cost
to compute. In more formal algorithmic terms, that is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Optimal_substructure&quot;&gt;&lt;em&gt;optimal
substructure&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, as soon as we permit &lt;em&gt;references to shared nodes&lt;&#x2F;em&gt; (a
DAG rather than a tree), this nice structure evaporates. To see why,
consider: we could have two eclasses we wish to compute&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v0 = union v10, v11&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v1 = union v10, v12&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;with computations (not shown) &lt;code&gt;v10&lt;&#x2F;code&gt; that costs 10 units to compute,
and &lt;code&gt;v11&lt;&#x2F;code&gt; and &lt;code&gt;v12&lt;&#x2F;code&gt; that each cost 7 units to compute. The optimal
choice at each subproblem is to choose the cheaper computation (&lt;code&gt;v11&lt;&#x2F;code&gt;
or &lt;code&gt;v12&lt;&#x2F;code&gt;), but the program would actually be more globally optimal if
we computed only &lt;code&gt;v10&lt;&#x2F;code&gt; (cost of 10 total). A solver that tries to
recognize this would either process each root (&lt;code&gt;v0&lt;&#x2F;code&gt; and &lt;code&gt;v1&lt;&#x2F;code&gt;) one at a
time and &quot;backtrack&quot; at some point once it sees the additional use, or
somehow build a shared representation of the problem, which is no
longer deconstructed in a way that permits sub-problem solutions to
compose.&lt;&#x2F;p&gt;
&lt;p&gt;In fact, the extraction problem is &lt;em&gt;NP-hard&lt;&#x2F;em&gt;. To see why, I will show
a simple linear-time reduction (mapping) from a known NP-hard problem,
weighted set-cover, to eclass extraction.&lt;&#x2F;p&gt;
&lt;p&gt;Take each weighted set &lt;code&gt;S_j&lt;&#x2F;code&gt; with weight &lt;code&gt;w_j&lt;&#x2F;code&gt; and elements &lt;code&gt;S_j = { x_1, x_2, ... }&lt;&#x2F;code&gt;. Add an enode &lt;code&gt;N_j&lt;&#x2F;code&gt;, with self-cost
(not including args) &lt;code&gt;w_j&lt;&#x2F;code&gt;, and no arguments. Then for each element
&lt;code&gt;x_i&lt;&#x2F;code&gt; in the universe (the union of all sets&#x27; elements), define an
eclass &lt;code&gt;C_i&lt;&#x2F;code&gt;. Then for
each set-element edge (for each &lt;code&gt;i&lt;&#x2F;code&gt;, &lt;code&gt;j&lt;&#x2F;code&gt; such that &lt;code&gt;x_i ∈ S_j&lt;&#x2F;code&gt;), add
an enode to &lt;code&gt;C_i&lt;&#x2F;code&gt; with opaque zero-cost operator &lt;code&gt;SetElt_ij(y)&lt;&#x2F;code&gt; where
&lt;code&gt;y&lt;&#x2F;code&gt; is the singleton eclass containing the set node &lt;code&gt;N_j&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Performing an optimal (lowest-cost) extraction, with all eclasses
&lt;code&gt;C_i&lt;&#x2F;code&gt; taken as roots, will compute the lowest-weight set cover: the
choice of enode in each eclass &lt;code&gt;C_i&lt;&#x2F;code&gt; encodes which set we choose to
cover element &lt;code&gt;x_i&lt;&#x2F;code&gt;. Thus, because egraph extraction with shared
structure can compute the solution to an NP-hard problem (weighted set
cover), egraph extraction with shared structure is NP-hard.&lt;&#x2F;p&gt;
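&lt;p&gt;As a sketch, the reduction can be written out directly (hypothetical
&lt;code&gt;enode&lt;&#x2F;code&gt; and &lt;code&gt;eclass&lt;&#x2F;code&gt; constructors; any egraph with
explicit per-node costs would do):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def reduce_set_cover(sets):   # sets: list of (weight, elements) pairs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  # one zero-argument enode per set, with the set&#x27;s weight as its cost&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  set_node = [enode(op=f&quot;Set_{j}&quot;, args=[], cost=w)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              for j, (w, _) in enumerate(sets)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  universe = set().union(*(elems for _, elems in sets))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  roots = []&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for x in universe:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # one zero-cost member per set containing x; choosing it during&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # extraction means &quot;cover x with set j&quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    members = [enode(op=f&quot;SetElt_{x}_{j}&quot;, args=[set_node[j]], cost=0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               for j, (_, elems) in enumerate(sets) if x in elems]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    roots.append(eclass(members))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return roots   # extracting all roots yields a lowest-weight cover&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;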
&lt;p&gt;OK, but we want a fast compiler. What do we do?&lt;&#x2F;p&gt;
&lt;p&gt;The classical compiler-literature answer to this problem -- seen over
and over in a 50-year history -- is &quot;solve a simpler approximation
problem&quot;. Register allocation, for example, is filled with simplified
problem models (linear scan, no live-range splitting, ...) that
reduce the decision space and allow for a simpler algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;In our case, we solve the extraction problem with a simplifying
choice: we will not try to account for shared substructure and the way
that it complicates accounting of cost. In other words, we&#x27;ll ignore
shared substructure, pretending that each use of a subtree counts that
subtree&#x27;s cost anew. For each enode, having computed the cost of each
of its arguments, we can compute its own cost easily as the sum of its
arguments&#x27; costs plus its own computation cost; and for each eclass, we can
pick the minimum-cost enode. That&#x27;s it!&lt;&#x2F;p&gt;
&lt;p&gt;We implement this with a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Dynamic_programming&quot;&gt;dynamic
programming&lt;&#x2F;a&gt;
algorithm: we do a toposort of the aegraph (which can always be done,
because it&#x27;s acyclic), then process nodes from leaves upward,
accumulating cost and picking minima at each subproblem. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;dd2dd8d9f0a0a06c34e364716d58acf67236ba6a&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;egraph&#x2F;elaborate.rs#L307-L391&quot;&gt;This is a
single
pass&lt;&#x2F;a&gt;
and is a relatively fast and straightforward algorithm.&lt;&#x2F;p&gt;
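&lt;p&gt;In pseudocode, the extraction pass looks something like this (a sketch
of the idea, not the actual Cranelift code linked above):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def extract(aegraph):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  best = {}                     # value -&amp;gt; (cost, chosen enode)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for v in toposort(aegraph):   # leaves first; exists because acyclic&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if is_union(v):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      # an eclass&#x27;s cost is the minimum over its members&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      best[v] = min(best[v.left], best[v.right], key=lambda c: c[0])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      # ignore sharing: each use of an argument pays its full cost&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      cost = op_cost(v) + sum(best[arg][0] for arg in v.args)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      best[v] = (cost, v)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return best&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;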
&lt;p&gt;After the Dagstuhl seminar in January, I had an ongoing discussion
with collaborators Alexa VanHattum and Nick Fitzgerald about whether
we could do better here. Alexa and Nick both prototyped a bunch of
interesting alternatives: dynamically updating (shortcutting to zero)
costs when subtrees become used (&quot;sunk-cost&quot; accounting), computing
costs by doing full top-down traversals rather than bottom-up dynamic
programming (and then mixing in memoization somehow), trying to
account for sharing by doing DP but &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;12230&quot;&gt;tracking the full set of covered
leaves&lt;&#x2F;a&gt;, and
some other things. This was an interesting exploration but in the end
we didn&#x27;t find anything that looked better in the compile-time &#x2F;
execution-time tradeoff space. We have an &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;12156&quot;&gt;issue tracking
this&lt;&#x2F;a&gt; and
more ideas are always welcome, of course.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;other-aspects&quot;&gt;Other Aspects&lt;&#x2F;h3&gt;
&lt;p&gt;There are two other aspects of our aegraph implementation that I don&#x27;t
have space to go into in this post:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;There is an interesting problem that arises with respect to the
domtree and SSA invariants when different values are merged together
with a union node and some of them have wider &quot;scope&quot; than
others. For example, via store-to-load forwarding we may know that a
load instruction produces a constant &lt;code&gt;0&lt;&#x2F;code&gt;; so we might have a union
node with &lt;code&gt;iconst 0&lt;&#x2F;code&gt;. The load can only happen at its current
location, but &lt;code&gt;iconst 0&lt;&#x2F;code&gt; can be computed anywhere. A user of this
eclass should be able to pick either value (said another way:
extraction should not be load-bearing for correctness). If the user
is within the dominance subtree under the load, then all is fine,
but if not, e.g. if some other user of &lt;code&gt;iconst 0&lt;&#x2F;code&gt; elsewhere in the
function errantly happened upon the eclass-neighbor load
instruction, we might get an invalid program.&lt;&#x2F;p&gt;
&lt;p&gt;There are many ways one might be tempted to solve this, but in the
end we landed on an &quot;available block&quot; analysis that runs as we build
nodes. For every node, we record the &quot;highest&quot; block in the domtree
at which it can be computed: function entry for pure
zero-argument nodes, the current block for any impure node, and
otherwise the lowest block in the domtree among the available blocks
of all arguments. (Claim: the available blocks of all args of a node
will form an ancestor path in the domtree; one will always exist that
is dominated by all others. This follows from the properties of SSA.)
Then when we insert into the hashcons map, we insert at the level
that the final union is available.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We also have an important optimization that we call &lt;code&gt;subsume&lt;&#x2F;code&gt;. This
is an identity operator that wraps a value returned by a rewrite
rule. It is not required for correctness, but its semantics are: if
any value in an eclass is marked &quot;subsume&quot;, the subsuming values &lt;em&gt;erase&lt;&#x2F;em&gt;
all other members of the eclass. Usually, only one subsuming rule
will match (but this, also, is not necessary for correctness).&lt;&#x2F;p&gt;
&lt;p&gt;The usual use-case is for rules that have clear &quot;directionality&quot;: it
is always better to say &lt;code&gt;2&lt;&#x2F;code&gt; than &lt;code&gt;(iadd 1 1)&lt;&#x2F;code&gt;, so let&#x27;s go ahead and
shrink the eclass so that all further matching, and eventual
extraction, is more efficient.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
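&lt;p&gt;The available-block analysis described above amounts to something like
the following (illustrative pseudocode; the real analysis runs
incrementally as nodes are built):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def available_block(node):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if not is_pure(node):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return node.current_block   # impure: pinned to its location&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if not node.args:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return entry_block          # pure, no args: available anywhere&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  # by SSA, the args&#x27; available blocks lie on one ancestor path in the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  # domtree; the node is available at the lowest (most dominated) one&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return lowest_in_domtree(available_block(a) for a in node.args)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;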
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;So how does all of this actually work? Do aegraphs benefit Cranelift&#x27;s
strength as a compiler -- its ability to optimize code, its efficiency
in doing so quickly, or both?&lt;&#x2F;p&gt;
&lt;p&gt;This is the part where I offer a somewhat surprising conclusion: the
tl;dr of this post is that I believe the &lt;em&gt;sea-of-nodes-with-CFG&lt;&#x2F;em&gt;
aspect of this mid-end works great, but the &lt;em&gt;aegraph itself&lt;&#x2F;em&gt; -- the
ability to represent multiple options for one value -- may not (yet?)
be pulling its weight. It doesn&#x27;t really hurt much either, so maybe
it&#x27;s a reasonable capability to keep around. But in any case, it&#x27;s an
interesting conclusion and we&#x27;ll dig more into it below.&lt;&#x2F;p&gt;
&lt;p&gt;The main interesting evaluation is a two-dimensional comparison of
&lt;em&gt;compile time&lt;&#x2F;em&gt; -- that is, how long Cranelift takes to compile code --
on the X-axis, versus &lt;em&gt;execution time&lt;&#x2F;em&gt; -- that is, how long the
resulting code takes to execute -- on the Y-axis. This forms a
tradeoff space: it may be good to spend a little more time to compile
if the resulting code runs faster (or vice-versa), for example. Of
course, reducing both is best. One point may be &quot;strictly better&quot; than
another if it reduces both -- then there is no tradeoff, because one
would always choose the configuration that both compiles faster and
produces better code. (One can then find the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Pareto_front&quot;&gt;Pareto
frontier&lt;&#x2F;a&gt; of points that
form a set in which none is strictly better than another -- these are
all &quot;valid configuration points&quot; that one may rationally choose
depending on one&#x27;s goals.)&lt;&#x2F;p&gt;
&lt;p&gt;Below we have a compile-time vs. execution-time plot for a number of
configurations of Cranelift, compiling and running the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;sightglass&#x2F;&quot;&gt;Sightglass
benchmark suite&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;No optimizations enabled;&lt;&#x2F;li&gt;
&lt;li&gt;The (on by default) aegraph-based optimization pipeline, as
described in this post, with several variants (below);&lt;&#x2F;li&gt;
&lt;li&gt;A &quot;classical optimization pipeline&quot; that does &lt;em&gt;not&lt;&#x2F;em&gt; form a
sea-of-nodes-with-CFG at all; instead, it applies exactly the same
rewrite rules, but in-place, and interleaves with classical GVN and
LICM passes;&lt;&#x2F;li&gt;
&lt;li&gt;Variants of the aegraphs pipeline and classical pipeline with the
whole mid-end repeated 2 or 3 times (to test whether code continues
to get better).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Here&#x27;s the main result:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2026-04-09-scatterplot-web.svg&quot; alt=&quot;Figure: compile time vs. execution time&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A few conclusions are in order. First, the aegraph pipeline does
generate better code than the classical pipeline. This objective
result is &quot;mission accomplished&quot; with respect to the aegraph effort&#x27;s
original motivation: we wanted to allow optimization passes to
interact more finely and optimize more completely. Note in particular
that repeating the classical pipeline multiple times does &lt;em&gt;not&lt;&#x2F;em&gt; get
the same result; we could not have obtained the ~2% speedup without
building a new optimization framework.&lt;&#x2F;p&gt;
&lt;p&gt;Second, though, there is clearly a Pareto frontier that includes &quot;no
optimizations&quot; and &quot;classical pipeline&quot; as well as the aegraph
variants: each takes more compilation time than the previous. In other
words, moving from a classical compiler pipeline to the design
described here, we spend about 7-8% more compile time. Notably, this
is &lt;em&gt;not&lt;&#x2F;em&gt; the result that we had when we first built the aegraphs
implementation in 2023 and switched over -- at that time, we were more
or less at parity. This is likely a result of the growth of the body
of rewrite rules over the intervening three years.&lt;&#x2F;p&gt;
&lt;p&gt;To get a better picture of how aegraph&#x27;s various design choices
matter, let&#x27;s zoom into the area in the red ellipse above, which
contains multiple &lt;em&gt;variants&lt;&#x2F;em&gt; of the aegraphs pipeline:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&quot;aegraph&quot;: Exactly as described in this post, and default Cranelift
configuration;&lt;&#x2F;li&gt;
&lt;li&gt;&quot;no multivalue (eager pick)&quot;: sea-of-nodes-with-CFG, without union
nodes; i.e., not actually representing more than one equivalent
value in an eclass. Instead, after evaluating rewrite rules, we pick
the best option and use that one option (destructively replacing the
original);&lt;&#x2F;li&gt;
&lt;li&gt;&quot;no rematerialization&quot;: testing the effect of this aspect of the
elaboration algorithm;&lt;&#x2F;li&gt;
&lt;li&gt;&quot;no subsume&quot;: testing this efficiency tweak of the rewrite-rule
application.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Here&#x27;s the plot:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2026-04-09-scatterplot-aegraph-web.svg&quot; alt=&quot;Figure: aegraphs variants&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;One can see that there are some definite tradeoffs, &lt;em&gt;but&lt;&#x2F;em&gt; looking
closely at the axis scales, these effects are very small. In
particular, moving from sea-of-nodes-with-CFG to true aegraph (taking
all rewritten values, and picking the best in a principled way with
cost-based extraction) nets us ~0.1% execution-time improvement, at
~0.005% compile-time cost. That&#x27;s more-or-less in the noise.&lt;&#x2F;p&gt;
&lt;p&gt;Supporting that conclusion is the statistic that the average eclass
size after rewriting is 1.13 enodes: in other words, very few cases
with our ruleset and benchmark corpus actually result in more than one
option.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, the most interesting question in my view: does the &lt;em&gt;eager&lt;&#x2F;em&gt;
aspect of aegraphs -- applying rewrite rules right away, and never
going back to &quot;fill in&quot; other equivalences -- matter? In other words,
does skipping equality saturation take the egraph goodness out of an
egraph(-alike)?&lt;&#x2F;p&gt;
&lt;p&gt;We can measure this, too: I instrumented our implementation to track
when a subtree of an eclass is &lt;em&gt;not&lt;&#x2F;em&gt; chosen by extraction, and then
any node in that subtree is later actually elaborated (in other words,
when we use a suboptimal choice because we could not see an equality
in the &quot;wrong&quot; direction). This should only happen if, in theory, our
rules rewrite &lt;code&gt;f&lt;&#x2F;code&gt; to &lt;code&gt;g&lt;&#x2F;code&gt; where &lt;code&gt;cost(g) &amp;gt; cost(f)&lt;&#x2F;code&gt;, and we don&#x27;t have
a rewrite &lt;code&gt;g&lt;&#x2F;code&gt; to &lt;code&gt;f&lt;&#x2F;code&gt;: then a user of &lt;code&gt;g&lt;&#x2F;code&gt; might never directly get a
rewrite of &lt;code&gt;f&lt;&#x2F;code&gt; eagerly, but a later coincidentally-occurring &lt;code&gt;f&lt;&#x2F;code&gt; might
rewrite onto &lt;code&gt;g&lt;&#x2F;code&gt; (but we&#x27;ll never propagate that equality into the
original users of &lt;code&gt;g&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that, in all of our benchmarks, with ~4 million value
nodes created overall, this happens two (2) times. Both instances
occur in &lt;code&gt;spidermonkey.wasm&lt;&#x2F;code&gt; (a large benchmark that consists of the
SpiderMonkey JS engine, compiled to WebAssembly, then run through
Wasmtime+Cranelift), and occur due to an ireduce-of-iadd rewrite rule
that violates this move-toward-lower-cost principle (explicitly, in
the name of simplicity). Overall, we conclude that the eager rewrites
are effective &lt;em&gt;as long as&lt;&#x2F;em&gt; the ruleset is designed with &lt;em&gt;optimization&lt;&#x2F;em&gt;
(rather than mere exploration of all equivalent expressions) in mind.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;discussion&quot;&gt;Discussion&lt;&#x2F;h2&gt;
&lt;p&gt;The most surprising conclusion in all of the data was, for me, that
aegraphs (per se) -- multi-value representations -- don&#x27;t seem to
matter. What?! That was the entire point of the project, and (proper)
e-graphs have shown great promise in other application areas.&lt;&#x2F;p&gt;
&lt;p&gt;I think the main reason for this is that our workload is somewhat
&quot;small&quot; in a combinatorial possibility-space sense: we are (i)
compiling workloads that are often optimized already (as Wasm modules)
before hitting the Cranelift compilation pipeline, and (ii) applying a
set of rewrite rules that, while large and growing (hundreds of
rules), explicitly do &lt;em&gt;not&lt;&#x2F;em&gt; include identities like associativity and
commutativity, or arbitrary algebraic identities, that do not
&quot;simplify&quot; somehow. In other words, if we&#x27;re generally applying
rewrites that look more like simple, obvious &quot;cleanups&quot;, we would
expect that we don&#x27;t hold a &quot;superposition&quot; of multiple good
expression options very often.&lt;&#x2F;p&gt;
&lt;p&gt;Given that it doesn&#x27;t cost us &lt;em&gt;that&lt;&#x2F;em&gt; much compile time to keep
aegraphs around, though, maybe this is... fine? Having the
&lt;em&gt;capability&lt;&#x2F;em&gt; to do principled cost-based extraction is great, versus
having to think about whether a rewrite rule should exist. We still do
try to be careful not to introduce rules that are &lt;em&gt;never&lt;&#x2F;em&gt; productive,
of course.&lt;&#x2F;p&gt;
&lt;p&gt;And, further into the future, one could imagine that workloads with
more optimization opportunity could cause more interesting situations
to occur within the aegraph, leading to more emergent composition in
the rewrites.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;future-directions&quot;&gt;Future Directions&lt;&#x2F;h2&gt;
&lt;p&gt;There are a bunch of directions we could (and should) take this in the
future. In terms of evaluation: finding the &quot;corner of the use-case
domain&quot; where aegraphs truly shine is still an open question. More
concretely: if we evaluate Cranelift with new and different workloads,
and&#x2F;or pile on more rewrite rules, do we get to a point where the
classical benefit of &quot;multi-representation with cost-based extraction&quot;
pays off in a conventional compiler? I don&#x27;t know!&lt;&#x2F;p&gt;
&lt;p&gt;There is also still a lot of room to improve the core algorithms:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Better extraction, as mentioned above: something that accounts for
shared substructure would be great, as long as we don&#x27;t have to pay
the NP-hard cost for it. Maybe there&#x27;s a nice approximation
algorithm that&#x27;s better than our current dynamic-programming
approach.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We&#x27;d like to be able to handle more rewrites that alter the &lt;em&gt;CFG
skeleton&lt;&#x2F;em&gt; as well. Right now, we have a separate ISLE entry-point
that allows for destructive rewriting of skeleton instructions
(thanks to my colleague Nick Fitzgerald for building this!). Beyond
that, maybe we could remove redundant block parameters (phi nodes), for
example; and&#x2F;or maybe we could fold branches; and&#x2F;or maybe we could
apply path-sensitive knowledge to values when used in certain
control-flow contexts (&lt;code&gt;x=1&lt;&#x2F;code&gt; in the dominance subtree under the
&quot;true&quot; branch for &lt;code&gt;if x==1, goto ...&lt;&#x2F;code&gt;). My former colleague Jamey
Sharp wrote up a few excellent, in-depth issues on these topics in
our tracker
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;5623&quot;&gt;#5623&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;6109&quot;&gt;#6109&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;6129&quot;&gt;#6129&lt;&#x2F;a&gt;)
and I think there is a lot of potential here.&lt;&#x2F;p&gt;
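&lt;p&gt;For instance, a path-sensitive rewrite might look like the
following, sketched in CLIF-like pseudocode (not something Cranelift
does today):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;block0(v0: i32):
  v1 = icmp_imm eq v0, 1
  brif v1, block1, block2

block1:
  ;; dominated by the taken edge of the branch, so here v0 == 1,
  ;; and uses of v0 could be rewritten to iconst.i32 1, enabling
  ;; constant folding downstream:
  v2 = imul_imm v0, 8   ;; could become v2 = iconst.i32 8
  ...&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;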
&lt;p&gt;(The full version of this is, again, something like
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1912.05036&quot;&gt;RVSDG&lt;&#x2F;a&gt;: representing control
flow in the node language seems like the most principled option to
express all useful forms of control-flow rewrites. Jamey also has a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jameysharp&#x2F;optir&quot;&gt;prototype called
&lt;code&gt;optir&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; for this.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;It would be interesting to experiment with incorporating our
&lt;em&gt;lowering backend rules&lt;&#x2F;em&gt; into the aegraph somehow: they are a rich,
fruitful target-specific database of natural &quot;costs&quot; for various
operations. For example, on AArch64 we can fold shifts and extends
into (some) arithmetic operations for &quot;free&quot;; maybe this alters the
extraction choices we make. Or likewise for the various odd corners
of addressing modes on each architecture.&lt;&#x2F;p&gt;
&lt;p&gt;The simple version of this idea is to incorporate lowering rules as
rewrites, and make the egraph&#x27;s node language a union of CLIF and
the machine&#x27;s instruction set. But maybe there&#x27;s something better we
could do instead, allowing multi-extractors to see the aegraph
eclasses directly and keeping various VCode sequences. I need to
write up more of my ideas on this topic someday. Jamey also has more
thoughts on this in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;8529&quot;&gt;#8529&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
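&lt;p&gt;As a concrete (hypothetical) illustration of how target knowledge
could shift extraction choices:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;;; CLIF:
v2 = ishl_imm v1, 2
v3 = iadd v0, v2

;; The AArch64 backend can fold the shift into the add:
;;   add x3, x0, x1, lsl #2
;; so a target-aware cost model could price the ishl_imm at zero
;; when it feeds an iadd, changing which of several equivalent
;; forms the extractor should prefer.&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;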
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;I&#x27;m sure there are other things that could be done here too!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I gave a talk about aegraphs at &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;pldi23.sigplan.org&#x2F;home&#x2F;egraphs-2023&quot;&gt;EGRAPHS
2023&lt;&#x2F;a&gt;:
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;pubs&#x2F;egraphs2023_aegraphs_slides.pdf&quot;&gt;slides&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;vimeo.com&#x2F;843540328&quot;&gt;re-recorded video&lt;&#x2F;a&gt; (the original was
not recorded).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;I gave a talk about aegraphs at the January 2026 &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.dagstuhl.de&#x2F;seminars&#x2F;seminar-calendar&#x2F;seminar-details&#x2F;26022&quot;&gt;Dagstuhl e-graphs
seminar&lt;&#x2F;a&gt;;
the &lt;a href=&quot;&#x2F;assets&#x2F;cfallin-aegraphs-dagstuhl-20260108.pdf&quot;&gt;slides&lt;&#x2F;a&gt; are a
heavily updated and amended version of the 2023 talk, with the
experiments&#x2F;data I presented here.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;There is a Cranelift RFC on aegraphs
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;27&quot;&gt;here&lt;&#x2F;a&gt;, and one on
ISLE (the rewrite DSL that we use to drive rewrites in the aegraph)
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;15&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The main PR that implemented the current form of aegraphs is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5382&quot;&gt;here&lt;&#x2F;a&gt;,
co-authored by my former colleague Jamey Sharp (this production
implementation was a fantastically fun and productive
pair-programming project!).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;acknowledgments&quot;&gt;Acknowledgments&lt;&#x2F;h2&gt;
&lt;p&gt;Thanks to many folks for discussion of the ideas around aegraphs
through the years: Nick Fitzgerald, Jamey Sharp, Trevor Elliott, Max
Willsey, Alexa VanHattum, Max Bernstein, and many others at the
Dagstuhl e-graphs seminar. None of them reviewed this post (it had
been languishing for too long already and I wanted to get it out), so
any errors herein are solely my own!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;Possibly with masking of the top bit if our IR semantics have
defined wrapping&#x2F;truncation behavior: &lt;code&gt;x &amp;amp; 0x7fff..ffff&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Exceptions in Cranelift and Wasmtime</title>
        <published>2025-11-06T00:00:00+00:00</published>
        <updated>2025-11-06T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2025/11/06/exceptions/"/>
        <id>https://cfallin.org/blog/2025/11/06/exceptions/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2025/11/06/exceptions/">&lt;p&gt;&lt;em&gt;Note: this post is also cross-posted to the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;&quot;&gt;Bytecode Alliance
blog&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;wasmtime-exceptions&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This is a blog post outlining the odyssey I recently took to implement
the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;webassembly&#x2F;exception-handling&quot;&gt;Wasm exception-handling
proposal&lt;&#x2F;a&gt; in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;wasmtime.dev&#x2F;&quot;&gt;Wasmtime&lt;&#x2F;a&gt;, the open-source
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;webassembly.org&#x2F;&quot;&gt;WebAssembly&lt;&#x2F;a&gt; engine for which I&#x27;m a core
team member&#x2F;maintainer, and its &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cranelift.dev&#x2F;&quot;&gt;Cranelift&lt;&#x2F;a&gt;
compiler backend.&lt;&#x2F;p&gt;
&lt;p&gt;When first discussing this work, I made an off-the-cuff estimate in
the Wasmtime biweekly project meeting that it would be &quot;maybe two
weeks on the compiler side and a week in Wasmtime&quot;. Reader, I need to
make a confession now: I was wrong and it was &lt;em&gt;not&lt;&#x2F;em&gt; a three-week
task. This work spanned from late March to August of this year
(roughly half-time, to be fair; I wear many hats). Let that be a
lesson!&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In this post we&#x27;ll first cover what exceptions are and why some
languages want them (and what other languages do instead) -- in
particular what the big deal is about (so-called) &quot;zero-cost&quot;
exception handling. Then we&#x27;ll see how Wasm has specified a
bytecode-level foundation that serves as a least common denominator
but also has some unique properties. We&#x27;ll then take a round trip
through what it means for a &lt;em&gt;compiler&lt;&#x2F;em&gt; to support exceptions -- the
control-flow implications, how one reifies the communication with the
unwinder, how all this intersects with the ABI, etc. -- before finally
looking at how Wasmtime puts it all together (and is careful to avoid
performance pitfalls and stay true to the intended performance of the
spec).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-exceptions&quot;&gt;Why Exceptions?&lt;&#x2F;h2&gt;
&lt;p&gt;Many readers will already be familiar with exceptions as they are
present in languages as widely varied as
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Python_(programming_language)&quot;&gt;Python&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Java_(programming_language)&quot;&gt;Java&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;JavaScript&quot;&gt;JavaScript&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;C%2B%2B&quot;&gt;C++&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lisp_(programming_language)&quot;&gt;Lisp&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;OCaml&quot;&gt;OCaml&lt;&#x2F;a&gt;, and many more. But let&#x27;s
briefly review so we can (i) be precise about what we mean by an exception,
and (ii) discuss &lt;em&gt;why&lt;&#x2F;em&gt; exceptions are so popular.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Exception_handling_(programming)&quot;&gt;Exception
handling&lt;&#x2F;a&gt;
is a mechanism for &lt;em&gt;nonlocal flow control&lt;&#x2F;em&gt;. In particular, most
flow-control constructs are &lt;em&gt;intraprocedural&lt;&#x2F;em&gt; (send control to other
code in the current function) and &lt;em&gt;lexical&lt;&#x2F;em&gt; (target a location that
can be known statically). For example, &lt;code&gt;if&lt;&#x2F;code&gt; statements and &lt;code&gt;loop&lt;&#x2F;code&gt;s
both work this way: they stay within the local function, and we know
exactly where they will transfer control. In contrast, exceptions are
(or can be) &lt;em&gt;interprocedural&lt;&#x2F;em&gt; (can transfer control to some point in
some other function) and &lt;em&gt;dynamic&lt;&#x2F;em&gt; (target a location that depends on
runtime state).&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;To unpack that a bit: an exception is &lt;em&gt;thrown&lt;&#x2F;em&gt; when we want to signal
an error or some other condition that requires &quot;unwinding&quot; the current
computation, i.e., backing out of the current context; and it is
&lt;em&gt;caught&lt;&#x2F;em&gt; by a &quot;handler&quot; that is interested in the particular kind of
exception and is currently &quot;active&quot; (waiting to catch that
exception). That handler can be in the current function, or in any
function that has called it. Thus, an exception throw and catch can
result in an abnormal, early return from a function.&lt;&#x2F;p&gt;
&lt;p&gt;One can understand the need for this mechanism by considering how
programs can handle errors. In some languages, such as Rust, it is
common to see function signatures of the form &lt;code&gt;fn foo(...) -&amp;gt; Result&amp;lt;T, E&amp;gt;&lt;&#x2F;code&gt;. The
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;std&#x2F;result&#x2F;enum.Result.html&quot;&gt;&lt;code&gt;Result&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; type
indicates that &lt;code&gt;foo&lt;&#x2F;code&gt; normally returns a value of type &lt;code&gt;T&lt;&#x2F;code&gt;, but may
produce an error of type &lt;code&gt;E&lt;&#x2F;code&gt; instead. The key to making this ergonomic
is providing some way to &quot;short-circuit&quot; execution if an error is
returned, propagating that error upward: that is, Rust&#x27;s &lt;code&gt;?&lt;&#x2F;code&gt; operator,
for example, which turns into essentially &quot;if there was an error,
return that error from this function&quot;.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; This is quite conceptually
nice in many ways: why should error handling be different than any
other data flow in the program? Let&#x27;s describe the type of results to
include the possibility of errors; and let&#x27;s use normal control flow
to handle them. So we can write code like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; f&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Result&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u32&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Error&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  if&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; bad&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Err&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Error&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;::&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;new&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;  Ok&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; g&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Result&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u32&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Error&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; The `?` propagates any error to our caller, returning early.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; result&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; f&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;?&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;  Ok&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;result&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; +&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and we don&#x27;t have to do anything special in &lt;code&gt;g&lt;&#x2F;code&gt; to propagate errors
from &lt;code&gt;f&lt;&#x2F;code&gt; further, other than use the &lt;code&gt;?&lt;&#x2F;code&gt; operator.&lt;&#x2F;p&gt;
&lt;p&gt;But there is a &lt;em&gt;cost&lt;&#x2F;em&gt; to this: it means that every error-producing
function has a larger return type, which might have ABI implications
(another return register at least, if not a stack-allocated
representation of the &lt;code&gt;Result&lt;&#x2F;code&gt; and the corresponding loads&#x2F;stores to
memory), and also, there is at least one conditional branch after
every call to such a function that checks if we need to handle the
error. The dynamic efficiency of the &quot;happy path&quot; (with no thrown
exceptions) is thus impacted. Ideally, we skip any cost unless an
error actually occurs (and then perhaps we accept slightly more cost
in that case, as tradeoffs often go).&lt;&#x2F;p&gt;
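&lt;p&gt;Concretely, the happy path of a call to such a function might
compile down to something like this (illustrative pseudo-assembly,
not actual compiler output):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;call  f                 ;; Result returned in two registers, say:
                        ;;   tag in rax, payload in rdx
test  rax, rax          ;; check the error tag...
jnz   .propagate_error  ;; ...and branch, after every single call
;; happy path continues here&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;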
&lt;p&gt;It turns out that this is possible with the &lt;em&gt;help of the language
runtime&lt;&#x2F;em&gt;. Consider what happens if we omit the &lt;code&gt;Result&lt;&#x2F;code&gt; return types
and error checks at each return. We will need to reach the code that
handles the error in some other way. Perhaps we can jump directly to
this code somehow?&lt;&#x2F;p&gt;
&lt;p&gt;The key idea of &quot;zero-cost exception handling&quot; is to get the compiler
to build side-tables to &lt;em&gt;tell us&lt;&#x2F;em&gt; where this code -- known as a
&quot;handler&quot; -- is. We can walk the callstack, visiting our caller and
its caller and onward, until we find a function that would be
interested in the error condition we are raising. This logic is
implemented with the help of these side-tables and some code in the
language runtime called the &quot;unwinder&quot; (because it &quot;unwinds&quot; the
stack). If no errors are raised, then none of this logic is executed
at runtime. And we no longer have our explicit checks for error
returns in the &quot;happy path&quot; where no errors occur. This is why
this style of error-handling is commonly called &quot;zero-cost&quot;:
more precisely, it is zero-cost when &lt;em&gt;no&lt;&#x2F;em&gt; errors occur, but the
unwinding in case of error can still be expensive.&lt;&#x2F;p&gt;
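&lt;p&gt;In pseudocode, the unwinder&#x27;s job at a throw looks roughly like
this (heavily simplified):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;pc = address of the throw
frame = current stack frame
loop:
  entry = handler_table.lookup(pc)  ;; side-table built by the compiler
  if entry matches the thrown exception kind:
    restore registers for frame; jump to entry.handler_pc
  else:
    pc = return address stored in frame  ;; visit our caller next
    frame = caller of frame&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;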
&lt;p&gt;This is the status quo for exception-handling implementations in most
production languages: for example, in the C++ world, exception
handling is commonly implemented via the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;itanium-cxx-abi.github.io&#x2F;cxx-abi&#x2F;abi-eh.html&quot;&gt;Itanium C++
ABI&lt;&#x2F;a&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#4&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;, which
defines a comprehensive set of tables emitted by the compiler and a
complex dance between the system unwinding library and
compiler-generated code to find and transfer control to
handlers. Handler tables and stack unwinders are common in interpreted
and just-in-time (JIT)-compiled language implementations, too: for
example, SpiderMonkey has &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;firefox-main&#x2F;rev&#x2F;50a34d25155fd70628ee69c7d68a2509c0e3445d&#x2F;js&#x2F;src&#x2F;vm&#x2F;StencilEnums.h#18&quot;&gt;try
notes&lt;&#x2F;a&gt;
on its bytecode (so named for &quot;try blocks&quot;) and a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;firefox-main&#x2F;rev&#x2F;a5316cedc669bcec09efae23521e0af6b9d3d257&#x2F;js&#x2F;src&#x2F;jit&#x2F;JitFrames.cpp#691&quot;&gt;HandleException&lt;&#x2F;a&gt;
function that &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;firefox-main&#x2F;rev&#x2F;a5316cedc669bcec09efae23521e0af6b9d3d257&#x2F;js&#x2F;src&#x2F;jit&#x2F;JitFrames.cpp#751-845&quot;&gt;walks stack
frames&lt;&#x2F;a&gt;
to find a handler.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-wasm-exception-handling-spec&quot;&gt;The Wasm Exception-Handling Spec&lt;&#x2F;h2&gt;
&lt;p&gt;The WebAssembly specification now (since version 3.0) has &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;webassembly&#x2F;exception-handling&quot;&gt;exception
handling&lt;&#x2F;a&gt;. This
proposal was a long time in the making by various folks in the
standards, toolchain and browser worlds, and the CG (standards group)
has now merged it into the spec and included it in the
recently-released &quot;Wasm 3.0&quot; milestone. If you&#x27;re already familiar
with the proposal, you can skip over this section to the Cranelift-
and Wasmtime-specific bits below.&lt;&#x2F;p&gt;
&lt;p&gt;First: let&#x27;s discuss &lt;em&gt;why&lt;&#x2F;em&gt; Wasm needs an extension to the bytecode
definition to support exceptions. As we described above, the key idea
of zero-cost exception handling is that an unwinder visits stack
frames and looks for handlers, transferring control directly to the
first handler it finds, outside the normal function return
path. Because the call stack is &lt;em&gt;protected&lt;&#x2F;em&gt;, or not directly readable
or writable from Wasm code (part of Wasm&#x27;s &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_integrity&quot;&gt;control-flow
integrity&lt;&#x2F;a&gt;
aspect), an unwinder that works this way necessarily must be a
privileged part of the Wasm runtime itself. We can&#x27;t implement it in
&quot;userspace&quot; because there is no way for Wasm bytecode to transfer
control directly back to a distant caller, aside from a chain of
returns. This missing functionality is what the extension to the
specification adds.&lt;&#x2F;p&gt;
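&lt;p&gt;To make the &quot;chain of returns&quot; workaround concrete: a toolchain
could emulate unwinding by checking a flag after every call, sketched
below in Wasm-like pseudocode (the &lt;code&gt;$exception_pending&lt;&#x2F;code&gt;
global is hypothetical):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;(call $f)
(if (global.get $exception_pending)  ;; emulated unwind: test a flag
    (then (return)))                 ;; and pop one frame at a time&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;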
&lt;p&gt;The implementation comes down to only three opcodes (!), and some new
types in the bytecode-level type system. (In other words -- given the
length of this post -- it&#x27;s deceptively simple.) These opcodes are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;try_table&lt;&#x2F;code&gt;, which wraps an inner body, and specifies &lt;em&gt;handlers&lt;&#x2F;em&gt; to
be active during that body. For example:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(block $b1    ;; defines a label for a forward edge to the end of this block&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (block $b2  ;; likewise, another label&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (try_table&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      (catch $tag1 $b1) ;; exceptions with tag `$tag1` will be caught by code at $b1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      (catch_all $b2)   ;; all other exceptions will be caught by code at $b2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      body...)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this example, if an exception is thrown from within the code in
&lt;code&gt;body&lt;&#x2F;code&gt;, and it matches one of the specified tags (more below!),
control will transfer to the location defined by the end of the
given block. (This is the same as other control-flow transfers in
Wasm: for example, a branch &lt;code&gt;br $b1&lt;&#x2F;code&gt; also jumps to the end of
&lt;code&gt;$b1&lt;&#x2F;code&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;This construct is the single all-purpose &quot;catch&quot; mechanism, and is
powerful enough to directly translate typical &lt;code&gt;try&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;catch&lt;&#x2F;code&gt; blocks
in most programming languages with exceptions.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;throw&lt;&#x2F;code&gt;: an instruction to directly throw a new exception. It
carries the tag for the exception, like: &lt;code&gt;throw $tag1&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;throw_ref&lt;&#x2F;code&gt;, used to rethrow an exception that has already been
caught and is held by reference (more below!).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;And that&#x27;s it! We implement those three opcodes and we are &quot;done&quot;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;payloads&quot;&gt;Payloads&lt;&#x2F;h3&gt;
&lt;p&gt;That&#x27;s not the whole story, of course. Ordinarily a source language
will offer the ability to carry some &lt;em&gt;data&lt;&#x2F;em&gt; as part of an exception:
that is, the error condition is not just one of a static set of kinds
of errors, but contains some fields as well. (E.g.: not just &quot;file not
found&quot;, but &quot;file not found: $PATH&quot;.)&lt;&#x2F;p&gt;
&lt;p&gt;One could build this on top of a bytecode-level exception-throw
mechanism that only had throw&#x2F;catch with static tags, with the help of
some global state, but that would be cumbersome; instead, the Wasm
specification offers &lt;em&gt;payloads&lt;&#x2F;em&gt; on each exception. For full
generality, this payload can actually take the form of a &lt;em&gt;list&lt;&#x2F;em&gt; of
values; i.e., it is a full product type (struct type).&lt;&#x2F;p&gt;
&lt;p&gt;We alluded to &quot;tags&quot; above but didn&#x27;t describe them in detail. These
tags are key to the payload definition: each tag is effectively a type
definition that specifies its list of payload value types as
well. (Technically, in the Wasm AST, a tag definition names a
&lt;em&gt;function type&lt;&#x2F;em&gt; with only parameters, no returns, which is a nice way
of reusing an existing entity&#x2F;concept.) Now we show how they are
defined with a sample module:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(module&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; ;; Define a &amp;quot;tag&amp;quot;, which serves to define the specific kind of exception&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; ;; and specify its payload values.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; (tag $t (param i32 i64))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; (func $f (param i32 i64)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       ;; Throw an exception, to be caught by whatever handler is &amp;quot;closest&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       ;; dynamically.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       (throw $t (local.get 0) (local.get 1)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; (func $g (result i32 i64)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       (block $b (result i32 i64)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;; Run a body below, with the given handlers (catch-clauses)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;; in-scope to catch any matching exceptions.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;; Here, if an exception with tag `$t` is thrown within the body,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;; control is transferred to the end of block `$b` (as if we had&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;; branched to it), with the payload values for that exception&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;; pushed to the operand stack.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              (try_table (catch $t $b)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         (call $f (i32.const 1) (i64.const 2)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              (i32.const 3)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              (i64.const 4))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here we&#x27;ve defined one tag (the Wasm text format lets us attach a name
&lt;code&gt;$t&lt;&#x2F;code&gt;, but in the binary format it is only identified by its index, 0),
with two payload values. We can throw an exception with this tag given
values of these types (as in function &lt;code&gt;$f&lt;&#x2F;code&gt;) and we can catch it if we
specify a catch destination as the end of a block meant to return
exactly those types as well. Here, if function &lt;code&gt;$g&lt;&#x2F;code&gt; is invoked, an
exception carrying payload values &lt;code&gt;1&lt;&#x2F;code&gt; and &lt;code&gt;2&lt;&#x2F;code&gt; will be thrown and
then caught by the &lt;code&gt;try_table&lt;&#x2F;code&gt;; the results of
&lt;code&gt;$g&lt;&#x2F;code&gt; will be &lt;code&gt;1&lt;&#x2F;code&gt; and &lt;code&gt;2&lt;&#x2F;code&gt;. (The values &lt;code&gt;3&lt;&#x2F;code&gt; and &lt;code&gt;4&lt;&#x2F;code&gt; are present to allow
the Wasm module to validate, i.e. have correct types, but they are
dynamically unreachable because of the throw in &lt;code&gt;$f&lt;&#x2F;code&gt; and will not be
returned.)&lt;&#x2F;p&gt;
&lt;p&gt;This is an instance where Wasm, being a bytecode, can afford to
generalize a bit relative to real-metal ISAs and offer conveniences to
the Wasm producer (i.e., toolchain generating Wasm modules). In this
sense, it is a little more like a compiler IR. In contrast, most other
exception-throw ABIs have a fixed definition of payload, e.g., one or
two machine register-sized values. In practice some producers might
choose a small fixed signature for all exception tags anyway, but
there is no reason to impose such an artificial limit if there is a
compiler and runtime behind the Wasm in any case.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;unwind-cleanup-and-destructors&quot;&gt;Unwind, Cleanup, and Destructors&lt;&#x2F;h3&gt;
&lt;p&gt;So far, we&#x27;ve seen how Wasm&#x27;s primitives can allow for basic exception
throws and catches, but what about languages with scoped resources,
e.g. C++ with its destructors? If one writes something like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span&gt; Scoped&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;    Scoped&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;() {}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;    ~Scoped&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;() {&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; cleanup&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;void&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; f&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Scoped&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; s&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    throw&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; my_exception&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;then the &lt;code&gt;throw&lt;&#x2F;code&gt; should transfer control out of &lt;code&gt;f&lt;&#x2F;code&gt; and upward to
whatever handler matches, but the destructor of &lt;code&gt;s&lt;&#x2F;code&gt; still needs to run
and call &lt;code&gt;cleanup&lt;&#x2F;code&gt;. This is not quite a &quot;catch&quot; because we don&#x27;t want
to terminate the search: we aren&#x27;t actually handling the error
condition.&lt;&#x2F;p&gt;
&lt;p&gt;The usual approach to compiling such a program is to &quot;catch and
rethrow&quot;. That is, the program is lowered to something like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;try&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    throw&lt;&#x2F;span&gt;&lt;span&gt; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; catch_any&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;e&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;    cleanup&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    rethrow e&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where &lt;code&gt;catch_any&lt;&#x2F;code&gt; catches &lt;em&gt;any&lt;&#x2F;em&gt; exception propagating past this point
on the stack, and &lt;code&gt;rethrow&lt;&#x2F;code&gt; re-throws the same exception.&lt;&#x2F;p&gt;
&lt;p&gt;Wasm&#x27;s exception primitives provide exactly the pieces we need for
this: a &lt;code&gt;catch_all_ref&lt;&#x2F;code&gt; clause, which &lt;em&gt;catches all exceptions&lt;&#x2F;em&gt; and
&lt;em&gt;boxes the caught exception as a reference&lt;&#x2F;em&gt;; and a &lt;code&gt;throw_ref&lt;&#x2F;code&gt;
instruction, which &lt;em&gt;re-throws a previously-caught exception&lt;&#x2F;em&gt;.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#5&quot;&gt;5&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In actuality there is a two-by-two matrix of &quot;catch&quot; options: we can
&lt;code&gt;catch&lt;&#x2F;code&gt; a specific tag or &lt;code&gt;catch_all&lt;&#x2F;code&gt;; and we can catch and
immediately unpack the exception into its payload values (as we saw
above), or we can catch it as a reference. So we have &lt;code&gt;catch&lt;&#x2F;code&gt;,
&lt;code&gt;catch_ref&lt;&#x2F;code&gt;, &lt;code&gt;catch_all&lt;&#x2F;code&gt;, and &lt;code&gt;catch_all_ref&lt;&#x2F;code&gt;.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#6&quot;&gt;6&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
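&lt;p&gt;As a concrete sketch of the catch-and-rethrow pattern with these
primitives (the function names &lt;code&gt;$may_throw&lt;&#x2F;code&gt; and &lt;code&gt;$cleanup&lt;&#x2F;code&gt; here are
illustrative, not from a real module), a cleanup region can be written
in the Wasm text format roughly as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(block $done&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (block $unwind (result exnref)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (try_table (catch_all_ref $unwind)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      (call $may_throw))  ;; protected body&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (br $done))           ;; normal path: skip the cleanup&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ;; caught: the boxed exception reference is on the stack&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (call $cleanup)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (throw_ref))            ;; re-throw to continue unwinding&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;catch_all_ref&lt;&#x2F;code&gt; clause branches to &lt;code&gt;$unwind&lt;&#x2F;code&gt; with the caught
exception boxed as an &lt;code&gt;exnref&lt;&#x2F;code&gt; value, and &lt;code&gt;throw_ref&lt;&#x2F;code&gt; resumes the
search for a handler further up the stack.&lt;&#x2F;p&gt;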
&lt;h3 id=&quot;dynamic-identity-and-compositionality&quot;&gt;Dynamic Identity and Compositionality&lt;&#x2F;h3&gt;
&lt;p&gt;There is one final detail to the Wasm proposal, and in fact it&#x27;s the
part that I find the most interesting and unique. Given the above
introduction, and any familiarity with exception systems in other
language semantics and&#x2F;or runtime systems, one might expect that the
&quot;tags&quot; identifying kinds of exceptions and matching throws with
particular catch handlers would be static labels. In other words, if I
throw an exception with tag &lt;code&gt;$tA&lt;&#x2F;code&gt;, then the first handler for &lt;code&gt;$tA&lt;&#x2F;code&gt;
anywhere up the stack, from any module, should catch it.&lt;&#x2F;p&gt;
&lt;p&gt;However, one of Wasm&#x27;s most significant properties as a bytecode is
its emphasis on isolation. It has a distinction between static
&lt;em&gt;modules&lt;&#x2F;em&gt; and dynamic &lt;em&gt;instances&lt;&#x2F;em&gt; of those modules, and modules have
no &quot;static members&quot;: every entity (e.g., memory, table, or global
variable) defined by a module is replicated per instance of that
module. This creates a clean separation between instances and means
that, for example, one can freely reuse a common module (say, some
kind of low-level glue or helper module) with separate instances in
many places without them somehow communicating or interfering with
each other.&lt;&#x2F;p&gt;
&lt;p&gt;Consider what happens if we have an instance A that invokes some other
(dynamically provided) function reference which ultimately invokes a
callback in A. Say that the instance throws an exception from within
its callback in order to unwind all the way to its outer stack frames,
across the intermediate functions in some other Wasm instance(s):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                A.f   ---------call---------&amp;gt;   B.g   --------call---------&amp;gt;    A.callback&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 ^                                                                  v&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               catch $t                                                           throw $t&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 |                                                                  |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 `----------------------------&amp;lt;-------------------------------------&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The instance A expects that the exception that it throws from its
callback function to &lt;code&gt;f&lt;&#x2F;code&gt; is a &lt;em&gt;local&lt;&#x2F;em&gt; concern to that instance only,
and that B cannot interfere. After all, if the exception tag is
defined inside A, and Wasm preserves modularity, then B should not be
able to name that tag to catch exceptions by that tag, even if it also
uses exception handling internally. The two modules should not
interact: that is the meaning of modularity, and it permits us to
reason about each instance&#x27;s behavior locally, with the effects of
&quot;the rest of the world&quot; confined to imports and exports.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, if one designed a straightforward &quot;static&quot; tag-matching
scheme, this might not be the case if B were an instance of the same
module as A: in that case, if B also used a tag &lt;code&gt;$t&lt;&#x2F;code&gt; internally and
registered handlers for that tag, it could interfere with the desired
throw&#x2F;catch behavior, and violate modularity.&lt;&#x2F;p&gt;
&lt;p&gt;So the Wasm exception handling standard specifies that tags have
&lt;em&gt;dynamic instances&lt;&#x2F;em&gt; as well, just as memories, tables and globals
do. (Put in programming-language theory terms, tags are &lt;em&gt;generative&lt;&#x2F;em&gt;.)
Each instance of a module creates its own dynamic identities for the
statically-defined tags in those modules, and uses those dynamic
identities to tag exceptions and find handlers. This means that no
matter what instance B is, above, if instance A does not export its
tag &lt;code&gt;$t&lt;&#x2F;code&gt; for B to import, there is no way for B to catch the thrown
exception explicitly (it can still catch &lt;em&gt;all&lt;&#x2F;em&gt; exceptions, and it may
do so and rethrow to perform some cleanup). Local modular reasoning is
restored.&lt;&#x2F;p&gt;
&lt;p&gt;Once we have tags as dynamic entities, just like Wasm memories, we can
take the same approach that we do for the other entities to allow them
to be imported and exported. Thus, visibility of exception payloads
and ability for modules to catch certain exceptions is completely
controlled by the instantiation graph and the import&#x2F;export linking,
just as for all other Wasm storage.&lt;&#x2F;p&gt;
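&lt;p&gt;To make this concrete, here is a sketch of tag linking in the text
format (the module and export names are illustrative): one module
defines and exports a tag, and another imports it, so both refer to
the same &lt;em&gt;dynamic&lt;&#x2F;em&gt; tag identity and can match each other&#x27;s throws:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; Module A: defines and exports a tag.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(module&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (tag $t (export &quot;my-exn&quot;) (param i32)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; Module B: imports the tag from A; a catch for $t here&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; matches exceptions thrown with the dynamic tag instance&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; belonging to the particular instance of A it is linked to.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(module&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (import &quot;A&quot; &quot;my-exn&quot; (tag $t (param i32))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Without such an import, no tag in B can ever match A&#x27;s throws, no
matter how the modules are written.&lt;&#x2F;p&gt;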
&lt;p&gt;This is surprising (or at least was to me)! It creates some pretty
unique implementation challenges in the unwinder -- in essence, it
means that we need to know about instance identity for each stack
frame, not just static code location and handler list.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;compiling-exceptions-in-cranelift&quot;&gt;Compiling Exceptions in Cranelift&lt;&#x2F;h2&gt;
&lt;p&gt;Before we implement the primitives for exception handling in Wasmtime,
we need to support exceptions in our underlying compiler backend,
Cranelift.&lt;&#x2F;p&gt;
&lt;p&gt;Why should this be a compiler concern? What is special about
exceptions that makes them different from, say, new Wasm instructions
that implement additional mathematical operators (when we already have
many arithmetic operators in the IR), or Wasm memories (when we
already have loads&#x2F;stores in the IR)?&lt;&#x2F;p&gt;
&lt;p&gt;In brief, the complexities come in three flavors: new kinds of control
flow, fundamentally different than ordinary branches or calls in that
they are &quot;externally actuated&quot; (by the unwinder); a new facet of the
ABI (that we get to define!) that governs how the unwinder interacts
with compiled code; and interactions between the &quot;scoped&quot; nature of
handlers and inlining in particular. We&#x27;ll talk about each below.&lt;&#x2F;p&gt;
&lt;p&gt;Note that much of this discussion started with an
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;36&quot;&gt;RFC&lt;&#x2F;a&gt; for
Wasmtime&#x2F;Cranelift, which had been posted way back in August of 2024
by Daniel Hillerstrom with help from my colleague Nick Fitzgerald, and
was discussed then; many of the choices within were subsequently
refined as I discovered interesting nuances during implementation and
we talked them through.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;control-flow&quot;&gt;Control Flow&lt;&#x2F;h3&gt;
&lt;p&gt;There are a few ways to think about exception handlers from the point
of view of compiler &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Intermediate_representation&quot;&gt;IR (intermediate
representation)&lt;&#x2F;a&gt;.
First, let&#x27;s recognize that exception handling (i) is a form of
control flow, and (ii) has all the same implications for the various
compiler stages that other kinds of control flow do. For example, the register
allocator has to consider how to get registers into the right state
whenever control moves from one basic block to the next (&quot;edge
moves&quot;); exception catches are a new kind of edge, and so the regalloc
needs to be aware of that, too.&lt;&#x2F;p&gt;
&lt;p&gt;One could see every call or other opcode that could throw as having
regular control-flow edges to every possible handler that could
match. I&#x27;ll call this the &quot;regular edges&quot; approach. The upside is that
it&#x27;s pretty simple to retrofit: one &quot;only&quot; needs to add new kinds of
control-flow opcodes that have out-edges, but that&#x27;s already a kind of
thing that IRs have. The disadvantage is that, in functions with a lot
of possible throwing opcodes and&#x2F;or handlers, the overhead can get
quite high. And control-flow graph overhead is a bad kind of overhead:
many analyses&#x27; runtimes are heavily dependent on the edge and node (basic
block) counts, sometimes superlinearly.&lt;&#x2F;p&gt;
&lt;p&gt;The other major option is to build a kind of &lt;em&gt;implicit&lt;&#x2F;em&gt; new control
flow into the IR&#x27;s semantics. For example, one could lower the
source-language semantics of a &quot;try block&quot; down to regions in the IR,
with one set of handlers attached.  This is clearly more efficient
than adding out-edges from (say) every callsite within the try-block
to every handler in scope. On the other hand, it&#x27;s hard to overstate
how invasive this change would be. This means that &lt;em&gt;every&lt;&#x2F;em&gt; traversal
over IR, analyzing dataflow or reachability or any other property, has
to consider these new implicit edges anyway. In a large established
compiler like Cranelift, we can lean on Rust&#x27;s type system for a lot
of different kinds of refactors, but changing a fundamental invariant
goes beyond that: we would likely have a long tail of issues stemming
from such a change, and it would permanently increase the cognitive
overhead of making new changes to the compiler. In general we want to
trend toward a smaller, simpler core and compositional rather than
entangled complexity.&lt;&#x2F;p&gt;
&lt;p&gt;Thus, the choice is clear: in Cranelift we opted to introduce one new
instruction, &lt;code&gt;try_call&lt;&#x2F;code&gt;, that calls a function and catches (some)
exceptions.  In other words, there are now two possible kinds of
return paths: a normal return or (possibly one of many) exceptional
return(s). The handled exceptions and block targets are enumerated in
an &lt;em&gt;exception table&lt;&#x2F;em&gt;. Because there are control-flow edges stemming
from this opcode, it is a block terminator, like a conditional
branch. It looks something like (in Cranelift&#x27;s IR, CLIF):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function %f0(i32) -&amp;gt; i32, f32, f64 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    sig0 = (i32) -&amp;gt; f32 tail&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fn0 = %g(i32) -&amp;gt; f32 tail&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block0(v1: i32):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v2 = f64const 0x1.0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; exception-catching callsite&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        try_call fn0(v1), block1(ret0, v2), [ tag0: block2(exn0), default: block3(exn0) ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; normal return path&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block1(v3: f32, v4: f64):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v5 = iconst.i32 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return v5, v3, v4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; exception handler for tag0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block2(v6: i64):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v7 = ireduce.i32 v6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v8 = iadd_imm.i32 v7, 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v9 = f32const 0x0.0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return v8, v9, v2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; exception handler for all other exceptions&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block3(v10: i64):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v11 = ireduce.i32 v10&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v12 = f32const 0x0.0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v13 = f64const 0x0.0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return v11, v12, v13&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There are a few aspects to note here. First, why are we only concerned
with calls? What about other sources of exceptions? This is an
important invariant in the IR: exception &lt;em&gt;throws&lt;&#x2F;em&gt; are &lt;em&gt;only externally
sourced&lt;&#x2F;em&gt;. In other words, if an exception has been thrown, if we go
deep enough into the callstack, we will find that that throw was
implemented by calling out into the runtime.  The IR itself has no
other opcodes that throw! This turns out to be sufficient: (i) we only
need to build what Wasmtime needs, here, and (ii) we can implement
Wasm&#x27;s throw opcodes as &quot;libcalls&quot;, or calls into the Wasmtime
runtime. So, within Cranelift-compiled code, exception throws always
happen at callsites. We can thus get away with adding only one opcode,
&lt;code&gt;try_call&lt;&#x2F;code&gt;, and attach handler information directly to that opcode.&lt;&#x2F;p&gt;
&lt;p&gt;The next characteristic of note is that handlers are ordinary basic
blocks.  This may not seem remarkable unless one has seen other
compiler IRs, such as LLVM&#x27;s, where exception handlers are definitely
special: they start with &quot;landing pad&quot; instructions, and cannot be
branched to as ordinary basic blocks. That might look something like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function %f() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; Callsite defining a return value `v0`, with normal&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; return path to `block1` and exception handler `block2`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v0 = try_call ..., block1, [ tag0: block2 ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; Normal return; use returned value.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return v0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block2 exn_handler: ;; Specially-marked block!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; Exception handler payload value.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v1 = exception_landing_pad&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This bifurcation of kinds of blocks (normal and exception handler) is
undesirable from our point of view: just as exceptional edges add a
new cross-cutting concern that every analysis and transform needs to
consider, so would new kinds of blocks with restrictions. It was an
explicit design goal (and we have tests that demonstrate it!) that the same
block can be both an ordinary block and a handler block -- not because
that would be common, necessarily (handlers usually do very different
things than normal code paths), but because it&#x27;s one less weird quirk
of the IR.&lt;&#x2F;p&gt;
&lt;p&gt;But then if handlers are normal blocks, the data flow question becomes
very interesting. An exception-catching call, unlike every other
opcode in our IR, has &lt;em&gt;conditionally-defined values&lt;&#x2F;em&gt;: that is, its
normal function return value(s) are available only if the callee
returns normally, and the &lt;em&gt;exception payload value(s)&lt;&#x2F;em&gt;, which are
passed in from the unwinder and carry information about the caught
exception, are available only if the callee throws an exception that
we catch. How can we ensure that these values are represented such
that they can only be used in valid ways? We can&#x27;t make them all
regular SSA definitions of the opcode: that would mean that all
successors (regular return and exceptional) get to use them, as in:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function %f() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; Callsite defining a return value `v0`, with normal return path&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; to `block1` and exception handler `block2`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v0 = try_call ..., block1, [ tag0: block2 ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; Use `v0` legally: it is defined on normal return.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return v0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; Oops! We use `v0` here, but the normal return value is undefined&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; when an exception is caught and control reaches this handler block.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return v0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is the reason that a compiler may choose to make handler blocks
special: by bifurcating the universe of blocks, one ensures that
normal-return and exceptional-return values are used only where
appropriate. Some compiler IRs reify exceptional return payloads via
&quot;landing pad&quot; instructions that must start handler blocks, just as
phis start regular blocks (in phi- rather than blockparam-based
SSA). But, again, this bifurcation is undesirable.&lt;&#x2F;p&gt;
&lt;p&gt;Our insight here, after &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;36&quot;&gt;a lot of
discussion&lt;&#x2F;a&gt;, was to
put the definitions where they belong: &lt;em&gt;on the edges&lt;&#x2F;em&gt;. That is,
regular returns are only defined once we know we&#x27;re following the
regular-return edge, and likewise for exception payloads. But we don&#x27;t
want to have special instructions that must be in the successor
blocks: that&#x27;s a weird distributed invariant and, again, likely to
lead to bugs when transforming IR. Instead, we leverage the fact that
we use &lt;em&gt;blockparam-based SSA&lt;&#x2F;em&gt; and we widen the domain of allowable
block-call arguments.&lt;&#x2F;p&gt;
&lt;p&gt;Whereas previously one might end a block like &lt;code&gt;brif v1, block2(v2, v3), block3(v4, v5)&lt;&#x2F;code&gt;, i.e. with blockparams assigned values in the
chosen successor via a list of value-uses in the branch, a block-call
argument may now be (i) an ordinary SSA value, (ii) a special &quot;normal
return value&quot; sentinel, or
(iii) a special &quot;exceptional return value&quot; sentinel. The latter two
are indexed because there can be more than one of each. So one can
write a block-call in a &lt;code&gt;try_call&lt;&#x2F;code&gt; as &lt;code&gt;block2(ret0, v1, ret1)&lt;&#x2F;code&gt;, which
passes the two return values of the call and a normal SSA value; or
&lt;code&gt;block3(exn0, exn1)&lt;&#x2F;code&gt;, which passes just the two exception payload
values.  We do have a new well-formedness check on the IR that ensures
that (i) normal returns are used only in the normal-return blockcall,
and exception payloads are used only in the handler-table blockcalls;
(ii) normal returns&#x27; indices are bounded by the signature; and (iii)
exception payloads&#x27; indices are bounded by the ABI&#x27;s number of
exception payload values; but all of these checks are local to the
instruction, not distributed across blocks. That&#x27;s nice, and conforms
with the way that all of our other instructions work, too. (Block-call
argument types are then checked against block-parameter types in the
successor block, but that happens the same as for any branch.) So we
have, repeating from above, a callsite like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        try_call fn0(v1), block2(ret0), [ tag0: block3(exn0, exn1) ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;with all of the desired properties: only one kind of block, explicit
control flow, and SSA values defined only where they are legal to use.&lt;&#x2F;p&gt;
&lt;p&gt;All of this may seem somewhat obvious in hindsight, but as attested by
the above GitHub discussions and Cranelift weekly meeting minutes, it
was far from clear when we started how to design all of this to
maximize simplicity and generality and minimize quirks and
footguns. I&#x27;m pretty happy with our final design: it feels like a
natural extension of our core blockparam-SSA control flow graph, and I
managed to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10510&quot;&gt;put it into the
compiler&lt;&#x2F;a&gt;
without too much trouble at all (well, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10502&quot;&gt;a
few&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10485&quot;&gt;PRs&lt;&#x2F;a&gt; and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10555&quot;&gt;associated&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10554&quot;&gt;fixes&lt;&#x2F;a&gt; to
Cranelift
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;214&quot;&gt;and&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;220&quot;&gt;regalloc2&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;216&quot;&gt;functionality&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;224&quot;&gt;and testing&lt;&#x2F;a&gt;;
and I&#x27;m sure I&#x27;ve missed a few).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;data-flow-and-abi&quot;&gt;Data Flow and ABI&lt;&#x2F;h3&gt;
&lt;p&gt;So we have defined an IR that can express exception handlers -- what
about the interaction between this function body and the unwinder? We
will need to define a different kind of semantics to nail down that
interface: in essence, it is a property of the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Application_Binary_Interface&quot;&gt;ABI (Application
Binary
Interface)&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;As mentioned above, exception-handling ABIs already exist for native
code, such as compiled C++. While we are certainly willing to draw
inspiration from native ABIs and align with them as much as makes
sense, in Wasmtime we already define our own ABI&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#7&quot;&gt;7&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;, and so we are
not necessarily constrained by existing standards.&lt;&#x2F;p&gt;
&lt;p&gt;In particular, there is a very good reason we would prefer not to: to
unwind to a particular exception handler, register state must be
restored as specified in the ABI, and the standard Itanium ABI
requires the usual callee-saved (&quot;non-volatile&quot;) registers on the
target ISA to be restored. But this requires (i) having the register
state at time of throw, and (ii) processing unwind metadata at each
stack frame as we walk up the stack, reading out values of saved
registers from stack frames. The latter is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;2710&quot;&gt;already
supported&lt;&#x2F;a&gt;
with a generic &quot;unwind pseudoinstruction&quot; framework I built four years
ago, but would still add complexity to our unwinder, and this
complexity would be load-bearing for correctness; and the former is
extremely difficult with Wasmtime&#x27;s normal runtime-entry
trampolines. So we instead choose to have a simpler exception ABI: all
&lt;code&gt;try_call&lt;&#x2F;code&gt;s, that is, callsites with handlers, clobber &lt;em&gt;all&lt;&#x2F;em&gt;
registers. This means that the compiler&#x27;s ordinary register-allocation
behavior will save all live values to the stack and restore them on
either a normal or exceptional return. We only have to restore the
stack (stack pointer and frame pointer registers) and redirect the
program counter (PC) to a handler.&lt;&#x2F;p&gt;
&lt;p&gt;The other aspect of the ABI that matters to the exception-throw
unwinder is exceptional payload. The native Itanium ABI specifies two
registers on most platforms (e.g.: &lt;code&gt;rax&lt;&#x2F;code&gt; and &lt;code&gt;rdx&lt;&#x2F;code&gt; on x86-64, or &lt;code&gt;x0&lt;&#x2F;code&gt;
and &lt;code&gt;x1&lt;&#x2F;code&gt; on aarch64) to carry runtime-defined payload; so for
simplicity, we adopt the same convention.&lt;&#x2F;p&gt;
&lt;p&gt;That&#x27;s all well and good; now how do we implement &lt;code&gt;try_call&lt;&#x2F;code&gt; with the
appropriate register-allocator behavior to conform to this? We already
have fairly complex ABI handling
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;abi.rs&quot;&gt;machine-independent&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;x64&#x2F;abi.rs&quot;&gt;and&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;aarch64&#x2F;abi.rs&quot;&gt;five&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;s390x&#x2F;abi.rs&quot;&gt;different&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;riscv64&#x2F;abi.rs&quot;&gt;architecture&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;pulley_shared&#x2F;abi.rs&quot;&gt;implementations&lt;&#x2F;a&gt;)
in Cranelift, but it follows a general pattern: we generate a single
instruction at the register-allocator level, and emit uses and defs
with fixed-register constraints. That is, we tell regalloc that
parameters must be in certain registers (e.g., &lt;code&gt;rdi&lt;&#x2F;code&gt;, &lt;code&gt;rsi&lt;&#x2F;code&gt;, &lt;code&gt;rcx&lt;&#x2F;code&gt;,
&lt;code&gt;rdx&lt;&#x2F;code&gt;, &lt;code&gt;r8&lt;&#x2F;code&gt;, &lt;code&gt;r9&lt;&#x2F;code&gt; on x86-64 System-V calling-convention platforms, or
&lt;code&gt;x0&lt;&#x2F;code&gt; up to &lt;code&gt;x7&lt;&#x2F;code&gt; on aarch64 platforms) and let it handle any necessary
moves. So in the simplest case, a call on aarch64 might look like this,
with register-allocator uses&#x2F;defs and constraints annotated:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;bl (call) v0 [def, fixed(x0)], v1 [use, fixed(x0)], v2 [use, fixed(x1)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It is not always this simple, however: calls are not actually always a
single instruction, and this turned out to be quite problematic for
exception-handling support. In particular, when values are returned in
memory, as the ABI specifies they must be when there are more return
values than registers, we add (or added, prior to this work!) load
instructions &lt;em&gt;after&lt;&#x2F;em&gt; the call to load the extra results from their
locations on the stack. So a callsite might generate instructions like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;bl v0 [def, fixed(x0)], ..., v7 [def, fixed(x7)] # first eight return values&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ldr v8, [sp]     # ninth return value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ldr v9, [sp, #8] # tenth return value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and so on. This is problematic simply because we said that the
&lt;code&gt;try_call&lt;&#x2F;code&gt; was a terminator; and it is at the IR level, but no longer
at the regalloc level, and regalloc expects correctly-formed
control-flow graphs as well. So I had to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10502&quot;&gt;do a
refactor&lt;&#x2F;a&gt; to
merge these return-value loads into a single regalloc-level
pseudoinstruction, and in turn this cascaded into a few regalloc fixes
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;226&quot;&gt;allowing more than 256
operands&lt;&#x2F;a&gt; and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;214&quot;&gt;more aggressively splitting live-ranges to allow worst-case
allocation&lt;&#x2F;a&gt;,
plus a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;220&quot;&gt;fix to the live range-splitting
fix&lt;&#x2F;a&gt; and a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;216&quot;&gt;fuzzing
improvement&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;There is one final question that might arise when considering the
interaction of exception handling and register allocation in
Cranelift-compiled code. In Cranelift, we have an invariant that the
register allocator is allowed to insert &lt;em&gt;moves&lt;&#x2F;em&gt; between any two
instructions -- register-to-register, or loads or stores to&#x2F;from
spill-slots in the stack frame, or moves between different spill-slots
-- and indeed it does this whenever there is more state than fits in
registers. It also needs to insert &lt;em&gt;edge moves&lt;&#x2F;em&gt; &quot;between&quot; blocks,
because when jumping to another spot in the code, we might need the
register values in a differently-assigned configuration. When we have
an unwinder that jumps to a different spot in the code to invoke a
handler, we need to ensure that all the proper moves have executed so
the state is as expected.&lt;&#x2F;p&gt;
&lt;p&gt;The answer here turns out to be a careful argument that we don&#x27;t need
to do anything at all. (That&#x27;s the best kind of solution to a problem,
but only if one is correct!) The crux of the argument has to do with
critical edges. A critical edge is one from a block with multiple
successors to one with multiple predecessors: for example, in the graph&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   A    D&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F; \  &#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; B   C&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where A can jump to B or C, and D can also jump to C, then A-to-C is a
critical edge. The problem with critical edges is that there is
nowhere to put code that has to run on the transition from A to C (it
can&#x27;t go in A, because we may go to B or C; and it can&#x27;t go in C,
because we may have come from A or D). So the register allocator
prohibits them, and we &quot;split&quot; them when generating code by inserting
empty blocks (&lt;code&gt;e&lt;&#x2F;code&gt; below) on them:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   A    D&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F; \   |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; |   e  |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; |   \ &#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; B    C&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
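The critical-edge rule is mechanical enough to sketch in code. Here is a minimal illustration (not Cranelift's or regalloc2's actual implementation): an edge is critical when its source has multiple successors and its target has multiple predecessors, and splitting inserts a fresh empty block on each such edge.

```rust
// Minimal sketch of critical-edge detection and splitting.
// Blocks are numbered 0..num_blocks; edges are (from, to) pairs.
// An illustration of the concept, not Cranelift's code.
fn split_critical_edges(num_blocks: usize, edges: Vec<(usize, usize)>) -> Vec<(usize, usize)> {
    let mut succs = vec![0usize; num_blocks];
    let mut preds = vec![0usize; num_blocks];
    for &(from, to) in &edges {
        succs[from] += 1;
        preds[to] += 1;
    }
    let mut next_block = num_blocks; // fresh block numbers for inserted `e` blocks
    let mut result = Vec::new();
    for (from, to) in edges {
        if succs[from] > 1 && preds[to] > 1 {
            // Critical edge: insert an empty block so edge moves have
            // a home that runs only on this particular transition.
            result.push((from, next_block));
            result.push((next_block, to));
            next_block += 1;
        } else {
            result.push((from, to));
        }
    }
    result
}
```

With the graph above (A=0, B=1, C=2, D=3), only the A-to-C edge is critical, and it gains a new block `e`.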
&lt;p&gt;The key insight is that a &lt;code&gt;try_call&lt;&#x2F;code&gt; always has more than one
successor as long as it has a handler (because it must always have a
normal return-path successor too)&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#8&quot;&gt;8&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;; and in this case, because we
split critical edges, the immediate successor block on the
exception-catch path has only one predecessor. So the register
allocator can always put its moves that have to run on catching an
exception in the successor (handler) block rather than the predecessor
block. Our rule for where to put edge moves prefers the successor
(block &quot;after&quot; the edge) unless it has multiple in-edges, so this was
already the case. The only thing we have to be careful about is to
record the address of the &lt;em&gt;inserted edge block&lt;&#x2F;em&gt;, if any (&lt;code&gt;e&lt;&#x2F;code&gt; above),
rather than the IR-level handler block (&lt;code&gt;C&lt;&#x2F;code&gt; above), in the handler
table.&lt;&#x2F;p&gt;
&lt;p&gt;And that&#x27;s pretty much it, as far as register allocation is concerned!&lt;&#x2F;p&gt;
&lt;p&gt;We&#x27;ve now covered the basics of Cranelift&#x27;s exception support. At this
point, having landed the compiler half but not the Wasmtime half, I
context-switched away for a bit, and in the meantime, bjorn3 picked
this support up right away as a means to add panic-unwinding support
to
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rust-lang&#x2F;rustc_codegen_cranelift&quot;&gt;&lt;code&gt;rustc_codegen_cranelift&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
the Cranelift-based Rust compiler backend. With &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10593&quot;&gt;a few small
changes&lt;&#x2F;a&gt; they
contributed, and a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10709&quot;&gt;followup edge-case
fix&lt;&#x2F;a&gt; and a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10609&quot;&gt;refactor&lt;&#x2F;a&gt;,
panic-unwinding support in &lt;code&gt;rustc_codegen_cranelift&lt;&#x2F;code&gt; was working. That
was very good intermediate validation that what I had built was usable
and relatively solid.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;exceptions-in-wasmtime&quot;&gt;Exceptions in Wasmtime&lt;&#x2F;h2&gt;
&lt;p&gt;We have a compiler that supports exceptions; we understand Wasm
exception semantics; let&#x27;s build support into Wasmtime! How hard could
it be?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;challenge-1-garbage-collection-interactions&quot;&gt;Challenge 1: Garbage Collection Interactions&lt;&#x2F;h3&gt;
&lt;p&gt;I started by sketching out the codegen for each of the three opcodes
(&lt;code&gt;try_table&lt;&#x2F;code&gt;, &lt;code&gt;throw&lt;&#x2F;code&gt;, and &lt;code&gt;throw_ref&lt;&#x2F;code&gt;). My mental model at the very
beginning of this work, having read but not fully internalized the
Wasm exception-handling proposal, was that I would be able to
implement a &quot;basic&quot; throw&#x2F;catch first, and then somehow build the
&lt;code&gt;exnref&lt;&#x2F;code&gt; objects later. And I had figured I could build &lt;code&gt;exnref&lt;&#x2F;code&gt;s in a
(in hindsight) somewhat hacky way, by aggregating values together in a
kind of tuple and creating a table of such tuples indexed by exnrefs,
just as Wasmtime does for externrefs.&lt;&#x2F;p&gt;
&lt;p&gt;This understanding quickly gave way to a deeper one when I realized a
few things:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Exception objects (exnrefs) can carry references to other GC objects
(that is, GC types can be part of the payload signature of an
exception), and GC objects can store exnrefs in fields. Hence,
exnrefs need to be traced, and can participate in GC cycles; this
either implies an additional collector on top of our GC collector
(ugh) or means that exception objects need to be on the GC heap
when GC is enabled.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We&#x27;ll need a host API to introspect and build exception objects, and
we already have nice host APIs for GC objects.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;There was a question &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11230&quot;&gt;in an extensively-discussed
PR&lt;&#x2F;a&gt; whether
we could build a cheap &quot;subset&quot; implementation that doesn&#x27;t mandate
the existence of a GC heap for storing exception objects. This would
be great in theory for guests that use exceptions for C-level
setjmp&#x2F;longjmp but use no other GC features. However, it&#x27;s a little
tricky for a few reasons. First, this would require the subset to
exclude &lt;code&gt;throw_ref&lt;&#x2F;code&gt; (so we don&#x27;t have to invent another kind of
exception object storage). But it&#x27;s not great to subset the spec --
and &lt;code&gt;throw_ref&lt;&#x2F;code&gt; is not just for GC guest languages, but also for
rethrows. Second, more generally, this is additional maintenance and
testing surface that we&#x27;d rather not have for now. Instead, we expect
that we can make GC cheap enough, and its growth heuristic smart
enough, that a &quot;frequent setjmp&#x2F;longjmp&quot; stress-test of exceptions (for
example) should live within a very small (e.g., few-kilobyte) GC heap,
essentially approximating the purpose-built storage. My colleague Nick
Fitzgerald (who built and is driving improvements to Wasmtime&#x27;s GC
support) wrote up &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;11256&quot;&gt;a nice
issue&lt;&#x2F;a&gt;
describing the tradeoffs and ideas we have.&lt;&#x2F;p&gt;
&lt;p&gt;All of that said, we&#x27;ll only build one exception object implementation
-- great! -- but it will have to be a new kind of GC object. This
spawned a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11230&quot;&gt;large
PR&lt;&#x2F;a&gt; to build
out exception objects first, prior to actual support for throwing and
catching them, with host APIs to allocate them and inspect their
fields. In essence, they are structs with immutable fields, a
less-exposed type lattice, and no subtyping.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;challenge-2-generative-tags-and-dynamic-identity&quot;&gt;Challenge 2: Generative Tags and Dynamic Identity&lt;&#x2F;h3&gt;
&lt;p&gt;So there I was, implementing the &lt;code&gt;throw&lt;&#x2F;code&gt; instruction&#x27;s libcall
(runtime implementation), and finally getting to the heart of the
matter: the unwinder itself, which walks stack frames to find a
matching exception handler.  This is the final bit of functionality
that ties it all together. We&#x27;re almost there!&lt;&#x2F;p&gt;
&lt;p&gt;But wait: check out that &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;webassembly.github.io&#x2F;spec&#x2F;core&#x2F;exec&#x2F;instructions.html#xref-syntax-instructions-syntax-instr-control-mathsf-throw-x&quot;&gt;spec
language&lt;&#x2F;a&gt;.
We load the &quot;tag address&quot; from the store in step 9: we allocate the
exception instance &lt;code&gt;{tag z.tags[x], fields val^n}&lt;&#x2F;code&gt;. What is this
&lt;code&gt;tags&lt;&#x2F;code&gt; array on the store (&lt;code&gt;z&lt;&#x2F;code&gt;) in the runtime semantics? Tags have
dynamic identity, not static identity! (This is the part where I
learned about the thing I described
&lt;a href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;blog&#x2F;2025&#x2F;11&#x2F;06&#x2F;exceptions&#x2F;#dynamic-identity-and-compositionality&quot;&gt;above&lt;&#x2F;a&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;This was a problem, because I had defined exception tables to
associate handlers with tags identified by integer (&lt;code&gt;u32&lt;&#x2F;code&gt;),
like most other entities in Cranelift IR. I had figured this would
be sufficient to let Wasmtime define indices (say: the index of the tag in
the module), and then we could compare static tag IDs.&lt;&#x2F;p&gt;
&lt;p&gt;Perhaps this is no problem: the static index defines the entity ID in
the module (defined or imported tag), and we can compare that and the
instance ID to see if a handler is a match. But how do we get the
instance ID from the stack frame?&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that Wasmtime didn&#x27;t have a way, because nothing had
needed that yet. (This deficiency had been noticed before when
implementing Wasm coredumps, but there hadn&#x27;t been enough reason or
motivation to fix it then.) So I &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;11285&quot;&gt;filed an
issue&lt;&#x2F;a&gt; with
a few ideas. We could add a new field in every frame storing the
instance pointer -- and in fact this is a simple version of what at
least one other production Wasm implementation, in the SpiderMonkey
web engine,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;firefox-main&#x2F;rev&#x2F;643d732886fe0de4e2a3eee3c5ed9bd0d47c77cf&#x2F;js&#x2F;src&#x2F;wasm&#x2F;WasmFrame.h#112-115&quot;&gt;does&lt;&#x2F;a&gt;
(though as described in that &lt;code&gt;[SMDOC]&lt;&#x2F;code&gt; comment, it only stores
instance pointers on transitions between frames of different
instances; this is enough for the unwinder when walking linearly up
the stack). But that would add overhead to &lt;em&gt;every&lt;&#x2F;em&gt; Wasm function (or
with SpiderMonkey&#x27;s approach, require adding trampolines between
instances, which would be a large change for Wasmtime), and exception
handling is still used somewhat rarely in practice.  Ideally we&#x27;d have
a &quot;pay-as-you-go&quot; scheme with as little extra complexity as possible.&lt;&#x2F;p&gt;
&lt;p&gt;Instead, I came up with an idea to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11321&quot;&gt;add &quot;dynamic context&quot; items to
exception handler
lists&lt;&#x2F;a&gt;. The
idea is that we inject an SSA value into the list and it is stored in
a stack location that is given in the handler table metadata, so the
stack-walker can find it. To Cranelift, this is some arbitrary opaque
value; Wasmtime will use it to store the raw instance pointer
(&lt;code&gt;vmctx&lt;&#x2F;code&gt;) for use by the unwinder.&lt;&#x2F;p&gt;
&lt;p&gt;This filled out the design to a more general state nicely: it is
symmetric with exception payload, in the sense that the compiled code
can communicate context or state &lt;em&gt;to&lt;&#x2F;em&gt; the unwinder as it reads the
frames, and the unwinder in turn can communicate data &lt;em&gt;to&lt;&#x2F;em&gt; the
compiled code when it unwinds.&lt;&#x2F;p&gt;
&lt;p&gt;It turns out -- though I didn&#x27;t intend this at all at the time -- that
this also nicely solves the &lt;em&gt;inlining problem&lt;&#x2F;em&gt;. In brief, we want all
of our IR to be &quot;local&quot;, not treating the function boundary specially;
this way, IR can be composed by the inliner without anything
breaking. Storing some &quot;current instance&quot; state for the whole function
will, of course, break when we inline a function from one module
(hence instance) into another!&lt;&#x2F;p&gt;
&lt;p&gt;Instead, we can give a nice operational semantics to handler tables
with dynamic-context items: the unwinder should read left-to-right,
updating its &quot;current dynamic context&quot; at each dynamic-context item,
and checking for a tag match at tag-handler items. Then the inliner
can &lt;em&gt;compose&lt;&#x2F;em&gt; exception tables: when a &lt;code&gt;try_call&lt;&#x2F;code&gt; callsite inlines a
function body as its callee, and that body itself has any other
callsites, we attach a handler table that simply concatenates the
exception table items.&lt;&#x2F;p&gt;
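As a sketch of these operational semantics (with hypothetical types; this is an illustration, not Cranelift's actual data structures): the unwinder scans items left-to-right, updating a "current context" at each dynamic-context item, and a tag handler matches only under the right context. Here I model a thrown exception's dynamic tag identity as the pair of its defining instance's context value and its tag index.

```rust
// Illustrative left-to-right handler-table matching with
// dynamic-context items. Names and types are hypothetical.

#[derive(Clone, Copy, PartialEq)]
struct Context(u64); // e.g. a raw instance (vmctx) pointer

enum Item {
    /// Switch the current dynamic context for subsequent tag items.
    Context(Context),
    /// A handler for a tag index, interpreted relative to the current context.
    Tag { tag: u32, handler: usize },
    /// A catch-all handler.
    CatchAll { handler: usize },
}

/// Find the first matching handler for a throw of tag `tag` defined by
/// instance `ctx`. A real implementation would resolve the handler's
/// (current context, tag index) pair to a dynamic tag identity and
/// compare identities; modeling identity as the (ctx, tag) pair keeps
/// the sketch small.
fn find_handler(items: &[Item], ctx: Context, tag: u32) -> Option<usize> {
    let mut current: Option<Context> = None;
    for item in items {
        match item {
            Item::Context(c) => current = Some(*c),
            Item::Tag { tag: t, handler } => {
                if current == Some(ctx) && *t == tag {
                    return Some(*handler);
                }
            }
            Item::CatchAll { handler } => return Some(*handler),
        }
    }
    None
}
```

Composition by concatenation falls out naturally: when the inliner splices a callee's items after the caller's, each segment carries its own context item, so inner handlers are checked against the callee's instance and outer handlers against the caller's.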
&lt;p&gt;It&#x27;s important, here, to point out another surprising fact about Wasm
semantics: we &lt;em&gt;cannot do certain optimizations&lt;&#x2F;em&gt; to resolve handlers
statically or optimize the handler list, or at least not naively,
without global program analysis to understand where tags come
from. For example, if we see a handler for tag 0 then one for tag 1,
and we see a throw for tag 1 directly inside the &lt;code&gt;try_table&lt;&#x2F;code&gt;&#x27;s body, we
cannot necessarily resolve it: tag 0 and tag 1 could be the same tag!&lt;&#x2F;p&gt;
&lt;p&gt;Wait, how can that be? Well, consider &lt;em&gt;tag imports&lt;&#x2F;em&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(module&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (import &amp;quot;test&amp;quot; &amp;quot;e0&amp;quot; (tag $e0))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (import &amp;quot;test&amp;quot; &amp;quot;e1&amp;quot; (tag $e1))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (func ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (try_table&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   (catch $e0 $b0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   (catch $e1 $b1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   (throw $e1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   (unreachable))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We could instantiate this module with the same dynamic tag instance
for both imports, in which case the first handler (to block &lt;code&gt;$b0&lt;&#x2F;code&gt;)
matches; or with separate tags, in which case block &lt;code&gt;$b1&lt;&#x2F;code&gt; matches. The only way
to win the optimization game is not to play -- we have to preserve the
original handler list.  Fortunately, that makes the compiler&#x27;s job
easier. We transcribe the &lt;code&gt;try_table&lt;&#x2F;code&gt;&#x27;s handlers directly to Cranelift
exception-handler tables, and those directly to metadata in the
compiled module, read in exactly that order by the unwinder&#x27;s
handler-matching logic.&lt;&#x2F;p&gt;
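To make the aliasing concrete, here is a hypothetical model (not Wasmtime's actual representation) in which each instantiated tag gets a store-level identity, and a module instance maps its static tag indices to those identities. Whether two static indices alias is only decided at instantiation time:

```rust
use std::collections::HashMap;

// Hypothetical model of dynamic tag identity: the store hands out
// tag ids at instantiation, and each instance maps its static tag
// indices to them. Not Wasmtime's actual representation.

type TagId = u32; // identity of an instantiated tag in the store

struct Instance {
    tags: HashMap<u32, TagId>, // static index -> dynamic identity
}

/// Two static tag indices alias iff they resolve to the same dynamic tag.
fn tags_alias(inst: &Instance, a: u32, b: u32) -> bool {
    inst.tags.contains_key(&a) && inst.tags.get(&a) == inst.tags.get(&b)
}
```

In the module above, a static analysis sees only indices `$e0` and `$e1`; only the instantiation-time map decides whether they name one tag or two.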
&lt;h3 id=&quot;challenge-3-rooting&quot;&gt;Challenge 3: Rooting&lt;&#x2F;h3&gt;
&lt;p&gt;Since exception objects are GC-managed objects, we have to ensure that
they are properly &lt;em&gt;rooted&lt;&#x2F;em&gt;: that is, any handles to these objects
outside of references inside other GC objects need to be known to the
GC so the objects remain alive (and so the references are updated in
the case of a moving GC).&lt;&#x2F;p&gt;
&lt;p&gt;Within a Wasm-to-Wasm exception throw scenario, this is fairly easy:
the references are rooted in the compiled code on either side of the
control-flow transfer, and the reference only briefly passes through
the unwinder. As long as we are careful to handle it with the
appropriate types, all will work fine.&lt;&#x2F;p&gt;
&lt;p&gt;Passing exceptions across the host&#x2F;Wasm boundary is another matter,
though. We support the full matrix of {host, Wasm} x {host, Wasm}
exception catch&#x2F;throw pairs: that is, exceptions can be thrown from
native host code called by Wasm (via a Wasm import), and exceptions
can be thrown out of Wasm code and returned as a kind of error to the
host code that invoked the Wasm. This works by boxing the exception
inside an &lt;code&gt;anyhow::Error&lt;&#x2F;code&gt; so we use Rust-style value-based error
propagation (via &lt;code&gt;Result&lt;&#x2F;code&gt; and the &lt;code&gt;?&lt;&#x2F;code&gt; operator) in host code.&lt;&#x2F;p&gt;
&lt;p&gt;What happens when we have a value inside the &lt;code&gt;Error&lt;&#x2F;code&gt; that holds an
exception object in the Wasmtime &lt;code&gt;Store&lt;&#x2F;code&gt;? How does Wasmtime know this
is rooted?&lt;&#x2F;p&gt;
&lt;p&gt;The answer in Wasmtime prior to recent work was to use one of two
kinds of external rooting wrappers: &lt;code&gt;Rooted&lt;&#x2F;code&gt; and
&lt;code&gt;ManuallyRooted&lt;&#x2F;code&gt;. Both wrappers hold an index into a table contained
inside the &lt;code&gt;Store&lt;&#x2F;code&gt;, and that table contains the actual GC
reference. This allows the GC to easily see the roots and update them.&lt;&#x2F;p&gt;
&lt;p&gt;The difference lies in the lifetime disciplines: &lt;code&gt;ManuallyRooted&lt;&#x2F;code&gt;
requires, as the name implies, manual unrooting; it has no &lt;code&gt;Drop&lt;&#x2F;code&gt;
implementation, and so easily creates leaks. &lt;code&gt;Rooted&lt;&#x2F;code&gt;, on the other
hand, has a LIFO (last-in first-out) discipline based on a &lt;code&gt;Scope&lt;&#x2F;code&gt;, an
RAII type created by the embedder (user) of Wasmtime. &lt;code&gt;Rooted&lt;&#x2F;code&gt; GC
references that escape that dynamic scope are unrooted, and will cause
an error (panic) at runtime if used. Neither of those behaviors is
ideal for a value type -- an exception -- that is &lt;em&gt;meant&lt;&#x2F;em&gt; to escape
scopes via &lt;code&gt;?&lt;&#x2F;code&gt;-propagation.&lt;&#x2F;p&gt;
&lt;p&gt;The design that we landed on, instead, takes a different and much
simpler approach: the &lt;code&gt;Store&lt;&#x2F;code&gt; has a single, explicit root slot for the
&quot;pending exception&quot;, and host code can set this and then return a
&lt;em&gt;sentinel value&lt;&#x2F;em&gt; (&lt;code&gt;wasmtime::ThrownException&lt;&#x2F;code&gt;) in the &lt;code&gt;Result&lt;&#x2F;code&gt;&#x27;s error
type (boxed up into an &lt;code&gt;anyhow::Error&lt;&#x2F;code&gt;). This easily allows
propagation to work as expected, with no unbounded leaks (there is
only one pending exception that is rooted) and no unrooted propagating
exceptions (because no actual GC reference propagates, only the
sentinel).&lt;&#x2F;p&gt;
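Schematically, the pattern looks like the following. The types here (&lt;code&gt;Store&lt;&#x2F;code&gt;, &lt;code&gt;ExnRef&lt;&#x2F;code&gt;, &lt;code&gt;ThrownException&lt;&#x2F;code&gt;) are simplified stand-ins rather than Wasmtime's real API surface; the point is the shape of the control flow: the GC reference parks in the store's single root slot while only the sentinel travels through the &lt;code&gt;Result&lt;&#x2F;code&gt;.

```rust
// Schematic model of the "single pending-exception root slot" design.
// All types are stand-ins, not Wasmtime's real ones.

#[derive(Debug, PartialEq)]
struct ExnRef(u32); // stand-in for a GC reference to an exception object

#[derive(Debug, PartialEq)]
struct ThrownException; // sentinel error value; carries no GC reference

#[derive(Default)]
struct Store {
    // The one root the GC always knows about for in-flight exceptions.
    pending_exception: Option<ExnRef>,
}

impl Store {
    fn throw(&mut self, exn: ExnRef) -> ThrownException {
        self.pending_exception = Some(exn);
        ThrownException
    }
    fn take_pending_exception(&mut self) -> Option<ExnRef> {
        self.pending_exception.take()
    }
}

// A host function propagates via value-based errors: the sentinel can
// flow through `Result` and `?` freely, while the GC reference stays
// rooted in the store the whole time.
fn host_import(store: &mut Store) -> Result<(), ThrownException> {
    let exn = ExnRef(42); // imagine allocating an exception object here
    Err(store.throw(exn))
}
```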
&lt;p&gt;As a side-quest, while thinking through this rooting dilemma, I also
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;11445&quot;&gt;realized&lt;&#x2F;a&gt;
that it &lt;em&gt;should&lt;&#x2F;em&gt; be possible to create an &quot;owned&quot; rooted reference
that behaves more like a conventional owned Rust value (e.g. &lt;code&gt;Box&lt;&#x2F;code&gt;);
hence &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11514&quot;&gt;&lt;code&gt;OwnedRooted&lt;&#x2F;code&gt; was born to replace
&lt;code&gt;ManuallyRooted&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.
This type works without requiring access to the &lt;code&gt;Store&lt;&#x2F;code&gt; to unroot when
dropped; the key idea is to hold a refcount to a separate tiny
allocation that is used as a &quot;drop flag&quot;, and then have the store
periodically scan these drop-flags and lazily remove roots, with a
thresholding algorithm to give that scanning amortized linear-time
behavior.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#9&quot;&gt;9&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
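A stripped-down sketch may make the drop-flag idea concrete; &lt;code&gt;RootTable&lt;&#x2F;code&gt; and &lt;code&gt;OwnedRoot&lt;&#x2F;code&gt; here are illustrative stand-ins, not Wasmtime's types. Each handle shares a tiny flag allocation with the root table; &lt;code&gt;Drop&lt;&#x2F;code&gt; flips the flag without needing access to the store, and a periodic sweep (in the real design, gated by a threshold so the scanning cost is amortized) retains only live roots.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Stripped-down sketch of the drop-flag rooting scheme.
struct OwnedRoot {
    dropped: Arc<AtomicBool>,
}

impl Drop for OwnedRoot {
    fn drop(&mut self) {
        // No store access needed here: just flip the shared flag.
        self.dropped.store(true, Ordering::Release);
    }
}

struct RootTable {
    // (GC reference stand-in, drop flag shared with the handle).
    roots: Vec<(u32, Arc<AtomicBool>)>,
}

impl RootTable {
    fn new() -> Self {
        RootTable { roots: Vec::new() }
    }

    fn root(&mut self, gc_ref: u32) -> OwnedRoot {
        let flag = Arc::new(AtomicBool::new(false));
        self.roots.push((gc_ref, flag.clone()));
        OwnedRoot { dropped: flag }
    }

    /// Lazily remove dead roots. A real implementation runs this only
    /// after enough handles have accumulated since the last sweep,
    /// giving amortized linear-time behavior.
    fn sweep(&mut self) {
        self.roots.retain(|(_, flag)| !flag.load(Ordering::Acquire));
    }

    fn live_roots(&self) -> Vec<u32> {
        self.roots.iter().map(|(r, _)| *r).collect()
    }
}
```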
&lt;p&gt;Now that we have this, in theory, we could pass an
&lt;code&gt;OwnedRooted&amp;lt;ExnRef&amp;gt;&lt;&#x2F;code&gt; directly in the &lt;code&gt;Error&lt;&#x2F;code&gt; type to propagate
exceptions through host code; but the store-rooted approach is simple
enough, has a marginal performance advantage (no separate allocation),
and so I don&#x27;t see a strong need to change the API at the moment.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;life-of-an-exception-quick-walkthrough&quot;&gt;Life of an Exception: Quick Walkthrough&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we&#x27;ve discussed all the design choices, let&#x27;s walk through
the life of an exception throw&#x2F;catch, from start to finish. Let&#x27;s assume
a Wasm-to-Wasm throw&#x2F;catch for simplicity here.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;First, the Wasm program is executing within a &lt;code&gt;try_table&lt;&#x2F;code&gt;, which
results in exception-handler &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;translate&#x2F;code_translator.rs#L609-L613&quot;&gt;catch blocks being
created&lt;&#x2F;a&gt;
for each handler case listed in the &lt;code&gt;try_table&lt;&#x2F;code&gt; instruction. The
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;translate&#x2F;code_translator.rs#L4325&quot;&gt;&lt;code&gt;create_catch_block&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
function generates code that invokes
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;func_environ&#x2F;gc&#x2F;enabled.rs#L415&quot;&gt;&lt;code&gt;translate_exn_unbox&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which reads out all of the fields from the exception object and
pushes them onto the Wasm operand stack in the handler path. This
handler block is registered in the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;translate&#x2F;stack.rs#L570&quot;&gt;&lt;code&gt;HandlerState&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which tracks the current lexical stack of handlers (and hands out
checkpoints so that when we pop out of a Wasm block-type operator,
we can pop the handlers off the state as well). These handlers are
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;translate&#x2F;stack.rs#L611&quot;&gt;provided as an
iterator&lt;&#x2F;a&gt;
which is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;translate&#x2F;code_translator.rs#L661&quot;&gt;passed to the &lt;code&gt;translate_call&lt;&#x2F;code&gt;
method&lt;&#x2F;a&gt;
and eventually ends up &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;func_environ.rs#L2366-L2379&quot;&gt;creating an exception
table&lt;&#x2F;a&gt;
on a &lt;code&gt;try_call&lt;&#x2F;code&gt; instruction. This &lt;code&gt;try_call&lt;&#x2F;code&gt; will invoke whatever
Wasm code is about to throw the exception.&lt;&#x2F;li&gt;
&lt;li&gt;Then, the Wasm program reaches a &lt;code&gt;throw&lt;&#x2F;code&gt; opcode, which is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;translate&#x2F;code_translator.rs#L621&quot;&gt;translated&lt;&#x2F;a&gt;
via
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;func_environ.rs#L2825&quot;&gt;&lt;code&gt;FuncEnvironment::translate_exn_throw&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
to a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;func_environ&#x2F;gc&#x2F;enabled.rs#L473-L482&quot;&gt;three-operation
sequence&lt;&#x2F;a&gt;
that fetches the current instance ID (via a libcall into the
runtime), allocates a new exception object with that instance ID and
a fixed tag number and fills in its slots with the given values
popped from the Wasm operand stack, and delegates to &lt;code&gt;throw_ref&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;The &lt;code&gt;throw_ref&lt;&#x2F;code&gt; opcode implementation then invokes the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;func_environ&#x2F;gc&#x2F;enabled.rs#L519&quot;&gt;&lt;code&gt;throw_ref&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
libcall.&lt;&#x2F;li&gt;
&lt;li&gt;This libcall is deceptively simple: its
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;libcalls.rs#L1695-L1707&quot;&gt;implementation&lt;&#x2F;a&gt;
sets the pending exception on the store, and returns a sentinel that
signals a pending exception. That&#x27;s it!&lt;&#x2F;li&gt;
&lt;li&gt;This works because the glue code for &lt;em&gt;all&lt;&#x2F;em&gt; libcalls processes errors
(via the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;traphandlers.rs#L152&quot;&gt;&lt;code&gt;HostResult&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
trait implementations) and eventually reaches &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;traphandlers.rs#L773-L786&quot;&gt;this
case&lt;&#x2F;a&gt;
which sees a pending exception sentinel and invokes
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;throw.rs#L15&quot;&gt;&lt;code&gt;compute_handler&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;. Now
we&#x27;re getting to the heart of the exception-throw implementation.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;compute_handler&lt;&#x2F;code&gt; walks the stack with
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;throw.rs#L45&quot;&gt;&lt;code&gt;Handler::find&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which itself is based on
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;stackwalk.rs#L248&quot;&gt;&lt;code&gt;visit_frames&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which does about what one would expect for code with a frame-pointer
chain: it walks the singly-linked list of frames. At each frame, the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;throw.rs#L51&quot;&gt;closure&lt;&#x2F;a&gt;
that &lt;code&gt;compute_handler&lt;&#x2F;code&gt; gave to &lt;code&gt;Handler::find&lt;&#x2F;code&gt; looks up the program
counter in that frame (which will be a return address, i.e., the
instruction after the call that created the next lower frame) using
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;module&#x2F;registry.rs#L74&quot;&gt;&lt;code&gt;lookup_module_by_pc&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
to find a &lt;code&gt;Module&lt;&#x2F;code&gt;, which itself has an
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;exception_table.rs#L225&quot;&gt;&lt;code&gt;ExceptionTable&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
(a parser for serialized metadata produced during compilation from
Cranelift metadata) that knows how to look up a PC within a
module. This will produce an &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;exception_table.rs#L310&quot;&gt;&lt;code&gt;Iterator&lt;&#x2F;code&gt; over
handlers&lt;&#x2F;a&gt;
which we &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;throw.rs#L63&quot;&gt;test in
order&lt;&#x2F;a&gt;
to see if any match. (The groups of exception-handler table items
that come out of Cranelift are post-processed
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;exception_table.rs#L144&quot;&gt;here&lt;&#x2F;a&gt;
to generate the tables that the above routines search.)&lt;&#x2F;li&gt;
&lt;li&gt;If we find a handler, that is, if &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;throw.rs#L108-L109&quot;&gt;the dynamic tag instance is the
same&lt;&#x2F;a&gt;
or &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;throw.rs#L67&quot;&gt;we reach a catch-all
handler&lt;&#x2F;a&gt;,
then we have an exception handler! We return the PC and SP to
restore
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;throw.rs#L112-L120&quot;&gt;here&lt;&#x2F;a&gt;,
computing SP via an FP-to-SP offset (i.e., the size of the frame),
which is fixed and included in the exception tables when we
construct them.&lt;&#x2F;li&gt;
&lt;li&gt;That action then becomes an &lt;code&gt;UnwindState::UnwindToWasm&lt;&#x2F;code&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;traphandlers.rs#L779&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;This &lt;code&gt;UnwindToWasm&lt;&#x2F;code&gt; state then triggers &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;traphandlers.rs#L913-L933&quot;&gt;this
case&lt;&#x2F;a&gt;
in the &lt;code&gt;unwind&lt;&#x2F;code&gt; libcall, which is invoked whenever any libcall
returns an error code; that eventually calls the no-return function
&lt;code&gt;resume_to_exception_handler&lt;&#x2F;code&gt;, which is a little function written in
inline assembly that does exactly what it says on the tin. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;arch&#x2F;x86.rs#L32-L34&quot;&gt;These
three
instructions&lt;&#x2F;a&gt;
set &lt;code&gt;rsp&lt;&#x2F;code&gt; and &lt;code&gt;rbp&lt;&#x2F;code&gt; to their new values, and jump to the new &lt;code&gt;rip&lt;&#x2F;code&gt;
(PC). The same stub exists for each of our four native-compilation
architectures (x86-64 above,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;arch&#x2F;aarch64.rs#L60-L62&quot;&gt;aarch64&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;arch&#x2F;riscv64.rs#L29-L31&quot;&gt;riscv64&lt;&#x2F;a&gt;,
and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;arch&#x2F;s390x.rs#L32-L33&quot;&gt;s390x&lt;&#x2F;a&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#10&quot;&gt;10&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;).
That transfers control to the catch-block created above, and the
Wasm continues running, unboxing the exception payload and running
the handler!&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
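The handler search at the heart of this walkthrough can be modeled in a few lines. The sketch below is a simplified, hypothetical model (the names `Frame`, `HandlerEntry`, and `find_handler` are invented for illustration; the real logic lives in Wasmtime's `unwinder` crate and the `throw.rs` closures linked above): walk the frame-pointer chain, and for each frame's return PC, scan a table of handler entries, computing the resume SP from the fixed FP-to-SP offset on a match.

```rust
// Hypothetical, simplified model of exception-handler search. The real
// implementation walks actual frames and parses serialized exception
// tables; here both are plain slices, and names are invented.

#[derive(Clone, Copy)]
struct Frame {
    fp: usize,        // frame pointer for this frame
    return_pc: usize, // PC of the instruction after the call
}

#[derive(Clone, Copy)]
struct HandlerEntry {
    call_pc: usize,    // return PC of the `try_call` site
    handler_pc: usize, // where to resume on catch
    fp_to_sp: usize,   // fixed frame size: SP = FP - fp_to_sp
    tag: Option<u32>,  // None = catch-all handler
}

/// Returns (resume PC, resume SP) for the first matching handler, if any.
fn find_handler(
    frames: &[Frame],
    table: &[HandlerEntry],
    thrown_tag: u32,
) -> Option<(usize, usize)> {
    for frame in frames {
        for entry in table {
            // A catch-all matches any tag; a tag-specific handler must
            // match the thrown exception's tag exactly.
            let tag_matches = entry.tag.map_or(true, |t| t == thrown_tag);
            if entry.call_pc == frame.return_pc && tag_matches {
                return Some((entry.handler_pc, frame.fp - entry.fp_to_sp));
            }
        }
    }
    None // no handler in any frame: the exception escapes to the host
}
```

The returned pair is exactly what the architecture-specific `resume_to_exception_handler` stub consumes: a PC to jump to and an SP to restore (with FP restored from the matched frame).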
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;So we have Wasm exception handling now! For all of the interesting
design questions we had to work through, the end was pretty
anticlimactic. I landed &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11326&quot;&gt;the final
PR&lt;&#x2F;a&gt;, and
after a follow-up cleanup PR
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11467&quot;&gt;1&lt;&#x2F;a&gt;) and
some fuzzbug fixes
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11500&quot;&gt;1&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11507&quot;&gt;2&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11530&quot;&gt;3&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11531&quot;&gt;4&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11535&quot;&gt;5&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11564&quot;&gt;6&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11554&quot;&gt;7&lt;&#x2F;a&gt;) having
mostly to do with null-pointer handling and other edge cases in the
type system, plus one interaction with tail-calls (and a
separate&#x2F;pre-existing &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11689&quot;&gt;s390x ABI
bug&lt;&#x2F;a&gt; that it
uncovered), it has been basically stable. We pretty quickly got a few
user reports:
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;#narrow&#x2F;channel&#x2F;217117-cranelift&#x2F;topic&#x2F;Running.20lua.205.2E1.20wasi&#x2F;near&#x2F;535368031&quot;&gt;here&lt;&#x2F;a&gt;
it was reported as working for a Lua interpreter whose setjmp&#x2F;longjmp
inside Wasm is implemented on top of exceptions, and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;#narrow&#x2F;channel&#x2F;217126-wasmtime&#x2F;topic&#x2F;WebAssembly.20exceptions.20proposal.20is.20now.20implemented&#x2F;near&#x2F;536299383&quot;&gt;here&lt;&#x2F;a&gt;
it enabled Kotlin-on-Wasm to run and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;#narrow&#x2F;channel&#x2F;217126-wasmtime&#x2F;topic&#x2F;WebAssembly.20exceptions.20proposal.20is.20now.20implemented&#x2F;near&#x2F;543748960&quot;&gt;pass a large
testsuite&lt;&#x2F;a&gt;.
Not bad!&lt;&#x2F;p&gt;
&lt;!--
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;220: +82 -84
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;223: +5 -6
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;224: +16 -19
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10510: +4199 -423
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10609: +109 -100
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10709: +49 -5
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;226: +26 -10
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;221: +1 -1
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10571: +36 -24
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;225: +1 -1
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10590: +19 -5
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;227: +1 -1
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10747: +84 -5
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10748: +84 -5
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10919: +1472 -347
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11230: +2490 -191
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;231: +93 -12
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11321: +1771 -322
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11326: +2593 -523
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11467: +269 -227
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11500: +246 -116
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11507: +15 -1
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11511: +6 -2
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11514: +758 -531
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11530: +12 -2
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11531: +1 -0
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11533: +31 -20
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11535: +31 -1
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11554: +2 -0
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11564: +29 -5
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10485: +303 -203
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;212: +1 -2
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10502: +1338 -767
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;214: +7 -3
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;216: +56 -8
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10554: +15 -26
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10555: +13 -6
--&gt;
&lt;p&gt;All told, this took 37 PRs with a diff-stat of &lt;code&gt;+16264 -4004&lt;&#x2F;code&gt; (16KLoC
total) -- certainly not the &quot;small-to-medium-sized&quot; project I had
initially optimistically expected, but I&#x27;m happy we were able to build
it out and get it to a stable state relatively easily. It was a
rewarding journey in a different way than a lot of my past work
(mostly on the Cranelift side) -- where many of my past projects have
been very open-ended design or even research questions, here we
had the high-level shape already and all of the work was in designing
high-quality details and working out all the interesting interactions
with the rest of the system. I&#x27;m happy with how clean the IR design
turned out in particular, and I don&#x27;t think it would have done so
without the really excellent continual discussion with the rest of the
Cranelift and Wasmtime contributors (thanks to Nick Fitzgerald and
Alex Crichton in particular here).&lt;&#x2F;p&gt;
&lt;p&gt;As an aside: I am happy to see how, aside from use-cases for Wasm
exception handling, the exception support in Cranelift itself has been
useful too.  As mentioned above, &lt;code&gt;cg_clif&lt;&#x2F;code&gt; picked it up almost as soon
as it was ready; but then, as an unexpected and pleasant surprise,
Alex subsequently &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11592&quot;&gt;rewrote Wasmtime&#x27;s trap
unwinding&lt;&#x2F;a&gt; to
use Cranelift exception handlers in our entry trampolines rather than
a setjmp&#x2F;longjmp, as the latter have longstanding semantic
questions&#x2F;issues in Rust. This took &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11629&quot;&gt;one more
intrinsic&lt;&#x2F;a&gt;,
which I implemented after discussing with Alex how best to expose
exception handler addresses to custom unwind logic without the full
exception unwinder, but was otherwise a pretty direct application of
&lt;code&gt;try_call&lt;&#x2F;code&gt; and our exception ABI.  General building blocks prove
generally useful, it seems!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to Alex Crichton and Nick Fitzgerald for providing feedback on
a draft of this post!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;To explain myself a bit, I underestimated the interactions of
exception handling with garbage collection (GC); I hadn&#x27;t
realized yet that &lt;code&gt;exnref&lt;&#x2F;code&gt;s were a full first-class value and
would need to be supported including in the host API. Also, it
turns out that exceptions can cross the host&#x2F;guest boundary, and
goodness knows that gets really fun really fast. I was &lt;em&gt;only&lt;&#x2F;em&gt;
off by a factor of two on the compiler side at least!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;From an implementation perspective, the dynamic, interprocedural
nature of exceptions is what makes them far more interesting,
and involved, than classical control flow such as conditionals,
loops, or calls! This is why we need a mechanism that involves
runtime data structures, &quot;stack walks&quot;, and lookup tables,
rather than simply generating a jump to the right place: the
target of an exception-throw can only be computed at runtime,
and we need a convention to transfer control with &quot;payload&quot; to
that location.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;For those so inclined, this is a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Monad_(functional_programming)&quot;&gt;&lt;em&gt;monad&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;,
and e.g. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Haskell&quot;&gt;Haskell&lt;&#x2F;a&gt;
implements the ability to have &quot;result or error&quot; types that
return from a sequence early via
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hackage.haskell.org&#x2F;package&#x2F;base-4.21.0.0&#x2F;docs&#x2F;Data-Either.html#t:Either&quot;&gt;&lt;code&gt;Either&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
explicitly describing the concept as such. The &lt;code&gt;?&lt;&#x2F;code&gt; operator
serves as the &quot;bind&quot; of the monad: it connects an
error-producing computation with a use of the non-error value,
returning the error directly if one is given instead.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
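The same short-circuiting shape is easy to see in a toy Rust example (not Wasmtime code; the functions here are invented for illustration): each fallible step either hands a value to the next step or returns the error immediately via `?`.

```rust
// Toy illustration of `?` as the "bind" of the error monad: each step
// produces a value for the next step, or short-circuits the whole
// sequence with the error. Function names are invented.

fn parse(s: &str) -> Result<i32, String> {
    s.trim().parse::<i32>().map_err(|e| e.to_string())
}

fn halve(n: i32) -> Result<i32, String> {
    if n % 2 == 0 { Ok(n / 2) } else { Err(format!("{n} is odd")) }
}

fn pipeline(s: &str) -> Result<i32, String> {
    let n = parse(s)?; // on Err, return it from `pipeline` immediately
    let h = halve(n)?; // likewise
    Ok(h + 1)
}
```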
&lt;div class=&quot;footnote-definition&quot; id=&quot;4&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;So named for the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;IA-64&quot;&gt;Intel Itanium
(IA-64)&lt;&#x2F;a&gt;, an instruction-set
architecture that happened to be the first ISA where this scheme was
implemented for C++, and is now essentially dead (before its time! woefully
misunderstood!) but for that legacy...&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;5&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;5&lt;&#x2F;sup&gt;
&lt;p&gt;It&#x27;s worth briefly noting here that the Wasm exception handling
proposal went through a somewhat twisty journey, with an earlier
variant (now called &quot;legacy exception handling&quot;) that shipped in
some browsers but was never standardized, and that handled rethrows
in a different way. In particular, that proposal did not offer
first-class exception object references that could be rethrown;
instead, it had an explicit &lt;code&gt;rethrow&lt;&#x2F;code&gt; instruction. I wasn&#x27;t
around for the early debates about this design, but in my
opinion, providing first-class exception object references that
can be plumbed around via ordinary dataflow is far nicer. It
also permits a simpler implementation, as long as one literally
implements the semantics by always allocating an exception
object.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#11&quot;&gt;11&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;6&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;6&lt;&#x2F;sup&gt;
&lt;p&gt;To be precise, because it may be a little surprising:
&lt;code&gt;catch_ref&lt;&#x2F;code&gt; pushes both the payload values &lt;em&gt;and&lt;&#x2F;em&gt; the exception
reference onto the operand stack at the handler destination. In
essence, the rule is: tag-specific variants always unpack the
payloads; and &lt;em&gt;also&lt;&#x2F;em&gt;, &lt;code&gt;_ref&lt;&#x2F;code&gt; variants always push the exception
reference.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;7&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;7&lt;&#x2F;sup&gt;
&lt;p&gt;In particular, we have defined our own ABI in Wasmtime to allow
universal tail calls between any two signatures to work, as
required by the Wasm tail-calling opcodes. This ABI, called
&quot;&lt;code&gt;tail&lt;&#x2F;code&gt;&quot;, is based on the standard System V calling convention
but differs in that the callee cleans up any stack arguments.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;8&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;8&lt;&#x2F;sup&gt;
&lt;p&gt;It&#x27;s not compiler hacking without excessive trouble from
edge-cases, of course, so we had one &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10709&quot;&gt;interesting
bug&lt;&#x2F;a&gt;
from the &lt;em&gt;empty handler-list&lt;&#x2F;em&gt; case, which means we have to force
edge-splitting anyway for all &lt;code&gt;try_call&lt;&#x2F;code&gt;s.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;9&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;9&lt;&#x2F;sup&gt;
&lt;p&gt;Of course, while doing this, I managed to create
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;security&#x2F;advisories&#x2F;GHSA-vvp9-h8p2-xwfc&quot;&gt;CVE-2025-61670&lt;&#x2F;a&gt;
in the C&#x2F;C++ API by a combination of (i) a simple typo in the C
FFI bindings (&lt;code&gt;as&lt;&#x2F;code&gt; vs. &lt;code&gt;from&lt;&#x2F;code&gt;, which is important when
transferring ownership!) and (ii) not realizing that the C++
wrapper does not properly maintain single ownership. We didn&#x27;t
have ASAN tests, so I didn&#x27;t see this upfront; Alex discovered
the issue while updating the Python bindings (which quickly
found the leak) and managed the CVE. Sorry and thanks!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;10&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;10&lt;&#x2F;sup&gt;
&lt;p&gt;It turns out that even three lines of assembly are hard to get
right: the s390x variant &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10973&quot;&gt;had a
bug&lt;&#x2F;a&gt;
where we got the register constraints wrong (GPR 0 is special on
s390x, and a branch-to-register can only take GPR 1--15; we
needed a different constraint to represent that) and had a
miscompilation as a result. Thanks to our resident s390x
compiler hacker Ulrich Weigand for tracking this down.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;11&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;11&lt;&#x2F;sup&gt;
&lt;p&gt;Of course, always boxing exceptions is not the only way to
implement the proposal. It should be possible to &quot;unbox&quot;
exceptions and skip the allocation, carrying payloads directly
through some other engine state, if they are not caught as
references. We haven&#x27;t implemented this optimization in Wasmtime,
and we expect the allocation performance for small exception
objects to be adequate for most use-cases.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Compilation of JavaScript to Wasm, Part 3: Partial Evaluation</title>
        <published>2024-08-28T00:00:00+00:00</published>
        <updated>2024-08-28T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2024/08/28/weval/"/>
        <id>https://cfallin.org/blog/2024/08/28/weval/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2024/08/28/weval/">&lt;p&gt;&lt;em&gt;This is the final post of a three-part series covering my work on
&quot;fast JS on Wasm&quot;; the &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;first
post&lt;&#x2F;a&gt; covered PBL, a portable
interpreter that supports inline caches, the &lt;a href=&quot;&#x2F;blog&#x2F;2024&#x2F;08&#x2F;27&#x2F;aot-js&#x2F;&quot;&gt;second
post&lt;&#x2F;a&gt; covered ahead-of-time compilation in
general terms, and this post discusses how we actually build the
ahead-of-time compiler backends. Please read the first two posts for
useful context!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;In the last post, we covered how one might perform ahead-of-time
compilation of JavaScript to low-level code -- WebAssembly bytecode --
in high-level terms. We discussed the two kinds of bytecode present in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;spidermonkey.dev&#x2F;&quot;&gt;SpiderMonkey&lt;&#x2F;a&gt;: bytecode for JavaScript
function bodies (as a 1-to-1 translation from the source JS) and
bytecode for the inline cache (IC) bodies that implement each operator
in those function bodies. We outlined the low-level code that the AOT
compilation of each bytecode would produce. But how do we actually
perform that compilation?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;direct-single-pass-compiler-style&quot;&gt;Direct (Single-Pass) Compiler Style&lt;&#x2F;h2&gt;
&lt;p&gt;It would be straightforward enough to implement a direct compiler from
each bytecode (JS and CacheIR) to Wasm. The most direct kind of
compiler is sometimes called a &quot;template compiler&quot; or (in the context
of JIT engines) a &quot;template JIT&quot;: for each source bytecode, emit a
fixed sequence of target code. This is a fairly common technique, and
if one examines the compiler tiers of the well-known JIT compilers
today, one finds implementations such as
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;43d431ff148b331b463fcf61e99c176e3d3c0fb4&#x2F;js&#x2F;src&#x2F;jit&#x2F;BaselineCodeGen.cpp#2764-2782&quot;&gt;this&lt;&#x2F;a&gt;
(implementation of &lt;code&gt;StrictEq&lt;&#x2F;code&gt; opcode in SpiderMonkey&#x27;s JS opcode
baseline compiler, emitting two pops, an IC invocation, and a push) or
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;2389dcceaad666ad1fb6acfd1520bbb4ea5a85e4&#x2F;winch&#x2F;codegen&#x2F;src&#x2F;visitor.rs#L1889-L1902&quot;&gt;this&lt;&#x2F;a&gt;
(implementation of a ternary &lt;code&gt;select&lt;&#x2F;code&gt; Wasm opcode in Wasmtime&#x27;s Winch
baseline compiler, emitting three pops, a compare, a conditional move,
and a push). SpiderMonkey already has baseline compiler
implementations for &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;jit&#x2F;BaselineCodeGen.cpp&quot;&gt;JS
opcodes&lt;&#x2F;a&gt;
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;jit&#x2F;BaselineCacheIRCompiler.cpp&quot;&gt;CacheIR
opcodes&lt;&#x2F;a&gt;,
abstracted over the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;43d431ff148b331b463fcf61e99c176e3d3c0fb4&#x2F;js&#x2F;src&#x2F;jit&#x2F;MacroAssembler.h#53&quot;&gt;MacroAssembler&lt;&#x2F;a&gt;,
allowing for reasonably easy retargeting to different ISAs. Why not
port these backends to produce Wasm bytecode?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;porting-spidermonkey-jit-to-wasm&quot;&gt;Porting SpiderMonkey JIT to Wasm?&lt;&#x2F;h2&gt;
&lt;p&gt;This approach &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bugzilla.mozilla.org&#x2F;show_bug.cgi?id=1863986&quot;&gt;has actually been
taken&lt;&#x2F;a&gt;, more or
less performing a &quot;direct port&quot; to Wasm by emitting Wasm bytecode
&quot;in-process&quot; (inside the Wasm module) and then using a special
hostcall interface to add this as a new callable function. This works
well enough where the hostcall approach is acceptable, but as I
discussed &lt;a href=&quot;&#x2F;blog&#x2F;2024&#x2F;08&#x2F;27&#x2F;aot-js&#x2F;#unique-characteristics-of-a-wasm-platform&quot;&gt;last
time&lt;&#x2F;a&gt;,
a few factors conspire against a direct port (i.e., the ISA target is
only part of the problem):&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For operational reasons (ahead-of-time deployment and instant start)
and security reasons (less attack surface, statically-knowable
code), Wasm-based platforms often disallow runtime code
generation. The &quot;add a new function&quot; hostcall&#x2F;hook might simply not
be available.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We might be running very short-lived request handlers or similar,
each starting from an identical snapshot: thus, there is no time to
&quot;warm up&quot;, generate appropriate Wasm bytecode at runtime, and load
and run it, even though execution may still be bottlenecked on (many
instances of) these short-lived request handlers.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;There is another practical reason as well. JITs are famously hard to
develop and debug because (i) the code that is actually running on the
system does not exist anywhere in the source -- it is ephemeral -- and
(ii) it is often tightly integrated with various low-level ABI details
and system invariants which can be easy to get wrong. On native
platforms, the difficulty of debugging subtle bugs in Firefox (and its
SpiderMonkey JIT engine) led Mozilla engineers to develop
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;rr-project.org&#x2F;&quot;&gt;rr&lt;&#x2F;a&gt;, the well-known open-source reversible
(time-travel) debugger. Now imagine developing a JIT that runs within
a Wasm module with runtime codegen hooks, &lt;em&gt;without&lt;&#x2F;em&gt; a state-of-the-art
debugging capability; in fact, in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;wasmtime.dev&quot;&gt;Wasmtime&lt;&#x2F;a&gt;,
which I help to develop, source-level debugging is improving slowly
but no debugger for Wasm bytecode-level execution exists at all
yet. (We hope to get there someday.) If I am to have any hope of
success, it seems like I will need to find another way to nail down
the correct semantics of the compiler&#x27;s output -- either by testing
and debugging in another (native) build somehow, or building better
tooling.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-new-hope-single-source-of-semantics&quot;&gt;A New Hope: Single Source of Semantics&lt;&#x2F;h2&gt;
&lt;p&gt;Here I come back to &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;PBL&lt;&#x2F;a&gt;: I
already have an
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;vm&#x2F;PortableBaselineInterpret.cpp&quot;&gt;interpreter&lt;&#x2F;a&gt;
that I painstakingly developed for both the CacheIR bodies of IC stubs
and the JS bytecode that invokes them, faithfully upholding all the
invariants of baseline-level execution in SpiderMonkey. In addition,
work on PBL can proceed in a native build as well -- in fact much of
my debugging was with the venerable &lt;code&gt;rr&lt;&#x2F;code&gt;, just as with any
native-platform SpiderMonkey development. PBL is &quot;just&quot; a portable C++
program, and so all of the work developing it on any platform then
transfers right to the Wasm build as well. PBL embodies a lot of
hard-won work encoding the correct semantics for each opcode,
sometimes not well-documented in the native backends -- for example,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;43d431ff148b331b463fcf61e99c176e3d3c0fb4&#x2F;js&#x2F;src&#x2F;vm&#x2F;PortableBaselineInterpret.cpp#4300-4302&quot;&gt;this
tidbit&lt;&#x2F;a&gt;
(far too much time to discover that wrinkle!), or &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;43d431ff148b331b463fcf61e99c176e3d3c0fb4&#x2F;js&#x2F;src&#x2F;vm&#x2F;PortableBaselineInterpret.cpp#1303-1304&quot;&gt;this
one&lt;&#x2F;a&gt;. What&#x27;s
more, these semantics sometimes change, or new opcodes are added.&lt;&#x2F;p&gt;
&lt;p&gt;Ideally we do not add more locations that encode these semantics than
absolutely necessary -- once per tier is already quite a lot. Can we
somehow develop (or -- major foreshadowing here -- &lt;em&gt;automatically
derive&lt;&#x2F;em&gt;) our compiler from all of the semantics written in direct
style in interpreter code?&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s take a look once more at (slightly simplified) implementations
of an &quot;integer add&quot; opcode on a hypothetical interpreted stack
machine, and then a baseline compiler implementation of that opcode
where the operand stack uses the native machine stack&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Interpreter                    | &#x2F;&#x2F; Compiler&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;...                               | ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;case Opcode::Add: {               | void visit_Add(Assembler&amp;amp; asm) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  uint32_t lhs = stack.pop();     |   asm.pop(r1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  uint32_t rhs = stack.pop();     |   asm.pop(r2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  uint32_t result = lhs + rhs;    |   asm.add(r1, r2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  stack.push(result);             |   asm.push(r1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  break;                          | }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If you squint enough, you might almost forget whether you&#x27;re looking
at the interpreter (&lt;em&gt;directly&lt;&#x2F;em&gt; executing the semantics) or the
compiler (&lt;em&gt;indirectly&lt;&#x2F;em&gt; executing the semantics, by emitting
code). This correspondence is not accidental, and the observation that
doing-the-thing and emitting-code-to-do-the-thing are so close is at
the heart of what is sometimes called staged programming, with (again
given sufficient squinting) practical examples in Lisp macros,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.cambridge.org&#x2F;core&#x2F;books&#x2F;abs&#x2F;twolevel-functional-languages&#x2F;contents&#x2F;4DA77E9B08CF24C00221555FC241D1C9&quot;&gt;&quot;two-level functional
languages&quot;&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;258993.259019&quot;&gt;MetaML&lt;&#x2F;a&gt;, and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;2184319.2184345&quot;&gt;more
modern takes&lt;&#x2F;a&gt; that
strive to provide as transparent a syntax as possible. There are even
proposals to leverage this correspondence directly when writing JIT
compilers (see
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;blog.mozilla.org&#x2F;javascript&#x2F;2017&#x2F;10&#x2F;20&#x2F;holyjit-a-new-hope&#x2F;&quot;&gt;HolyJit&lt;&#x2F;a&gt;),
writing down the semantics once to provide both the interpreter and
the compiler tier.&lt;&#x2F;p&gt;
&lt;p&gt;Most prominently in this space, of course, is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;3062341.3062381&quot;&gt;GraalVM&lt;&#x2F;a&gt; (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;labs.oracle.com&#x2F;pls&#x2F;apex&#x2F;f?p=LABS:0::APPLICATION_PROCESS%3DGETDOC_INLINE:::DOC_ID:1014&quot;&gt;alternate
PDF&lt;&#x2F;a&gt;),
a production-grade JIT compiler framework that takes &lt;em&gt;interpreters&lt;&#x2F;em&gt;
for arbitrary languages and &lt;em&gt;partially evaluates&lt;&#x2F;em&gt; the interpreter
code itself, together with the user&#x27;s program, to produce compiled code. This
mind-bending technique is known as the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Partial_evaluation#Futamura_projections&quot;&gt;first Futamura
projection&lt;&#x2F;a&gt;. (This
is the approach we will take too, with some significant
&lt;a href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;blog&#x2F;2024&#x2F;08&#x2F;28&#x2F;weval&#x2F;#graalvm&quot;&gt;differences&lt;&#x2F;a&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;How does this work?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-first-futamura-projection&quot;&gt;The First Futamura Projection&lt;&#x2F;h2&gt;
&lt;p&gt;Let&#x27;s say that we have a function &lt;code&gt;f(x, y)&lt;&#x2F;code&gt;. If we take &lt;code&gt;x&lt;&#x2F;code&gt; as some
constant &lt;code&gt;C&lt;&#x2F;code&gt;, we should be able to derive a function &lt;code&gt;f_C(y) = f(C, y)&lt;&#x2F;code&gt; that eliminates all occurrences of &lt;code&gt;x&lt;&#x2F;code&gt; (i.e., &lt;code&gt;x&lt;&#x2F;code&gt; is no longer a
free variable). This is &quot;just&quot; constant propagation, or &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Partial_application&quot;&gt;partial
application&lt;&#x2F;a&gt;, of
the function.&lt;&#x2F;p&gt;
&lt;p&gt;A little more concretely, let&#x27;s take &lt;code&gt;f&lt;&#x2F;code&gt; to be an interpreter, with
inputs &lt;code&gt;x&lt;&#x2F;code&gt;, the program to be interpreted, and &lt;code&gt;y&lt;&#x2F;code&gt;, the input that the
interpreted program computes on. &lt;code&gt;f(x, y)&lt;&#x2F;code&gt; is then the result of
running the program. If we could find &lt;code&gt;f_C(y)&lt;&#x2F;code&gt;, then &lt;code&gt;f_C&lt;&#x2F;code&gt; would be,
somehow, a &lt;em&gt;combination&lt;&#x2F;em&gt; of the interpreter and the program it
interprets, merged together: a compiled program.&lt;&#x2F;p&gt;
&lt;p&gt;Futamura, in his &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;repository.kulib.kyoto-u.ac.jp&#x2F;dspace&#x2F;handle&#x2F;2433&#x2F;103401&quot;&gt;seminal
paper&lt;&#x2F;a&gt;,
calls this combination a &quot;projection&quot; and describes three levels of
projection. The first is as above: combining an interpreter with its
program. The second and third projections are far more exotic:
combining the partial evaluator itself with an interpreter, producing
a compiler (that can then be applied to a program); or combining the
partial evaluator with itself as input, producing a compiler-compiler
(that can then be applied to an interpreter, producing a compiler,
that can then be applied to an input program, producing compiled
output). Mind-bending stuff.&lt;&#x2F;p&gt;
&lt;p&gt;I can now tell you what &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;weval&quot;&gt;weval&lt;&#x2F;a&gt;, the
WebAssembly partial evaluator, is: it is a tool that derives the first
Futamura projection of an interpreter inside a snapshot of a
WebAssembly module together with its input (bytecode).&lt;&#x2F;p&gt;
&lt;p&gt;Alright, but how?! How does one &quot;combine&quot; an interpreter with its
input?&lt;&#x2F;p&gt;
&lt;p&gt;The thinking that gives rise to the Futamura projections is very
&lt;em&gt;algebraic&lt;&#x2F;em&gt; in nature. In conventional algebra (with addition and
multiplication over the reals, for example, as taught to students
everywhere), we have a notion of &quot;plugging in&quot; certain constants and
simplifying the expression. Given &lt;code&gt;z = 2x + 3y + 4&lt;&#x2F;code&gt;, we can hold &lt;code&gt;x&lt;&#x2F;code&gt;
constant, say &lt;code&gt;x = 10&lt;&#x2F;code&gt;, then &quot;partially evaluate&quot; &lt;code&gt;z = 2*10 + 3y + 4 = 3y + 24&lt;&#x2F;code&gt;. We have produced an output (expression) that fully
incorporates the new knowledge and has no further references to the
input (variable) &lt;code&gt;x&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;An interpreter and its interpreted program are far, far more complex
than an algebraic expression and a numeric constant, of course. Let&#x27;s
explore in a bit more detail how one might &quot;plug in&quot; a program to an
interpreter and simplify the result.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;practical-first-futamura-projection-via-context-specialization&quot;&gt;Practical First Futamura Projection via Context Specialization&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;a-naive-approach-constant-propagation&quot;&gt;A Naive Approach: Constant Propagation&lt;&#x2F;h3&gt;
&lt;p&gt;We might start with the observation that, in the compiler-optimization
sphere, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Constant_folding&quot;&gt;constant
folding&lt;&#x2F;a&gt; looks a lot
like the &quot;plug in a value and simplify&quot; mechanism that we know from
algebra. In essence, what we want to say is: take an interpreter,
specify that the program it interprets is a &lt;em&gt;constant&lt;&#x2F;em&gt;, and (somehow)
propagate the consequences of that through the code.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s take a look at an interpreter body that interprets a &lt;em&gt;single
opcode&lt;&#x2F;em&gt; first:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;switch&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt;pc&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Add&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;    uint32_t&lt;&#x2F;span&gt;&lt;span&gt; lhs &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; stack&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span&gt;sp&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;++&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;        &#x2F;&#x2F; stack pop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;    uint32_t&lt;&#x2F;span&gt;&lt;span&gt; rhs &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; stack&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span&gt;sp&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;++&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;        &#x2F;&#x2F; stack pop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;    uint32_t&lt;&#x2F;span&gt;&lt;span&gt; result &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; lhs &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;+&lt;&#x2F;span&gt;&lt;span&gt; rhs&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;       &#x2F;&#x2F; operation logic&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;    stack&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;--&lt;&#x2F;span&gt;&lt;span&gt;sp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; result&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;              &#x2F;&#x2F; stack push&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword z-punctuation z-terminator&quot;&gt;    break;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Sub&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Cmp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Jmp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s take &lt;code&gt;pc&lt;&#x2F;code&gt; to be a constant, and furthermore assume that the
&lt;em&gt;memory that &lt;code&gt;pc&lt;&#x2F;code&gt; points to&lt;&#x2F;em&gt; (the bytecode) is constant; and for this
example, assume that &lt;code&gt;pc[0] == Opcode::Add&lt;&#x2F;code&gt;. Let&#x27;s also say that &lt;code&gt;sp&lt;&#x2F;code&gt;
starts at &lt;code&gt;1024&lt;&#x2F;code&gt;. Then we can &quot;constant fold&quot; the whole interpreter by
(i) simplifying the switch-statement, which is now operating on a
&lt;em&gt;constant&lt;&#x2F;em&gt; input (the opcode), and (ii) propagating through any other
constants we know based on initial state, such as &lt;code&gt;sp&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; at pc[0]: Opcode::Add, with initial sp == 1024.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  uint32_t&lt;&#x2F;span&gt;&lt;span&gt; lhs &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; stack&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;1024&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  uint32_t&lt;&#x2F;span&gt;&lt;span&gt; rhs &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; stack&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;1025&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  uint32_t&lt;&#x2F;span&gt;&lt;span&gt; result &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; lhs &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;+&lt;&#x2F;span&gt;&lt;span&gt; rhs&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  stack&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;1025&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; result&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  sp &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt; 1025&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If one squints appropriately, one could claim that we &quot;compiled&quot; the
&lt;code&gt;Add&lt;&#x2F;code&gt; opcode here: we converted the generic interpreter, with
switch-cases for every opcode and operand-stack accesses parameterized
on current stack depth, to code that is specific to the actual program
we&#x27;re interpreting.&lt;&#x2F;p&gt;
&lt;p&gt;So is that it? Can we do a first Futamura projection by holding the
&quot;bytecode&quot; memory constant, and doing constant folding (including
branch folding)? It can&#x27;t be that easy, right?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-problem-lattices-and-loops&quot;&gt;The Problem: Lattices and Loops&lt;&#x2F;h3&gt;
&lt;p&gt;Indeed, it quickly becomes much more difficult. Consider what happens
as we widen our scope from programs of &lt;em&gt;one&lt;&#x2F;em&gt; opcode to programs of
&lt;em&gt;two&lt;&#x2F;em&gt; opcodes (!). We&#x27;ll have to update our interpreter to include a
loop, and an update to &lt;code&gt;pc&lt;&#x2F;code&gt; after it &quot;fetches&quot; each opcode:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;while&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; (&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-language&quot;&gt;true&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  switch&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; (&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt;pc&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;++&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Add&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Sub&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Cmp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Jmp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A compiler&#x27;s constant-propagation&#x2F;folding pass, and other passes that
compute &lt;em&gt;properties of program values&lt;&#x2F;em&gt; and then mutate the program
accordingly, are usually built around an iterated dataflow fixpoint
solver&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;. To modify a program at compile time, we must work with
properties that are always true -- even when a particular expression
or variable in the program might have many different values during
execution. For example, a loop body might increment an &lt;code&gt;index&lt;&#x2F;code&gt;
variable, giving a different value for each iteration of the loop, so
we cannot conclude that the value is constant. We need a way of
merging different possibilities together to find properties that are
true in all cases.&lt;&#x2F;p&gt;
&lt;p&gt;The constant-propagation analysis works by computing properties that
lie in a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lattice_(order)&quot;&gt;lattice&lt;&#x2F;a&gt;, a
mathematical structure with values that allow for &quot;merging&quot; (the
&quot;meet&quot; function) and with certain properties around that merging that
make it well-behaved, so that we always arrive at the same single
solution. Typically, on a lattice used by a
constant-propagation&#x2F;folding analysis, if we see in one instance that
&lt;code&gt;x&lt;&#x2F;code&gt; is constant &lt;code&gt;K1&lt;&#x2F;code&gt;, and in another instance that &lt;code&gt;x&lt;&#x2F;code&gt; is constant
&lt;code&gt;K2&lt;&#x2F;code&gt;, then we have to move &quot;down the lattice&quot; and &quot;merge&quot; to a fact
that claims nothing at all: &quot;&lt;code&gt;x&lt;&#x2F;code&gt; has some arbitrary non-constant value
at runtime&quot;. This is sometimes called the &quot;meet-over-all-paths&quot;
solution, because we merge possible behavior over all control-flow
paths that could lead to some program point.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the &lt;code&gt;pc&lt;&#x2F;code&gt; variable in this loop: on the first iteration, it
begins at the start of the bytecode buffer (offset &lt;code&gt;0&lt;&#x2F;code&gt;). After one
opcode, it lies at offset &lt;code&gt;1&lt;&#x2F;code&gt;. And so on. Can a constant-propagation
pass &quot;bake in&quot; a single constant value for &lt;code&gt;pc&lt;&#x2F;code&gt; that will be valid for
every iteration of the loop? No: in fact, it is not constant.&lt;&#x2F;p&gt;
&lt;p&gt;Likewise for the bytecode pointed-to by &lt;code&gt;pc&lt;&#x2F;code&gt;: in the first iteration
of the loop, it may be &lt;code&gt;Add&lt;&#x2F;code&gt;, and in the second iteration, something
else; there is no &lt;em&gt;single&lt;&#x2F;em&gt; constant opcode that we can propagate into
the &lt;code&gt;switch&lt;&#x2F;code&gt; statement to simplify the whole interpreter body down to
the specific code for each opcode.&lt;&#x2F;p&gt;
&lt;p&gt;The essence of the problem here is that the first Futamura projection
of an interpreter needs to &lt;em&gt;somehow be aware of the interpreter
loop&lt;&#x2F;em&gt;. In essence, it needs to &quot;unroll&quot; the loop: it needs to do a
&lt;em&gt;separate&lt;&#x2F;em&gt; constant propagation for each opcode.&lt;&#x2F;p&gt;
&lt;p&gt;One might be tempted to build a transform that &quot;traces&quot; across edges,
including the interpreter loop
backedge(s). &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;pypy.org&#x2F;&quot;&gt;PyPy&lt;&#x2F;a&gt;&#x27;s metatracing works something
like this. One runs into two issues, though: (i) the compilation is
&quot;loop-centric&quot; (rather than complete compilation of each
function&#x2F;method), and (ii) merge-points are tricky. Consider an
if-else sequence with two sides that eventually jump back to the same
&lt;code&gt;pc&lt;&#x2F;code&gt;. Do we keep &quot;tracing&quot; the path on each side or do we detect
reconvergence and stitch the resulting code back together? That is, if
we have the bytecode&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-bytecode-cfg-web.svg&quot; alt=&quot;Figure: control-flow if-else diamond with bytecode&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;we might naively &quot;follow the control flow&quot; in the interpreter,
generating one path that is &lt;code&gt;pc=0,1,2,3,5,6,7&lt;&#x2F;code&gt; and another that is
&lt;code&gt;pc=0,1,4,5,6,7&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-bytecode-cfg-traces-web.svg&quot; alt=&quot;Figure: control-flow if-else diamond with bytecode, with traces along control-flow paths&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Note that we have duplicated the &quot;tail&quot; (&lt;code&gt;pc=5&lt;&#x2F;code&gt; onward) and this code
growth can become exponential if we have, say, a tree of conditionals.&lt;&#x2F;p&gt;
&lt;p&gt;The property that we want is that the CFG of the original bytecode is
somehow reflected into the compiled code: take the original
&lt;em&gt;structure&lt;&#x2F;em&gt; of the bytecode, for each opcode simulate the execution of
one iteration of the interpreter loop (on that opcode), and stitch it
together. So given the initial &lt;em&gt;interpreter loop&lt;&#x2F;em&gt; (left) and &lt;em&gt;bytecode
being interpreted&lt;&#x2F;em&gt; (right) here:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-interpreter-and-bytecode-cfg-web.svg&quot; alt=&quot;Figure: interpreter CFG and bytecode CFG&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;we want something like the following, where each interpreted opcode is
initially &quot;unrolled&quot; into a whole copy of the interpreter loop:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-bytecode-cfg-full-of-interpreters-web.svg&quot; alt=&quot;Figure: bytecode CFG filled with copies of interpreter CFG&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This might seem a bit wild at first -- a whole bunch of copies of the
entire interpreter? -- until one sees that this &lt;em&gt;reduces the problem
to a previously-solved one above&lt;&#x2F;em&gt;, namely, how to specialize an
interpreter for a single opcode. We can take each &lt;em&gt;copy&lt;&#x2F;em&gt; of the
interpreter loop and constant-propagate and branch-fold it. The effect
is as if we surgically plucked the implementations of each opcode out
of the middle of the interpreter CFG and built a compiled version of
the bytecode with them:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-bytecode-cfg-full-of-specialized-interpreters-web.svg&quot; alt=&quot;Figure: bytecode CFG with specialized copies of interpreter CFG&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;To make this work, it still seems like we would need some sort of
&quot;visibility&quot; into the workings of the interpreter: we would need to
tell the transform tool, the first Futamura projector, about our
bytecode and its structure, and the interpreter loop that is meant to
operate on it, so it could perform this careful surgery.&lt;&#x2F;p&gt;
&lt;p&gt;GraalVM solves this problem in a very simple and direct way: the
interpreter needs to be written in terms of GraalVM AST classes, and
given uses of those classes, the transform can &quot;pick out&quot; the right
parts of the logic, while otherwise being aware of the overall CFG of
the interpreted program &lt;em&gt;directly&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;I considered but rejected such an approach: I did not want to build a
tool that would require rewriting an existing interpreter, or deep
surgery to rebuild it in terms of the framework&#x27;s new
abstractions. Such a transform would be very error-prone and hinder
adoption. Is there another way to get the transform to &quot;see&quot; the
iterations of the interpreter loop separately?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;contexts&quot;&gt;Contexts&lt;&#x2F;h3&gt;
&lt;p&gt;The breakthrough for weval came when I realized that &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Pointer_analysis#Context-Insensitive,_Flow-Insensitive_Algorithms&quot;&gt;&lt;em&gt;context
sensitivity&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;,
a standard tool in static analysis (e.g. pointer&#x2F;alias analysis),
could allow us to escape the tyranny of meet-over-all-paths collapsing
our carefully-curated constant values into a mush of &quot;runtime&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;The key idea is to make &lt;code&gt;pc&lt;&#x2F;code&gt; itself the context. The constant-folding
analysis otherwise &lt;em&gt;doesn&#x27;t know anything about interpreters&lt;&#x2F;em&gt;; it just
has an intrinsic that means &quot;please process the code that flows from
this point in a different context&quot;, and it keeps the &quot;constant value&quot;
state &lt;em&gt;separately&lt;&#x2F;em&gt; for each context.&lt;&#x2F;p&gt;
&lt;p&gt;Given this intrinsic, we can &quot;simply&quot; update the context whenever we
update &lt;code&gt;pc&lt;&#x2F;code&gt;; so the loop looks something like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;update_context&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;pc&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; intrinsic visible to the analysis, otherwise a no-op&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;while&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; (&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-language&quot;&gt;true&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  switch&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; (&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt;pc&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Add&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;      &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      pc&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-punctuation z-terminator&quot;&gt;++;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;      update_context&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;pc&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; intrinsic visible to the analysis, otherwise a no-op&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword z-punctuation z-terminator&quot;&gt;      break;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;               &#x2F;&#x2F; loop backedge&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s walk through a constant-folding example. We start with &lt;code&gt;pc&lt;&#x2F;code&gt; at
the beginning of the bytecode (for simplicity let&#x27;s say &lt;code&gt;pc=0&lt;&#x2F;code&gt;, though
it&#x27;s actually somewhere in the heap snapshot), and we know that &lt;code&gt;*pc&lt;&#x2F;code&gt;
is &lt;code&gt;Opcode::Add&lt;&#x2F;code&gt;. So in the &lt;em&gt;context&lt;&#x2F;em&gt; of &lt;code&gt;pc=0&lt;&#x2F;code&gt;, we can know that &lt;code&gt;pc&lt;&#x2F;code&gt;
at the top of the loop starts as a constant. We see the &lt;code&gt;switch&lt;&#x2F;code&gt;, we
branch-fold because the opcode at &lt;code&gt;*pc&lt;&#x2F;code&gt; is a constant (because &lt;code&gt;pc&lt;&#x2F;code&gt; is
a constant and it points to constant memory). We trace through the
implementation of &lt;code&gt;Opcode::Add&lt;&#x2F;code&gt; (and disregard all the other
switch-cases). We increment &lt;code&gt;pc&lt;&#x2F;code&gt; and see a loop backedge -- isn&#x27;t this
where the constant-value analysis sees that the &lt;code&gt;pc=0&lt;&#x2F;code&gt; case (this
iteration) and the &lt;code&gt;pc=1&lt;&#x2F;code&gt; case (the next iteration) &quot;meet&quot; and we
can&#x27;t conclude it&#x27;s a constant at all?&lt;&#x2F;p&gt;
&lt;p&gt;No! Because just before the loop backedge, we &lt;em&gt;updated the
context&lt;&#x2F;em&gt;. As we trace through the code and track constant values, and
follow control-flow edges and propagate that state to their
destinations, we are now in the context of &lt;code&gt;pc=1&lt;&#x2F;code&gt;. We reach the loop
header again, but in a &lt;em&gt;new context&lt;&#x2F;em&gt;, so nothing collapses to &quot;not
constant&quot; &#x2F; &quot;runtime&quot;; &lt;code&gt;pc&lt;&#x2F;code&gt; is actually a known constant
everywhere. In other words, we go from this analysis situation:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-constant-prop-loop-web.svg&quot; alt=&quot;Figure: constant-propagation attempting to find a constant PC in an interpreter loop&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;where the analysis &lt;em&gt;correctly&lt;&#x2F;em&gt; determines that PC is not constant (the
backedge carries a value &lt;code&gt;pc=1&lt;&#x2F;code&gt; into the same block that receives
&lt;code&gt;pc=0&lt;&#x2F;code&gt; from the entry point, so we can only conclude &lt;code&gt;Unknown&lt;&#x2F;code&gt;), to
this one:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-constant-prop-loop-ctx-web.svg&quot; alt=&quot;Figure: constant-propagation finds constant PCs given context-sensitivity&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;em&gt;contexts&lt;&#x2F;em&gt; are thus the way in which we do the code duplication
implied earlier (a separate copy of the interpreter body for each PC
location, then simplify). However, note that they are completely
driven by the intrinsics in the code that the interpreter developer
writes: the analysis &lt;em&gt;does not know about the interpreter loop&lt;&#x2F;em&gt;
otherwise.&lt;&#x2F;p&gt;
&lt;p&gt;The overall analysis loop processes &lt;code&gt;(context, block)&lt;&#x2F;code&gt; &lt;em&gt;tuples&lt;&#x2F;em&gt;, and
carries abstract state (known constant, or &lt;code&gt;Unknown&lt;&#x2F;code&gt;) per &lt;code&gt;(context, SSA value)&lt;&#x2F;code&gt; &lt;em&gt;tuple&lt;&#x2F;em&gt;. When we perform the transform, we duplicate each
original basic block into a basic block per context, but only on
demand, when the block is reached in some context. Branches in each
block resolve to the appropriate target block in the appropriate
(possibly updated) context. This is key: the interpreter &lt;em&gt;backedge&lt;&#x2F;em&gt; in
the original CFG becomes an edge from &quot;tail block in context i&quot; to
&quot;loop header block in context i+1&quot;; it becomes an edge to the next
opcode&#x27;s code in the compiled function body. This may be a forward
edge or a backedge, depending on the original bytecode CFG&#x27;s shape.&lt;&#x2F;p&gt;
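&lt;p&gt;The bookkeeping for this can be sketched as follows (illustrative types and a toy fixed-size table, not weval&#x27;s actual data structures): every block lookup is keyed on a &lt;code&gt;(context, block)&lt;&#x2F;code&gt; pair, and a specialized copy is created on first use, so a backedge taken in context &lt;code&gt;i&lt;&#x2F;code&gt; resolves to a fresh copy of the loop header in context &lt;code&gt;i+1&lt;&#x2F;code&gt; rather than merging back into the same header:&lt;&#x2F;p&gt;

```cpp
// Illustrative sketch of context-keyed block duplication. A real
// implementation would use a hash map; a linear-scan table keeps
// this sketch free of library dependencies.
struct Key {
    unsigned long long context; // e.g. the interpreter pc
    unsigned block;             // original basic-block id
};

struct Entry {
    Key key;
    unsigned specialized; // id of the specialized copy of the block
};

static Entry entries[1024];
static int n_entries = 0;
static int next_id = 0;

// Return the specialized block for (context, block), creating it on
// first use. Branch targets are rewritten through this lookup, so
// the same original block gets a distinct copy per context.
unsigned get_or_create(unsigned long long context, unsigned block) {
    for (int i = 0; i != n_entries; i++) {
        if (entries[i].key.context == context) {
            if (entries[i].key.block == block) {
                return entries[i].specialized;
            }
        }
    }
    Entry e;
    e.key.context = context;
    e.key.block = block;
    e.specialized = (unsigned)next_id;
    next_id = next_id + 1;
    entries[n_entries] = e;
    n_entries = n_entries + 1;
    return e.specialized;
}
```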
&lt;p&gt;One can see this as a sort of tracing through the control flow of the
interpreter, but with one crucial difference: contexts serve as a way
to &lt;em&gt;reconnect&lt;&#x2F;em&gt; merge points. We thus don&#x27;t get straight-line traces;
rather, when we encounter a point in the interpreter that we could
branch to &lt;code&gt;pc=K1&lt;&#x2F;code&gt; or &lt;code&gt;pc=K2&lt;&#x2F;code&gt;, we see a context update to &lt;code&gt;K1&lt;&#x2F;code&gt; or &lt;code&gt;K2&lt;&#x2F;code&gt;
and a branch to &lt;code&gt;entry&lt;&#x2F;code&gt;; we emit in the specialized function body an
edge to the &lt;code&gt;(K1, entry)&lt;&#x2F;code&gt; or &lt;code&gt;(K2, entry)&lt;&#x2F;code&gt; block, which is the point
in the compiled code that corresponds to the start of the &quot;copy of the
interpreter for that opcode&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;We thus get &lt;em&gt;exactly the behavior we outlined as our desired endpoint
above&lt;&#x2F;em&gt;, where the bytecode&#x27;s CFG gets populated with interpreter cases
for each opcode, and it &quot;just falls out&quot; of context sensitivity. This
is mind-blowing (at least, to me!).&lt;&#x2F;p&gt;
&lt;p&gt;An animation of a worked example for a simple two-opcode loop is
available in my talk
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=_T3s6-C38JI&amp;amp;t=27m06s&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;generic-mechanism-vs-interpreter-specific-transform&quot;&gt;Generic Mechanism vs. Interpreter-Specific Transform&lt;&#x2F;h3&gt;
&lt;p&gt;Seen a certain way, the use of constant-folding plus contexts seems a
little circuitous at first: &lt;code&gt;pc&lt;&#x2F;code&gt; at the top of the loop is a constant
&lt;code&gt;k&lt;&#x2F;code&gt; when we&#x27;re in context &lt;code&gt;pc=k&lt;&#x2F;code&gt;; isn&#x27;t that more complex than a
mechanism that &quot;just&quot; understands interpreter PCs directly? Well,
perhaps on the surface, but actually the implications of the
context-sensitivity are fairly deep and make much of the rest of the
analysis &quot;just work&quot; as well. In particular, &lt;em&gt;all&lt;&#x2F;em&gt; values in the loop
body get their own PC-specific context to specialize in, and
context-sensitivity is a fairly mechanical change to make to a
standard dataflow analysis, whereas something &lt;em&gt;specific&lt;&#x2F;em&gt; to
interpreters would likely be a lot more fragile and complex to
maintain.&lt;&#x2F;p&gt;
&lt;p&gt;One way that the robustness of this &quot;compositional&quot; approach becomes
more clear is when considering (and trying to convince oneself of) the
&lt;em&gt;correctness&lt;&#x2F;em&gt; of the approach. An extremely important property of the
&quot;context&quot; is that it is, with respect to correctness (i.e., semantic
equivalence to the original interpreter), &lt;em&gt;purely heuristic&lt;&#x2F;em&gt;. We could
decide to compute some arbitrary value and enter that context at any
point; we could get the PC &quot;wrong&quot;, miss an update somewhere, etc. The
worst that will happen is that we have two different constant values
of &lt;code&gt;pc&lt;&#x2F;code&gt; merge at the loop header, and we can no longer branch-fold and
specialize the interpreter loop body for an opcode. The degenerate
behavior is that we get a copy of the interpreter loop body (with
runtime control flow remaining) for every opcode. That&#x27;s not great as
an optimization, and of course we don&#x27;t want it, but it is &lt;em&gt;still
correct&lt;&#x2F;em&gt;. There is nothing that the (minor) modifications to the
interpreter, to add context sensitivity, can get wrong that will break
the semantics.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-weval-transform&quot;&gt;The weval Transform&lt;&#x2F;h2&gt;
&lt;p&gt;We&#x27;re finally ready to see &quot;the weval transform&quot;, which is the heart
of the first Futamura projection that weval performs. It actually does
the constant-propagation analysis at the same time as the code
transform.&lt;&#x2F;p&gt;
&lt;p&gt;The analysis and transform operate on SSA-based IR, which weval
obtains from the original Wasm function via my
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;waffle&quot;&gt;waffle&lt;&#x2F;a&gt; library, and then compiles
the SSA IR of the resulting specialized function back to Wasm function
bytecode.&lt;&#x2F;p&gt;
&lt;p&gt;The transform can be summarized with the following pseudocode:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Input: - original function in SSA,   Output: - specialized function in SSA,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       - with some parameters                  whose semantics are&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         specified as                          identical to the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         - &amp;quot;constant&amp;quot; or                       original function,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         - &amp;quot;points to constant mem&amp;quot;,           - given constants for&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       - and with &amp;quot;update_context&amp;quot;               parameters, and&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         intrinsics added.                     - assuming &amp;quot;constant&amp;quot; memory&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                                 remains the same.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;State:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Map from (context, block_orig) to block_specialized&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Map from (context, value_orig) to value_specialized&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Workqueue of (context, block_orig)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Abstract state (Constant, ConstantMem or Unknown)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 for each (context, value)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Dependency map from value_specialized to&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                      set((context, block_orig))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Flow-sensitive state&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (context, abstract values for any global state)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  per basic-block input&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Init:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Push (empty context, entry block) onto workqueue.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Main loop:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Pull a (context, block) from the workqueue.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- If a specialized block doesn&amp;#39;t exist in the map,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  create one.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- For each value in the original block:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - Fetch the abstract values for all operands&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (Constant or Unknown).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - Evaluate the operator: if constant input(s) imply&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    a constant output, compute that.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - Copy either the original operator with translated&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    input values, or a &amp;quot;constant value&amp;quot; operator,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    to the output block.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - Update value map and abstract state.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    - If dependency map shows dependency to&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      already-processed block, enqueue on workqueue&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      for reprocessing.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - On visiting an update_context intrinsic, update&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    flow-sensitive state.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- For the block terminator:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - Branch-fold if able based on constant inputs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    to conditionals or switches.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - For each possible target,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    - update blockparams, meeting into existing&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      abstract state;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    - enqueue on workqueue if blockparam abstract state&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      changed or if new block;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    - rewrite target in branch to specialized block&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      for this context, block pair.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And that&#x27;s it!&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; Note a few things in particular:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The algorithm is never aware that it is operating on &quot;an
interpreter&quot;, much less any specific interpreter. This is good: we
don&#x27;t need to tightly couple the compiler tooling here with the
use-case (a SpiderMonkey interpreter loop), and it means that we can
debug and test each separately.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The algorithm, at this high level at least, has no special cases or
other oddities: it is more or less what one would expect out of a
constant-propagation and branch folding pass that operates in a
fixpoint, with the only really novel part that it keeps a &quot;current
context&quot; as state and parameterizes all lookups in maps on that
context. Once one accepts that &quot;duplication of code&quot; (processing the
same blocks multiple times, in different contexts) is correct, it&#x27;s
reasonable to convince oneself that the whole algorithm is correct.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The algorithm leaves room for other kinds of flow-sensitive state
and intrinsics, if we &lt;em&gt;do&lt;&#x2F;em&gt; want to provide more optimizations to the
user. (More on this below!)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;I am omitting some details around SSA properties -- specifically, the
use of values in basic blocks in the original function that are
defined in other blocks higher in the domtree (which is perfectly
legal in SSA). Because the new specialized function body may have
different dominance relationships between blocks, some of these
use-def links are no longer legal. The naive approach to this problem
(and the one I took at first) is to transform into &quot;maximal SSA&quot;
first, in which all live values are carried as blockparams at every
block; that turns out to be really expensive, so weval computes a
&quot;cut-set&quot; of blocks at which max-SSA is actually necessary based on
the locations of context-update intrinsics (intuitively, approximately
just the interpreter backedge, but we want the definition of this to
be independent of the notion of an interpreter and where it does
context updates).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;using-weval-function-specialization-in-a-snapshot-abstraction&quot;&gt;Using weval: &quot;Function Specialization in a Snapshot&quot; Abstraction&lt;&#x2F;h2&gt;
&lt;p&gt;So how does one interact with this tool, as an interpreter author?&lt;&#x2F;p&gt;
&lt;p&gt;As noted above, for a Futamura projection, the key abstraction is
&quot;function specialization&quot; (or partial evaluation). In the context of
weval, this means that the interpreter requests a specialization of
some &quot;generic&quot; interpreter function, with some constant inputs (at
least the bytecode, presumably!), and gets back a function pointer to
another function in return. That specialized function is semantically
equivalent to the original, except that it has all of the specified
constants (function parameters and memory contents) &quot;baked in&quot; and has
been optimized accordingly.&lt;&#x2F;p&gt;
&lt;p&gt;Said another way: the specialized function body will be the compiled
form of the given bytecode.&lt;&#x2F;p&gt;
&lt;p&gt;Because the specialization works in the context of data in memory (the
bytecode), it has to happen in the context of running program
state. For example, in SpiderMonkey, there is a &lt;code&gt;class JSFunction&lt;&#x2F;code&gt;
that ultimately points to some bytecode, and we want to produce a new
Wasm function that bakes in everything about that
&lt;code&gt;JSFunction&lt;&#x2F;code&gt;. Because this object only exists in the Wasm heap after
the Wasm code has run, parsed some JS source, allocated some space on
the heap, and emitted its internal bytecode, we cannot do the
specialization on the &quot;original Wasm&quot; outside the scope of some
particular execution.&lt;&#x2F;p&gt;
&lt;p&gt;Recall from the &lt;a href=&quot;&#x2F;blog&#x2F;2024&#x2F;08&#x2F;27&#x2F;aot-js&#x2F;&quot;&gt;earlier post&lt;&#x2F;a&gt; that we want
a &quot;phase separation&quot;: we want a fully ahead-of-time process. We
definitely cannot embed a weval transform in a Wasm engine and expose
it at runtime. How do we reconcile this need for phasing with weval&#x27;s
requirement to see the heap snapshot and specialize over it?&lt;&#x2F;p&gt;
&lt;p&gt;The answer comes in the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wizer&quot;&gt;Wizer&lt;&#x2F;a&gt; tool, or more
generally the notion of &lt;em&gt;snapshotting&lt;&#x2F;em&gt;: we modify the top-level logic
of the language runtime so that we can parse source and produce
internal bytecode when invoked with some entry point, and set up
global state so that an invocation from another entry point actually
runs the code. This work has already been done for various runtimes,
including SpiderMonkey, because it is a useful transform to &quot;bundle&quot;
an interpreter with pre-parsed bytecode and in-memory data structures,
even if we don&#x27;t do any compilation.&lt;&#x2F;p&gt;
&lt;p&gt;weval thus builds on top of Wizer: it takes a Wasm snapshot, with all
the runtime&#x27;s data structures, finds the bytecode and interpreter
function, does the specialization, and appends new functions to the
Wasm snapshot.&lt;&#x2F;p&gt;
&lt;p&gt;How does weval find what it needs to specialize? The interpreter,
inside the Wasm module, makes &quot;weval requests&quot; via an API that
(nominally, in simplified terms) looks like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;void&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; weval_request&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-support z-type&quot;&gt;func_t&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt; specialized&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-support z-type&quot;&gt; func_t&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt; generic&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-support z-type&quot;&gt; param_t&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt; params&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where &lt;code&gt;generic&lt;&#x2F;code&gt; is a function pointer to the generic function,
&lt;code&gt;specialized&lt;&#x2F;code&gt; is a location on the heap to place a function pointer to
the resulting specialized function, and &lt;code&gt;params&lt;&#x2F;code&gt; encodes all of the
information to specialize on (including the pointer to bytecode).&lt;&#x2F;p&gt;
&lt;p&gt;Internally, this &quot;request&quot; is recorded as a data structure with a
well-known format in the Wasm heap, and is linked into a linked list
that is reachable via a well-known global. weval knows where to look
in the Wasm snapshot to find these requests.&lt;&#x2F;p&gt;
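&lt;p&gt;As a rough sketch (with hypothetical field and type names; the real
layout is defined by weval and its vendored &lt;code&gt;weval.h&lt;&#x2F;code&gt;), the
in-memory request record might look like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;&#x2F;* Hypothetical sketch of a weval request record. *&#x2F;
struct weval_req {
    struct weval_req *next;  &#x2F;* next request in the linked list *&#x2F;
    func_t *specialized;     &#x2F;* heap slot where weval writes the specialized function pointer *&#x2F;
    func_t generic;          &#x2F;* the generic (interpreter) function *&#x2F;
    param_t *params;         &#x2F;* constant values to specialize on, including the bytecode pointer *&#x2F;
};

&#x2F;* A well-known global points at the list head so the weval tool can
   find all pending requests in the heap snapshot. *&#x2F;
struct weval_req *weval_req_list_head;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;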
&lt;p&gt;From the point of view of the guest, these requests are fulfilled
asynchronously: the &lt;code&gt;specialized&lt;&#x2F;code&gt; function pointer will not be filled
in right away with a new function, but will be at some point in the
future. The interpreter thus must have a conditional when invoking
each function: if the specialized version exists, call that, otherwise
call the generic interpreter body.&lt;&#x2F;p&gt;
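&lt;p&gt;In code, the call site&#x27;s conditional might look something like the
following sketch (names here are illustrative, not the actual
SpiderMonkey or weval API):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;&#x2F;* Call the specialized body if weval has filled in the pointer;
   otherwise fall back to the generic interpreter. *&#x2F;
value_t invoke(script_t *script, args_t *args) {
    if (script-&amp;gt;specialized != NULL) {
        return script-&amp;gt;specialized(args);  &#x2F;* weval-compiled code *&#x2F;
    }
    return interpreter_body(script, args);   &#x2F;* generic interpreter *&#x2F;
}&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;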
&lt;p&gt;Why asynchronously? Because this is how snapshotting appears from the
guest perspective: code runs during Wizer&#x27;s pre-initialization phase,
the language runtime&#x27;s frontend parses source and creates bytecode,
and weval requests are created for every function that is &quot;born&quot;. It
is only when pre-initialization execution completes, a snapshot of the
heap is taken, and that snapshot is processed by weval, that weval can
append new functions, fill in the function pointers in the heap
snapshot, and write out a new &lt;code&gt;.wasm&lt;&#x2F;code&gt; file. When &lt;em&gt;that&lt;&#x2F;em&gt; Wasm module is
later instantiated, it &quot;wakes up&quot; again from the snapshot and the
specialized functions suddenly exist. Voilà, compiled code.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;correctness-via-semantics-preserving-specialization&quot;&gt;Correctness via Semantics-Preserving Specialization&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s worth emphasizing one aspect of this whole design again. In order
to use weval, an interpreter (i) adds some &lt;code&gt;update_context&lt;&#x2F;code&gt; intrinsics
whenever &lt;code&gt;pc&lt;&#x2F;code&gt; is updated, and (ii) makes weval requests for function
pointers to specialized code. That&#x27;s it.&lt;&#x2F;p&gt;
&lt;p&gt;In particular, nothing in the interpreter has to encode the execution
semantics of the guest-language bytecode in a way that is only seen
during compiled-code execution. Rather, weval&#x27;d compiled code and
interpreted code execute using &lt;em&gt;exactly the same opcode
implementations&lt;&#x2F;em&gt;, by construction. This is a &lt;em&gt;radical simplification&lt;&#x2F;em&gt;
in the testing complexity of the whole source-language compilation
approach: we can test and debug the interpreter (running wherever we
like, including in a native rather than Wasm build), and we can
separately test and debug weval.&lt;&#x2F;p&gt;
&lt;p&gt;How do we test weval? During its development, I added a lockstep
execution mode where the &quot;generic&quot; function and &quot;specialized&quot; function
both run in a snapshot and we compare the results. If the transform is
correct, they should be identical, independent of whatever the
interpreter code is doing. We certainly run end-to-end tests of
weval&#x27;d JS code as well, for added assurance, but this test
factoring was a huge help in the &quot;bring-up&quot; of this approach.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;optimized-codegen-abstracted-interpreter-state&quot;&gt;Optimized Codegen: Abstracted Interpreter State&lt;&#x2F;h2&gt;
&lt;p&gt;One final suite of optimization ideas extends from the question: how
can we make the interpreter&#x27;s management of guest-language state more
efficient?&lt;&#x2F;p&gt;
&lt;p&gt;To unpack this a bit: when compiling any imperative language (at least
languages with reasonable semantics), we can usually identify elements
of the guest-language state, such as local variables, and map that
state directly to &quot;registers&quot; or other fast storage with fixed, static
names. (In Wasm bytecode, we use locals for this.) However, when
interpreting such a language, usually we have implementations for
opcodes that are generic over &lt;em&gt;which&lt;&#x2F;em&gt; variables we are reading and
writing, so we have runtime indirection. To make this concrete, if we
have a statement&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;z = x + y;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;we can compile this to something like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add z, x, y&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;in machine code or Wasm bytecode (with &lt;code&gt;add&lt;&#x2F;code&gt; suitably replaced by
whatever the static or dynamic types require). But if we have a case
for this operator in an interpreter loop, we have to write&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;regs[z] = regs[x] + regs[y];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and in the compiled interpreter body, this implies many reads and
writes to the &lt;code&gt;regs&lt;&#x2F;code&gt; array in memory.&lt;&#x2F;p&gt;
&lt;p&gt;The weval approach to compilation-by-specializing-interpreters
directly copies the interpreter cases for each opcode. We can
specialize the register&#x2F;variable indices &lt;code&gt;x&lt;&#x2F;code&gt;, &lt;code&gt;y&lt;&#x2F;code&gt;, and &lt;code&gt;z&lt;&#x2F;code&gt; in the
above code, but we cannot (or should not) turn memory loads and stores
to an in-memory &lt;code&gt;regs&lt;&#x2F;code&gt; array into direct values in registers (Wasm
locals).&lt;&#x2F;p&gt;
&lt;p&gt;Why? Not only would this require a fair amount of complexity, it would
often be &lt;em&gt;incorrect&lt;&#x2F;em&gt;, at least in the case where &lt;code&gt;regs&lt;&#x2F;code&gt; is observable
from other parts of the interpreter. For example, if the interpreter
has a moving garbage collector, every part of the interpreter state
might be subject to tracing and pointer rewriting at any &quot;GC
safepoint&quot;, which could be at many different points in the interpreter
body.&lt;&#x2F;p&gt;
&lt;p&gt;Better, and more in the spirit of weval, would be to provide
intrinsics that the interpreter can &lt;em&gt;opt into&lt;&#x2F;em&gt; to indicate that some
value storage &lt;em&gt;can&lt;&#x2F;em&gt; be rewritten into direct dataflow and memory loads
and stores can be optimized out. weval provides such intrinsics for
&quot;locals&quot;, which are indexed by integers that must be constant during
const-prop (i.e., must come directly from the bytecode), and for an
&quot;operand stack&quot;, so that interpreters of stack VM-based bytecode (as
SpiderMonkey&#x27;s JS opcode VM is) can perform &lt;em&gt;virtual&lt;&#x2F;em&gt; pushes and
pops. weval tracks the abstract state of these locals and stack slots,
and &quot;flushes&quot; the state to memory only on a &quot;synchronize&quot; intrinsic,
which the interpreter uses when its state may be externally observable
(or never, if its design allows that). These intrinsics provided a
substantial speedup in SpiderMonkey&#x27;s case.&lt;&#x2F;p&gt;
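&lt;p&gt;For instance, an interpreter case for an add over the operand stack
might use these intrinsics roughly as follows (a sketch with
hypothetical intrinsic names, not the exact &lt;code&gt;weval.h&lt;&#x2F;code&gt; API):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;case OP_ADD: {
    &#x2F;* Virtual pops and push: weval tracks the abstract stack state
       and elides the loads and stores to the in-memory stack. *&#x2F;
    value_t rhs = weval_pop();
    value_t lhs = weval_pop();
    weval_push(lhs + rhs);
    break;
}
case OP_CALL: {
    &#x2F;* The callee (or a GC at this safepoint) may observe interpreter
       state, so flush the virtual locals and stack to memory first. *&#x2F;
    weval_synchronize();
    do_call(vm);
    break;
}&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;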
&lt;h2 id=&quot;other-optimizations-and-details&quot;&gt;Other Optimizations and Details&lt;&#x2F;h2&gt;
&lt;p&gt;I&#x27;ve elided a fair number of other details, of course! Worthy of note
are a few other things:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The weval tool has top-level caching per &quot;specialization
request&quot;. It takes a hash of the input Wasm module and the &quot;argument
string&quot;, which includes the bytecode and other constant values, of
each specialization request. If it has seen the exact input module
and a particular argument string before, it copies the resulting
Wasm function body directly out of the cache, as fast as SQLite and
the user&#x27;s disk can go. This turns out to be really useful for the
&quot;AOT ICs&quot; corpus mentioned in the previous post.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;I added special handling to elide the remnants of LLVM&#x27;s shadow
stack in function bodies where constant-propagation and
specialization removed all uses of data on the shadow stack. This
technically removes &quot;side-effects&quot; (updates to the shadow stack
pointer global) but only if they are unobserved, as determined by an
escape analysis.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Some kinds of patterns result from the partial evaluation that can
be cleaned up further. In particular, when an initial value (such as
a &lt;code&gt;pc&lt;&#x2F;code&gt; or &lt;code&gt;sp&lt;&#x2F;code&gt;) is &lt;code&gt;Unknown&lt;&#x2F;code&gt; (not constant or known at compile
time), but &lt;em&gt;offsets&lt;&#x2F;em&gt; from that value &lt;em&gt;are&lt;&#x2F;em&gt; constant properties of
program points in the bytecode, then we see chains of &lt;code&gt;pc&#x27; = pc + 3&lt;&#x2F;code&gt;, &lt;code&gt;pc&#x27;&#x27; = pc&#x27; + 4&lt;&#x2F;code&gt;, &lt;code&gt;pc&#x27;&#x27;&#x27; = pc&#x27;&#x27; + 1&lt;&#x2F;code&gt;, etc. We can rewrite all of
these in terms of the original non-constant value, in essence a
re-association of &lt;code&gt;(x + k1) + k2&lt;&#x2F;code&gt; to &lt;code&gt;x + (k1 + k2)&lt;&#x2F;code&gt;, which removes
a lot of extraneous dataflow from the fastpath of compiled code.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;switch&lt;&#x2F;code&gt; statements in the source language result in
user-data-dependent &quot;next contexts&quot; and &lt;code&gt;pc&lt;&#x2F;code&gt; values; what is needed
is to instead &quot;value-specialize&quot;, i.e. split into sub-contexts each
of which assumes &lt;code&gt;i = 0&lt;&#x2F;code&gt;, &lt;code&gt;i = 1&lt;&#x2F;code&gt;, ..., &lt;code&gt;i &amp;gt;= N&lt;&#x2F;code&gt;, generate a
Wasm-level switch opcode in the output, and constant-propagate
accordingly. This is how we can turn arbitrary-fanout control flow
in the source bytecode directly into the same control flow in Wasm.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;results-annotation-overhead-and-runtime-speedups&quot;&gt;Results: Annotation Overhead and Runtime Speedups&lt;&#x2F;h2&gt;
&lt;p&gt;weval is a general tool, usable by any interpreter; to evaluate it,
we&#x27;ll need to consider it in the context of particular
interpreters. Because it was developed with the SpiderMonkey JS engine
in mind, and for my work to enable fully ahead-of-time JS
compilation, I&#x27;ll mainly present results in that context.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Annotation Overhead&lt;&#x2F;em&gt;: the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;pull&#x2F;48&quot;&gt;PR that adds weval
support&lt;&#x2F;a&gt; to
SpiderMonkey shows a total lines-of-code delta of &lt;code&gt;+1045 -2&lt;&#x2F;code&gt; -- a
little over a thousand lines of code -- though much of this is the
vendored &lt;code&gt;weval.h&lt;&#x2F;code&gt; (part of weval proper, imported into the tree), and
a C++ RAII wrapper around weval requests (reusable in other
interpreters). The actual changes to the interpreter loop come in at
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;pull&#x2F;48&#x2F;files#diff-4958b3e538c41183f77fdc060e3947e5bbbcd7e8da1905dada8b4c9abe3ebabe&quot;&gt;133 lines of
code&lt;&#x2F;a&gt;
in an alternative set of macro definitions -- and, if we&#x27;re being
fair, the earlier mechanical changes to use these macros to access and
update interpreter state rather than direct code. Then there is a
little bit of plumbing to create weval requests when a function is
created (&lt;code&gt;EnqueueScriptSpecialization&lt;&#x2F;code&gt; and
&lt;code&gt;EnqueueICStubSpecialization&lt;&#x2F;code&gt;), about a hundred lines total including
the definitions of those functions.&lt;&#x2F;p&gt;
&lt;p&gt;As an additional demonstration of ease-of-use, one might consider &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bernsteinbear.com&#x2F;blog&#x2F;weval&#x2F;&quot;&gt;Max
Bernstein&#x27;s post&lt;&#x2F;a&gt;, in which Max
and I wrote a toy interpreter (10 opcodes) and weval&#x27;d it in a few
hours.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Runtime Speedup&lt;&#x2F;em&gt;: In my &lt;a href=&quot;&#x2F;blog&#x2F;2024&#x2F;08&#x2F;27&#x2F;aot-js&#x2F;#results&quot;&gt;earlier
post&lt;&#x2F;a&gt; I presented results showing
overall 2.77x geomean speedup, with up to 4.39x on the highest
benchmark (over 4x on 3 of 13). This reflects significant
optimization work in SpiderMonkey too -- the speedup does not come for
free on day one of a weval-adapted interpreter -- but the results are
absolutely enabled by AOT compilation: the weval&#x27;d PBL
interpreter body, run as an interpreter, provides only 1.27x the
performance of the generic interpreter. This means that weval&#x27;s
compilation via the first Futamura projection, as well as lifting of
interpreter dataflow out of memory to SSA, and other miscellaneous
post-processing optimization passes, are responsible for a 2.19x
speedup (and that part truly is &quot;for free&quot;).&lt;&#x2F;p&gt;
&lt;p&gt;Overall, I&#x27;m pretty happy with these results; and we are working on
shipping the use of weval to AOT-compile JS in the Bytecode Alliance
(see
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;StarlingMonkey&#x2F;pull&#x2F;91&quot;&gt;here&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;related-work&quot;&gt;Related Work&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;graalvm&quot;&gt;GraalVM&lt;&#x2F;h3&gt;
&lt;p&gt;Finally, to place weval in a broader context: in addition to the
foundational work on Futamura projections, weval is absolutely
inspired by GraalVM, the only other significant (in fact far more
significant) compiler based on partial-evaluation-of-interpreters to
my knowledge. weval is in broad strokes doing &quot;the same thing&quot; as
GraalVM, but makes a few (massive) simplifications and a few
generalizations as well:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;weval targets WebAssembly, a general-purpose low-level compiler
target that can support interpreters written in many languages
(e.g., C and C++, covering a majority of the most frequently-used
scripting language interpreters and JavaScript interpreters today),
while GraalVM targets the Java Virtual Machine.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;weval is fully ahead-of-time, while classically at least, GraalVM is
closely tied with the JVM&#x27;s JIT and requires runtime codegen and
(infamously) long warmup times. More recent versions of GraalVM also
support &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.graalvm.org&#x2F;latest&#x2F;reference-manual&#x2F;native-image&#x2F;&quot;&gt;GraalVM Native
Image&lt;&#x2F;a&gt;,
though GraalVM still carries complexity due to its support for,
e.g., runtime de-optimization.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;weval requires the interpreter to add a few intrinsics (details
above) but otherwise &quot;meets interpreters where they are&quot;, allowing
adaptation of existing industrial-grade language implementations
(such as SpiderMonkey!), while GraalVM requires the interpreter to
be written in terms of its own AST classes.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;On the other hand, weval, as an AOT-only transform, does not support
any speculative-optimization or deoptimization mechanisms, or
gradual warmup, vastly simplifying the design as compared to GraalVM
but also limiting performance and flexibility (for now).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;porffor-static-hermes-static-typescript-assemblyscript&quot;&gt;porffor, Static Hermes, Static TypeScript, AssemblyScript&lt;&#x2F;h3&gt;
&lt;p&gt;There have been a variety of ahead-of-time compilers that accept
either JavaScript
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www-sop.inria.fr&#x2F;members&#x2F;Manuel.Serrano&#x2F;publi&#x2F;serrano-dls18.pdf&quot;&gt;Hopc&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;porffor.dev&#x2F;&quot;&gt;porffor&lt;&#x2F;a&gt;) or annotated JS (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;facebook&#x2F;hermes&quot;&gt;Static
Hermes&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.microsoft.com&#x2F;en-us&#x2F;research&#x2F;uploads&#x2F;prod&#x2F;2019&#x2F;09&#x2F;static-typescript-draft2.pdf&quot;&gt;Static
TypeScript&lt;&#x2F;a&gt;),
or a JavaScript&#x2F;TypeScript-like language
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.assemblyscript.org&#x2F;&quot;&gt;AssemblyScript&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;Manuel Serrano in his &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www-sop.inria.fr&#x2F;members&#x2F;Manuel.Serrano&#x2F;publi&#x2F;serrano-dls18.pdf&quot;&gt;DLS&#x27;18
paper&lt;&#x2F;a&gt;
describes building an ahead-of-time compiler, Hopc, for JS that takes
a simple approach to specialization without runtime type observations:
it builds &lt;em&gt;one&lt;&#x2F;em&gt; specialized version of each function alongside the
generic version, and uses a set of type-inference rules to pick the
most &lt;em&gt;likely&lt;&#x2F;em&gt; types. This is shown to work quite well in the
paper. The main downside is the limit to how far this inference can
go: for example, SpiderMonkey&#x27;s ICs can specialize property accesses
on object shape, while Hopc does not infer object shapes and in
general such an analysis is hard (similar to global pointer analysis).&lt;&#x2F;p&gt;
&lt;p&gt;The porffor JS compiler (as best I can tell!) emits &quot;direct&quot;
code for JS operators, with some type inference to reduce the number
of cases needed. The speedups on small programs are compelling, and I
think that taking this approach from scratch actually makes a lot of
sense. That said, JS is a large language with a lot of awkward
corner-cases, and with (as of writing) 39% of the ECMA-262 test suite
passing, porffor still has a hill to climb to reach parity with mature
engines such as SpiderMonkey or V8. I sincerely wish them all the best
of luck and would love to see the project succeed!&lt;&#x2F;p&gt;
&lt;p&gt;Static Hermes and Static TypeScript both adopt the idea: what if we
require type annotations for compilation? In that case, we can
dispense with the whole idea of dynamic IC dispatch -- we know which
case(s) will be needed ahead of time and we can inline them
directly. This is definitely the right answer if one has those
annotations. More modern codebases tend to have more discipline in
this regard; unfortunately, the world is filled with &quot;legacy JS&quot; and
so the idea is not universally applicable.&lt;&#x2F;p&gt;
&lt;p&gt;AssemblyScript is another design point in that same spirit, taken
further: though it superficially resembles TypeScript, its semantics
are often a little different, in a way that makes the language simpler
(to the implementer) overall but creates major compatibility hurdles
in any effort to port existing TypeScript. For example, it does not
support function closures (lambdas), iterators or exceptions, all
regularly used features in large JS&#x2F;TS codebases today.&lt;&#x2F;p&gt;
&lt;p&gt;Compared to all of the cases above, the approach we have taken here --
adapting SpiderMonkey and turning a 100%-coverage interpreter tier
into a compiler with weval -- provides &lt;em&gt;full compatibility&lt;&#x2F;em&gt; with
everything that the SpiderMonkey engine supports, i.e., all recent
JavaScript features and APIs, without requiring anything special.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion-principled-derivation-of-a-jit&quot;&gt;Conclusion: Principled Derivation of a JIT&lt;&#x2F;h2&gt;
&lt;p&gt;weval has been an incredibly fun project, and honestly, that it is
working this well has surprised me. GraalVM is certainly an existence
proof for the idea, but it also has engineer-decades (perhaps
centuries?) of work poured into it; it was a calculated risk I took to
follow my intuitive sense that one could do it more simply by focusing
on the AOT-only case and by leveraging the simplicity and full
encapsulation of Wasm as a basis for program transforms. I do believe
that weval was the shortest path I could have taken to produce an
ahead-of-time compiler (that is, no warmup or runtime codegen) for
JavaScript. This is largely for the reason that it let me factor out
the &quot;semantics of the bytecode execution&quot; problem, i.e., writing the
PBL interpreter, from the code generation problem; get the former
right with a native debugger and native test workflow; and then do a
principled transform to get a compiler out of it.&lt;&#x2F;p&gt;
&lt;p&gt;This is the big-picture story I want to tell as well: we &lt;em&gt;can&lt;&#x2F;em&gt; &quot;factor
complexity&quot; by replacing manual implementations of large, complex
systems with automatic derivations in some cases. Likely it won&#x27;t ever
be completely as good as the hand-written system; but it might get
close, if one can find the right abstractions to express all the right
optimizations. Profile-guided inlining of ICs to take AOT-compiled
(weval&#x27;d) JS code to the next performance level is another good
example, and in fact it was inspired by WarpMonkey, which took the same
&quot;principled derivation from one single source of truth&quot; approach to
building a higher-performance tier. There is a lot more here to do.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to Luke Wagner, Nick Fitzgerald, and Max Bernstein for reading
and providing feedback on a draft of this post!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;In reality, native baseline compilers often keep a &quot;virtual
stack&quot; that allows avoiding some push and pop operations --
basically by keeping track of which registers represent the
values on the top of the stack whose pushes have been deferred,
and using them directly rather than popping when possible,
forcing a deferred &quot;synchronization&quot; only when the full stack
contents have to be reified for, say, a call or a GC safepoint.
Both SpiderMonkey&#x27;s baseline compiler and Wasmtime&#x27;s Winch
implement this optimization, which can be seen as a simple form
of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Abstract_interpretation&quot;&gt;abstract
interpretation&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;For more on this topic, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.cs.cmu.edu&#x2F;afs&#x2F;cs&#x2F;academic&#x2F;class&#x2F;15745-s13&#x2F;public&#x2F;lectures&#x2F;L5-Foundations-of-Dataflow-1up.pdf&quot;&gt;these
slides&lt;&#x2F;a&gt;
are a good introduction. The online book &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cs.au.dk&#x2F;~amoeller&#x2F;spa&#x2F;spa.pdf&quot;&gt;Static Program
Analysis&lt;&#x2F;a&gt; by Møller and
Schwartzbach is also an invaluable resource, as well as the
classic &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Compilers:_Principles,_Techniques,_and_Tools&quot;&gt;Dragon
Book&lt;&#x2F;a&gt;
(Aho, Lam, Sethi, Ullman).&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;&quot;Simple&quot;, right?! The real implementation is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;weval&#x2F;blob&#x2F;bb253a398f0fd622f64af69abaa9e39e54f56769&#x2F;src&#x2F;eval.rs&quot;&gt;about 2400 lines
of
code&lt;&#x2F;a&gt;,
which is actually not terrible for something this powerful,
IMHO. I&#x27;m personally shocked how &quot;small&quot; weval turned out to be.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Compilation of JavaScript to Wasm, Part 2: Ahead-of-Time vs. JIT</title>
        <published>2024-08-27T00:00:00+00:00</published>
        <updated>2024-08-27T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2024/08/27/aot-js/"/>
        <id>https://cfallin.org/blog/2024/08/27/aot-js/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2024/08/27/aot-js/">&lt;p&gt;&lt;em&gt;This is a continuation of my &quot;fast JS on Wasm&quot; series; the &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;first
post&lt;&#x2F;a&gt; covered PBL, a portable
interpreter that supports inline caches, this post adds ahead-of-time
compilation, and the final post will discuss the details of that
ahead-of-time compilation. Please read the first post first for useful
context!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;The most popular programming language in the world, by a wide margin
-- thanks to the ubiquity of the web -- is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;JavaScript&quot;&gt;JavaScript&lt;&#x2F;a&gt; (or, if you
prefer to follow international standards, ECMAScript per
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ecma-international.org&#x2F;publications-and-standards&#x2F;standards&#x2F;ecma-262&#x2F;&quot;&gt;ECMA-262&lt;&#x2F;a&gt;). For
a computing platform to be relevant to many modern kinds of
applications, it should run JavaScript.&lt;&#x2F;p&gt;
&lt;p&gt;For the past four years or so, I have been working on
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;webassembly.org&#x2F;&quot;&gt;WebAssembly&lt;&#x2F;a&gt; (Wasm) tooling and platforms,
and in particular, running Wasm &quot;outside the browser&quot; (where it was
born), using it for strong
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sandbox_(computer_security)&quot;&gt;sandboxing&lt;&#x2F;a&gt;
of untrusted server-side code instead.&lt;&#x2F;p&gt;
&lt;p&gt;This blog post will describe my work, over the past 1.5 years, to
build an &lt;em&gt;ahead-of-time compiler&lt;&#x2F;em&gt; from JavaScript to WebAssembly
bytecode, achieving a 3-5x speedup. The work is now technically
largely complete: it has been integrated into the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.bytecodealliance.org&#x2F;&quot;&gt;Bytecode
Alliance&lt;&#x2F;a&gt;&#x27;s version of SpiderMonkey
(to be eventually upstreamed, ideally), then our shared
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;starlingmonkey&quot;&gt;StarlingMonkey&lt;&#x2F;a&gt;
JS-on-Wasm runtime on top of that (available with the &lt;code&gt;--aot&lt;&#x2F;code&gt; flag to
the &lt;code&gt;componentize.sh&lt;&#x2F;code&gt; toplevel build command), then my employer&#x27;s JS
SDK built on top of StarlingMonkey. It passes all &quot;JIT tests&quot; and &quot;JS
tests&quot; in the SpiderMonkey tree, and all Web Platform Tests at the runtime
level. It&#x27;s still in &quot;beta&quot; and considered experimental, but now is a good time
to do a deep-dive into how it works!&lt;&#x2F;p&gt;
&lt;p&gt;This JavaScript AOT compilation approach is built on top of my
&lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;Portable Baseline Interpreter&lt;&#x2F;a&gt;
work in SpiderMonkey, combined with my
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;weval&quot;&gt;weval&lt;&#x2F;a&gt; partial program evaluator to
provide compilation from an interpreter body &quot;for free&quot; (using a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Partial_evaluation#Futamura_projections&quot;&gt;Futamura
projection&lt;&#x2F;a&gt;). I
have recently &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=_T3s6-C38JI&quot;&gt;given a
talk&lt;&#x2F;a&gt; about this work. I
hope to go into a bit more depth in this post and a followup one;
first, how one can compile JavaScript ahead-of-time at all, and to
come later, how weval enables easier construction of the necessary
compiler backends.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;background&quot;&gt;Background&lt;&#x2F;h2&gt;
&lt;p&gt;How do we run JavaScript (JS) on a platform that supports WebAssembly?
At first, this question was needless, because Wasm originated within
existing JS engines as a way to provide lower-level, strongly-typed
code directly to the engine&#x27;s compiler backend: so one could run JS
code alongside any Wasm modules and they could interact at a
function-call level. But on Wasm-first platforms, &quot;outside the
browser&quot; as we say, we have no system-level JS engine; we can only run
Wasm modules that have been uploaded by the user, which interact with
the platform via direct &quot;hostcalls&quot; available as Wasm imports.&lt;&#x2F;p&gt;
&lt;p&gt;To quote my &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;earlier blog post&lt;&#x2F;a&gt; on
this topic, by adopting this restriction, we obtain a number of
advantages:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Running an entire JavaScript engine &lt;em&gt;inside&lt;&#x2F;em&gt; of a Wasm module may
seem like a strange approach at first, but it serves real
use-cases. There are platforms that accept WebAssembly-sandboxed
code for security reasons, as it ensures complete memory isolation
between requests while remaining very fine-grained (hence with lower
overheads). In such an environment, JavaScript code needs to bring
its own engine, because no platform-native JS engine is
provided. This approach ensures a sandbox &lt;em&gt;without trusting the
JavaScript engine&#x27;s security&lt;&#x2F;em&gt; -- because the JS engine is just
another application on the hardened Wasm platform -- and carries
other benefits too: for example, the JS code can interact with other
languages that compile to Wasm easily, and we can leverage Wasm&#x27;s
determinism and modularity to snapshot execution and then perform
extremely fast cold startup.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;So we have very fine-grained isolation and security, we eliminate JIT
bugs altogether (the most productive source of CVEs in production
browsers today!), and the modularity enables interesting new system
design points.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, the state of the art for bundling JS &quot;with its own
engine&quot; today is, still, to combine an &lt;em&gt;interpreter&lt;&#x2F;em&gt; with a bytecode
representation of the JS, without any compilation of the JS to
specialized &quot;native&quot; (or in this case Wasm) code. There are two
reasons. The first and most straightforward one is that Wasm is simply
a new platform; JS engines&#x27; interpreters can be ported relatively
straightforwardly, because they have little dependence on the
underlying instruction set architecture&#x2F;platform, but JIT compilers
very much do.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;compiling-js-why-is-it-hard&quot;&gt;Compiling JS: Why is it Hard?&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s worth reviewing first why this is a hard problem. It&#x27;s possible
to compile quite a few languages to a Wasm target today: C or C++
(with &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;webassembly&#x2F;wasi-sdk&quot;&gt;wasi-sdk&lt;&#x2F;a&gt; or
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;emscripten.org&#x2F;&quot;&gt;Emscripten&lt;&#x2F;a&gt;, for example), Rust (with its
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.rust-lang.org&#x2F;what&#x2F;wasm&quot;&gt;first-class Wasm support&lt;&#x2F;a&gt;),
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;swiftwasm.org&#x2F;&quot;&gt;Swift&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;kotlinlang.org&#x2F;docs&#x2F;wasm-overview.html&quot;&gt;Kotlin&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dev.virtuslab.com&#x2F;p&#x2F;scala-to-webassembly-how-and-why&quot;&gt;Scala&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ghc.gitlab.haskell.org&#x2F;ghc&#x2F;doc&#x2F;users_guide&#x2F;wasm.html&quot;&gt;Haskell&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ocaml-wasm&#x2F;wasm_of_ocaml&quot;&gt;OCaml&lt;&#x2F;a&gt;, and probably
many more. What makes JavaScript different?&lt;&#x2F;p&gt;
&lt;p&gt;The main difficulty comes from JS&#x27;s &lt;em&gt;dynamic typing&lt;&#x2F;em&gt;: because
variables are not annotated or constrained to a single primitive type
or object class, a simple expression like &lt;code&gt;x + y&lt;&#x2F;code&gt; could mean many
different things. This could be an integer or floating-point numeric
addition, a string concatenation, or an invocation of arbitrary JS
code (via &lt;code&gt;.toString()&lt;&#x2F;code&gt;, &lt;code&gt;.valueOf()&lt;&#x2F;code&gt;, proxy objects, or probably
other tricks too). A naive compiler would generate a large amount of
code for this simple expression that performs type checks and
dispatches to one of these different cases. For both runtime
performance reasons (checking for all cases is quite slow) and
code-size reasons (can we afford hundreds or thousands of Wasm
instructions for each JS operator?), this is impractical. We will need
to somehow adapt the techniques that modern JS engines have invented
to the Wasm platform.&lt;&#x2F;p&gt;
&lt;p&gt;This leads us directly to the engineering marvel
that is the modern JIT (just-in-time) compiler for JavaScript. These
engines, such as &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;spidermonkey.dev&#x2F;&quot;&gt;SpiderMonkey&lt;&#x2F;a&gt; (part of
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;firefox.org&#x2F;&quot;&gt;Firefox&lt;&#x2F;a&gt;), &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;v8.dev&#x2F;&quot;&gt;V8&lt;&#x2F;a&gt; (part of
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.chromium.org&#x2F;Home&#x2F;&quot;&gt;Chromium&lt;&#x2F;a&gt;), and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;WebKit&#x2F;WebKit&#x2F;tree&#x2F;main&#x2F;Source&#x2F;JavaScriptCore&quot;&gt;JavaScriptCore&lt;&#x2F;a&gt;
(part of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;webkit.org&#x2F;&quot;&gt;WebKit&lt;&#x2F;a&gt;), generate &lt;em&gt;native machine
code&lt;&#x2F;em&gt; for JavaScript code. They work around the above difficulty by
generating code only for the cases that &lt;em&gt;actually occur&lt;&#x2F;em&gt;, specializing
based on runtime observations.&lt;&#x2F;p&gt;
&lt;p&gt;These JITs work fantastically well today on native platforms, but
there are technical reasons why a Wasm-based platform is a poor fit
for current JS engines&#x27; designs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;unique-characteristics-of-a-wasm-platform&quot;&gt;Unique Characteristics of a Wasm Platform&lt;&#x2F;h2&gt;
&lt;p&gt;A typical &quot;Wasm-first&quot; platform -- whether that be
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;wasmtime.dev&#x2F;&quot;&gt;Wasmtime&lt;&#x2F;a&gt; running server-side, serving
requests with untrusted handler code, or a sandboxed plugin interface
inside a desktop application, or an embedded system -- has a few key
characteristics that distinguish it from a typical native OS
environment (such as that seen by a Linux x86-64 process).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;no-runtime-codegen&quot;&gt;No Runtime Codegen&lt;&#x2F;h3&gt;
&lt;p&gt;First, these platforms typically &lt;em&gt;lack any dynamic&#x2F;runtime
code-generation mechanism&lt;&#x2F;em&gt;: that is, unlike a native process&#x27; ability
to JIT-compile new machine code and run it, a Wasm module has no
interface by which it can add new code at runtime. In other words,
code-loading functionality lives &lt;em&gt;outside the sandbox&lt;&#x2F;em&gt;, typically handled by
&quot;deployment&quot; or &quot;control plane&quot; machinery on the platform.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;
This allows greater management flexibility for the platform: for
example, it allows deployment of pre-compiled machine code for all
Wasm modules from a central control plane, allowing &quot;instant start&quot;
when a module is instantiated to serve a request. It also brings
significant security benefits: all code that is executing is known
a-priori, which makes security exploits harder to hide.&lt;&#x2F;p&gt;
&lt;p&gt;The major downside of this choice, of course, is that one cannot
implement a JIT in the traditional way: one cannot generate new code
based on observed behavior at runtime, because there is simply no
mechanism to invoke it.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;protected-call-stack-and-no-primitive-control-flow&quot;&gt;Protected Call Stack and No Primitive Control Flow&lt;&#x2F;h3&gt;
&lt;p&gt;Second, Wasm itself has some interesting divergences from conventional
instruction-set architecture (ISA) design. While a typical ISA on a
conventional CPU provides &quot;raw&quot; branch and call instructions that
transfer control to addresses in the main memory address space,
WebAssembly has first-class abstractions for modules and functions
within those modules, and all of its inter-function transfer
instructions (calls and returns) target known function entry points by
function index and maintain a protected (unmodifiable) call-stack. The
main advantage of this design choice is that a &quot;trusted stack&quot; allows
for function-level interop between different, mutually untrusting,
modules in a Wasm VM, potentially written in different languages; and
when Wasm-level features such as exceptions are implemented and used
widely, seamless interop of those features as well. In essence, it is
an enforced ABI. (My colleague Dan Gohman has some interesting
thoughts about this in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=97jw9v2dRaw&amp;amp;t=13m0s&quot;&gt;this talk, at 13 minutes
in&lt;&#x2F;a&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;This is great for language interop and for security, but it rules out
a number of interesting control-flow patterns that language runtimes
use to implement features such as exceptions, generators&#x2F;coroutines,
pluggable &quot;stubs&quot; of optimized code, patchable jump-points, and
dynamic transfer between different optimization levels of the same
code (&quot;on-stack replacement&quot;). In other words, it conflicts with how
language runtimes want to implement &lt;em&gt;their&lt;&#x2F;em&gt; ABIs.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;per-request-isolation-and-lack-of-warmup&quot;&gt;Per-Request Isolation and Lack of Warmup&lt;&#x2F;h3&gt;
&lt;p&gt;Third, and specific to some of the Wasm-first platforms we care about,
Wasm&#x27;s &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;wasmtime-10-performance&quot;&gt;&lt;em&gt;fast
instantiation&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;
allows for some very fine-grained sandboxing approaches. In
particular, when a new Wasm instance takes only a few microseconds to
start from a snapshot, it becomes feasible to treat instances as
disposable, and use them for small units of work. In the platform I
work on, we have an &lt;em&gt;instance-per-request isolation model&lt;&#x2F;em&gt;: each Wasm
instance serves only one HTTP request, and then goes away. This is
fantastic for security and bug mitigation: the blast radius of an
exploit or guest-runtime bug is only a single request, and can never
see the data from other users of the platform or even other requests
by the same user.&lt;&#x2F;p&gt;
&lt;p&gt;This fine-grained isolation is, again, great for security and
robustness, but throws a wrench into any language runtime&#x27;s plan that
begins with &quot;observe the program for a while and...&quot;. In other words,
we cannot implement a scheme for a dynamically-typed language that
requires us to specialize based on observation (of types, or dynamic
dispatch targets, or object schemas, or the like) because by the time
we make those observations for one request, our instance is disposed
and the next request starts &quot;fresh&quot; from the original snapshot.&lt;&#x2F;p&gt;
&lt;p&gt;What are we to do?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;making-js-fast-specialized-runtime-codegen&quot;&gt;Making JS Fast: Specialized Runtime Codegen&lt;&#x2F;h2&gt;
&lt;p&gt;Let&#x27;s first recap how JavaScript engines&#x27; JITs typically work -- at a
high level, &lt;em&gt;how&lt;&#x2F;em&gt; they observe program behavior, and compile the JS
into machine code based on that behavior.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;inline-caches&quot;&gt;Inline Caches&lt;&#x2F;h3&gt;
&lt;p&gt;As I described in more detail in my &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;earlier post about
PBL&lt;&#x2F;a&gt;, the key technique that all
modern JS engines use to accelerate a dynamically-typed language is to
observe execution -- with particular care toward actual types (&lt;code&gt;var x&lt;&#x2F;code&gt;
is always a string or an integer, for example), object shapes (&lt;code&gt;x&lt;&#x2F;code&gt;
always has properties &lt;code&gt;x.a&lt;&#x2F;code&gt;, &lt;code&gt;x.b&lt;&#x2F;code&gt; and &lt;code&gt;x.c&lt;&#x2F;code&gt; and we can store them in
memory in that order), or other specialized cases (&lt;code&gt;x.length&lt;&#x2F;code&gt; always
accesses the length slot of an array object) -- and then generate
specialized code for the actually-observed cases. For example, if we
observe &lt;code&gt;x&lt;&#x2F;code&gt; is always an array, we can compile &lt;code&gt;x.length&lt;&#x2F;code&gt; to a single
native load instruction of the array&#x27;s length in its object header; we
don&#x27;t need to handle cases where &lt;code&gt;x&lt;&#x2F;code&gt; is some other type. This
specialized codegen is necessarily at runtime, hence the name for the
technique, &quot;just-in-time (JIT) compilation&quot;: we generate the
specialized code while the program is running, just before it is
executed.&lt;&#x2F;p&gt;
&lt;p&gt;One could imagine building some ad-hoc framework to collect a lot of
observations (&quot;type feedback&quot;) and then direct the compiler
appropriately, and in fact you can get pretty far building a JIT this
way. However, SpiderMonkey uses a slightly more principled approach:
it uses &lt;em&gt;inline caches&lt;&#x2F;em&gt; for all type-feedback and other runtime
observations, and encodes the observed cases in a specialized compiler
IR, called &lt;em&gt;CacheIR&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The basic idea is: at every point in the program where there could be
some specialized behavior based on what we observe at runtime, we have
an &quot;IC site&quot;. This is a dynamic dispatch point: it invokes some
&lt;em&gt;stub&lt;&#x2F;em&gt;, or sequence of code, that has been &quot;attached&quot; to the IC site,
and we can attach new stubs as we execute the code. We always start
with a &quot;fallback stub&quot; that handles every case generically, but we can
emit new stubs as we learn. There is a good example of IC-based
specialization in my &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;#systematic-fast-paths-inline-caches&quot;&gt;earlier
post&lt;&#x2F;a&gt;,
summarized with this figure:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-ics-web.svg&quot; alt=&quot;Figure: Inline-cache stubs in a JavaScript function&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
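&lt;p&gt;The attachment mechanism above can be sketched in JavaScript itself (names like &lt;code&gt;makeICSite&lt;&#x2F;code&gt; and &lt;code&gt;fallbackStub&lt;&#x2F;code&gt; are illustrative, not SpiderMonkey&#x27;s actual API): an IC site is just a mutable slot holding the currently-attached stub, and the fallback stub &quot;attaches&quot; a specialized stub for the case it actually observes.&lt;&#x2F;p&gt;

```javascript
// Hypothetical sketch of an IC site, modeled in JavaScript itself; the
// names here are illustrative, not SpiderMonkey's real internals.
function makeICSite() {
  const site = {};
  // Specialized stub: valid only when the receiver is an array.
  function arrayLengthStub(x) {
    if (!Array.isArray(x)) return fallbackStub(x); // guard failed: re-dispatch
    return x.length;                               // single fast load
  }
  // Fallback stub: handles every case generically, and "attaches" a
  // specialized stub for the case it actually observes.
  function fallbackStub(x) {
    if (Array.isArray(x)) {
      site.stub = arrayLengthStub; // attach by writing a function pointer
      return x.length;
    }
    return x.length; // generic slow path for strings, objects, etc. (elided)
  }
  site.stub = fallbackStub;
  return site;
}

const site = makeICSite();
site.stub([1, 2, 3]);                 // first hit goes through the fallback...
console.log(site.stub.name);          // → "arrayLengthStub" (now specialized)
console.log(site.stub([1, 2, 3, 4])); // → 4, via the fast path
```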
&lt;h3 id=&quot;compilation-phasing-for-ics&quot;&gt;Compilation Phasing for ICs&lt;&#x2F;h3&gt;
&lt;p&gt;So the question is now: how do we adapt a system that fundamentally
&lt;em&gt;generates new code at runtime&lt;&#x2F;em&gt; -- in this case, in the form of IC
stub sequences, which encode observed special cases (&quot;if conditions X
and Y, do Z&quot;) -- to a Wasm-based platform that restricts the program
to its existing, static code?&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s ask a basic question: are IC bodies really always completely
novel, or do we see the same sequences over and over? Intuitively, one
would expect a few common sequences to dominate most programs. For
example, the inline cache sequence for a common object property access
is a few CacheIR opcodes (simplifying a bit, but not too much, from
the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;e51b2630b44a8836a7ff35a876a2d8b555041d4a&#x2F;js&#x2F;src&#x2F;jit&#x2F;CacheIR.cpp#4537&quot;&gt;actual
code&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Check that the receiver (&lt;code&gt;x&lt;&#x2F;code&gt; in &lt;code&gt;x.a&lt;&#x2F;code&gt;) is an object;&lt;&#x2F;li&gt;
&lt;li&gt;Check that its &quot;shape&quot; (mapping from property names to slots) is the
same;&lt;&#x2F;li&gt;
&lt;li&gt;Load or store the appropriate slot.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
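&lt;p&gt;A minimal sketch of those three steps as a plain function (the &lt;code&gt;__shape&lt;&#x2F;code&gt; and &lt;code&gt;__slots&lt;&#x2F;code&gt; fields are hypothetical stand-ins for the engine&#x27;s internal object layout):&lt;&#x2F;p&gt;

```javascript
// Sketch of the three CacheIR steps above; __shape and __slots are
// hypothetical stand-ins for the engine's internal object layout.
const BAIL = Symbol("bail-to-fallback"); // signal: fall back to the generic stub

function makeGetPropStub(expectedShape, slot) {
  return function (x) {
    if (typeof x !== "object" || x === null) return BAIL; // 1. receiver is an object?
    if (x.__shape !== expectedShape) return BAIL;         // 2. shape still matches?
    return x.__slots[slot];                               // 3. load the slot
  };
}

const shapeAB = { props: ["a", "b"] }; // shared by every {a, b}-shaped object
const getA = makeGetPropStub(shapeAB, 0);
console.log(getA({ __shape: shapeAB, __slots: [10, 20] })); // → 10
console.log(getA({ __shape: {}, __slots: [] }) === BAIL);   // → true (wrong shape)
```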
&lt;p&gt;One would expect nearly any real JavaScript program to use that IC
body at some point to set an object property. Can&#x27;t we include it in
the engine build itself, avoiding the need to compile it at runtime?&lt;&#x2F;p&gt;
&lt;p&gt;We could special-case &quot;built-in ICs&quot;, but there&#x27;s a more elegant
approach: retain the uniform CacheIR representation (we &lt;em&gt;always&lt;&#x2F;em&gt;
generate IR for cases we observe), but then look up the generated
CacheIR bytecode sequence in a table of &lt;em&gt;included precompiled ICs&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;How do we decide which ICs to include, and do we need to write them
out manually? Well, not necessarily: it turns out that computers are
&lt;em&gt;great&lt;&#x2F;em&gt; at automating boring tasks, and we could simply &lt;em&gt;collect all
ICs ever generated&lt;&#x2F;em&gt; during a bunch of JavaScript execution (say, the
entire testsuite for SpiderMonkey) and build them in. Lookup tables
are cheap, and ICs tend to be small in terms of compiled code size, so
there isn&#x27;t much downside to including a few thousand ICs
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;tree&#x2F;ff-127-0-2&#x2F;js&#x2F;src&#x2F;ics&#x2F;&quot;&gt;2367&lt;&#x2F;a&gt;
by latest count).&lt;&#x2F;p&gt;
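&lt;p&gt;A sketch of that lookup, assuming we key the table on the generated CacheIR byte sequence (the opcode encoding below is made up for illustration):&lt;&#x2F;p&gt;

```javascript
// Sketch of the precompiled-IC lookup table, keyed by the generated
// CacheIR byte sequence; the names and encoding here are hypothetical.
const precompiledICs = new Map(); // CacheIR bytes (as a string key) -> built-in stub

function registerIC(cacheIRBytes, compiledStub) {
  precompiledICs.set(cacheIRBytes.join(","), compiledStub);
}

function lookupIC(cacheIRBytes) {
  // At runtime we still *generate* CacheIR for the observed case, but
  // instead of compiling it, we look it up in the compiled-in corpus.
  return precompiledICs.get(cacheIRBytes.join(",")) ?? null;
}

// Suppose opcodes [1, 7, 3] mean "guard object, guard shape, load slot":
registerIC([1, 7, 3], function guardShapeLoadSlot() { /* built-in body */ });
console.log(lookupIC([1, 7, 3]) !== null); // → true: no runtime codegen needed
console.log(lookupIC([9, 9]));             // → null: would error in enforcing mode
```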
&lt;p&gt;This is the approach I took with &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;pull&#x2F;45&quot;&gt;AOT
ICs&lt;&#x2F;a&gt;: I built
infrastructure to compile ICs into the engine, loading them into the
lookup table (pre-existing to deduplicate compiled IC bodies) at
startup, and built an &lt;em&gt;enforcing mode&lt;&#x2F;em&gt; to allow for gathering the
corpus.&lt;&#x2F;p&gt;
&lt;p&gt;The latter part -- collecting the corpus, and testing that the corpus
is complete when running the whole testsuite -- is key, because it
answers the social&#x2F;software-engineering question of how to keep the
corpus up-to-date. The idea is to make updating it as easy as
possible. When the testsuite runs &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;blob&#x2F;dc07fd26d11b4ccfbc82b35c4721f30696fe098d&#x2F;.github&#x2F;workflows&#x2F;main.yml#L15&quot;&gt;in
CI&lt;&#x2F;a&gt;,
we test in &quot;AOT ICs&quot; mode that errors out if an IC is generated that
is not already in the compiled-in corpus. However, this failure also
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;blob&#x2F;dc07fd26d11b4ccfbc82b35c4721f30696fe098d&#x2F;js&#x2F;src&#x2F;jit&#x2F;BaselineCacheIRCompiler.cpp#L2537-L2568&quot;&gt;dumps a new IC
file&lt;&#x2F;a&gt;
and provides a message with instructions: move this file into
&lt;code&gt;js&#x2F;src&#x2F;ics&#x2F;&lt;&#x2F;code&gt; and rebuild, and the observed-but-not-collected IC
becomes part of the corpus. The goal is to catch the case where a
developer adds a new IC path and make the corresponding corpus
addition as painless as possible.&lt;&#x2F;p&gt;
&lt;p&gt;Another point worth noting is that while there is no guarantee that
the engine won&#x27;t generate novel IC bodies during execution of some
program -- for example, during accesses to an object with a very long
prototype chain that results in a long sequence of object-prototype
checks -- by integrating the corpus check with the testsuite
execution, there are &lt;em&gt;aligned incentives&lt;&#x2F;em&gt;. Anyone adding new
IC-accelerated functionality should add a test case that covers it;
and when we execute this testcase, we will check whether the corpus
includes the relevant IC(s) or not, too.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-about-js-bytecode&quot;&gt;What About JS Bytecode?&lt;&#x2F;h2&gt;
&lt;p&gt;We now have a corpus of IC bodies that cover all the relevant cases
for, say, the &lt;code&gt;+&lt;&#x2F;code&gt; operator in JavaScript, but we still haven&#x27;t
addressed the actual &lt;em&gt;compilation of JavaScript source&lt;&#x2F;em&gt;. Fortunately,
this is now the easiest part: we do a mostly 1-to-1 translation of JS
source to code that consists of a series of IC sites, invoking the
current IC-stub pointer for each operator instance in turn. The
dataflow (connectivity between the operators) and control flow
(conditionals and loops) become the compiled code &quot;skeleton&quot; around
these IC sites. So, for example, the JS function&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;function&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; f&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-parameters&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt;x&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt; y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-parameters z-punctuation&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  if&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; (y&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; &amp;gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; x&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; +&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;  }&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt; else&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; x&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;might become something like the following pseudo-C code sketch:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Value &lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;f&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt;Value x&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt; Value y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Value t1 &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; ics&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;](&lt;&#x2F;span&gt;&lt;span&gt;y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; `&amp;gt;=` operator&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  bool&lt;&#x2F;span&gt;&lt;span&gt; t2 &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; ics&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;1&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;](&lt;&#x2F;span&gt;&lt;span&gt;t1&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;     &#x2F;&#x2F; Value-to-bool coercion&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  if&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; (&lt;&#x2F;span&gt;&lt;span&gt;t2&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Value t3 &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; ics&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;2&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;](&lt;&#x2F;span&gt;&lt;span&gt;x&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span&gt; y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;&#x2F; `+` operator&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span&gt; t3&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;  }&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt; else&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Value t4 &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; ics&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;3&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;](&lt;&#x2F;span&gt;&lt;span&gt;x&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span&gt; y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;&#x2F; `-` operator&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span&gt; t4&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Given this 1-to-1 compilation of all JS functions in the JS source,
and a corpus of precompiled ICs, we have a full &lt;em&gt;ahead-of-time
compiler&lt;&#x2F;em&gt; for JavaScript.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;ahead-of-time-aot-compilation-of-ic-based-js&quot;&gt;Ahead-of-Time (AOT) Compilation of IC-Based JS&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s worth highlighting this fact again: we have built an
&lt;em&gt;ahead-of-time compiler&lt;&#x2F;em&gt; for JavaScript. How did this happen? Isn&#x27;t
runtime code generation necessary for efficient compilation of a
dynamically-typed language?&lt;&#x2F;p&gt;
&lt;p&gt;The key insight here is that the dynamism has been &lt;em&gt;pushed to runtime
late-binding&lt;&#x2F;em&gt; in the form of indirect calls to IC bodies. In other
words: behavior may vary widely depending on types; but rather than
building in all the options statically, we have &quot;attachment points&quot; at
each relevant program point, and we can dynamically insert behaviors by
&lt;em&gt;setting a function pointer&lt;&#x2F;em&gt; rather than generating new code. We&#x27;re
patching together fragments of static code by updating dynamic data
instead.&lt;&#x2F;p&gt;
&lt;p&gt;This execution model is known in SpiderMonkey as &lt;em&gt;baseline
compilation&lt;&#x2F;em&gt;. The main goal of the baseline compiler in SpiderMonkey
is to generate code as quickly as possible, which is a distinct goal
from generating code without runtime observations about that code; but
these goals overlap, as both are relevant at lower tiers of the
compiler hierarchy. In any case, a key fact about
baseline compilation is: it does not require any type feedback, and
can be done with &lt;em&gt;only&lt;&#x2F;em&gt; the JS source and nothing else. It admits
fully AOT compilation.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;going-further-bringing-in-runtime-feedback&quot;&gt;Going Further: Bringing In Runtime Feedback?&lt;&#x2F;h3&gt;
&lt;p&gt;Baseline compilation works well enough, but this runtime binding has one
key limitation: it means that the compiler cannot optimize &lt;em&gt;combinations
of ICs together&lt;&#x2F;em&gt;. For example, if we have a series of IC stubs for each
operator in a function, we cannot &quot;blend&quot; these IC bodies into one large
function body and allow optimizations to take hold; we cannot propagate
knowledge across ICs. One &lt;code&gt;+&lt;&#x2F;code&gt; operator may produce an integer, and
within the IC stub for that case, we can make use of the known result
type; but the next &lt;code&gt;+&lt;&#x2F;code&gt; operator to consume that value has to do its
type-checks over again. This is a fundamental fact of the &quot;late
binding&quot;: we&#x27;ve pushed composition of behaviors to runtime via the
dynamic function-pointer &quot;attachment&quot;, and we don&#x27;t have a compiler at
runtime, so there is nothing to be done.&lt;&#x2F;p&gt;
&lt;p&gt;If we want to go further, though, we can learn one more lesson from
SpiderMonkey&#x27;s design: baseline compilation with ICs is a solid
&lt;em&gt;framework&lt;&#x2F;em&gt; for re-admitting this kind of whole-function optimization
with types. Specifically, SpiderMonkey&#x27;s
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2020&#x2F;11&#x2F;warp-improved-js-performance-in-firefox-83&#x2F;&quot;&gt;WarpMonkey&lt;&#x2F;a&gt;
optimizing compiler tier works by allowing baseline code to &quot;warm up&quot;
its ICs, collecting the most relevant &#x2F; most common special cases;
then it &lt;em&gt;inlines&lt;&#x2F;em&gt; those ICs at the call sites. And that&#x27;s it!&lt;&#x2F;p&gt;
&lt;p&gt;Said a different way: putting all type feedback into a &quot;call
specialized code stubs&quot; framework reduces type-feedback compilation to
an inlining problem. Compiler engineers know how to build inliners;
that&#x27;s far easier than an ad-hoc optimizing JIT tier.&lt;&#x2F;p&gt;
&lt;p&gt;This leads to a natural way we could build a further-optimizing tier
on top of our initial AOT JS compiler: build a &quot;profile-guided
inliner&quot;. In fact, this has been done: my colleague &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fitzgen&quot;&gt;Nick
Fitzgerald&lt;&#x2F;a&gt; built a prototype,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fitzgen&#x2F;winliner&quot;&gt;Winliner&lt;&#x2F;a&gt;, that profiles Wasm
indirect calls and inlines the most frequent targets. The idea in the
context of JS is to record the most frequent call target &lt;code&gt;T&lt;&#x2F;code&gt; at each
IC dispatch site, then replace the indirect-call with &lt;code&gt;if target == T { &#x2F;* inlined code *&#x2F; } else { call target }&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
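&lt;p&gt;Sketched in JavaScript rather than Wasm (with a hypothetical hot target &lt;code&gt;addInt32&lt;&#x2F;code&gt; standing in for &lt;code&gt;T&lt;&#x2F;code&gt;), the transform looks like this:&lt;&#x2F;p&gt;

```javascript
// Sketch of the profile-guided-inlining transform, in JavaScript rather
// than Wasm: the indirect call through an IC slot becomes a guarded
// direct call. `addInt32` stands in for the recorded hot target T.

function addInt32(a, b) { return (a + b) | 0; } // the hot IC body, T

// Before: always an indirect call through the IC table.
function icSiteBefore(ics, a, b) {
  return ics[0](a, b);
}

// After: guard on the recorded target and inline its body; any other
// target still takes the original indirect call, preserving semantics.
function icSiteAfter(ics, a, b) {
  if (ics[0] === addInt32) {
    return (a + b) | 0;  // inlined body of T
  }
  return ics[0](a, b);   // cold path: unchanged indirect call
}

const ics = [addInt32];
console.log(icSiteBefore(ics, 2, 3)); // → 5
console.log(icSiteAfter(ics, 2, 3));  // → 5, now without the indirect call
```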
&lt;p&gt;The beauty of this approach is that it is &lt;em&gt;semantics-preserving&lt;&#x2F;em&gt;: the
transform itself does not know anything about JavaScript engines, and
the resulting code will behave exactly the same, so if we have gotten
our baseline-level AOT compiler to generate correct code, this
optimizing tier will, as well, without new bugs. Winliner and
profile-guided inlining let us &quot;level-up&quot; from baseline compilation to
optimized compilation in a way that is as close to &quot;for free&quot; as one
could hope for. The inlining tool and the baseline compiler can be
separately tested and verified; we can carefully reason about each
one, and convince ourselves they are correct; and we don&#x27;t need to
worry about bugs in the combination of the two pieces (where bugs most
often lurk) because by taking the Wasm ISA as the interface, they
compose correctly.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;results&quot;&gt;Results&lt;&#x2F;h2&gt;
&lt;p&gt;That&#x27;s all well and good; what are the results?&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;pull&#x2F;48&quot;&gt;final PR&lt;&#x2F;a&gt;
in which I introduced AOT compilation to our SpiderMonkey branch
quotes these numbers on the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;chromium.github.io&#x2F;octane&#x2F;&quot;&gt;Octane&lt;&#x2F;a&gt; benchmark suite (numbers
are rates, higher is better):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;SpiderMonkey running inside a Wasm module:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- generic interpreter (in production today) vs.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- ahead-of-time compilation (this post)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               interpreter        AOT compilation     Speedup&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Richards        166                 729               4.39x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;DeltaBlue       169                 686               4.06x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Crypto          412                1255               3.05x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;RayTrace        525                1315               2.50x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;EarleyBoyer     728                2561               3.52x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;RegExp          271                 461               1.70x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Splay          1262                3258               2.58x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;NavierStokes    656                2255               3.44x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;PdfJS          2182                5991               2.75x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Mandreel        166                 503               3.03x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Gameboy        1357                4659               3.43x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;CodeLoad      19417               17488               0.90x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Box2D           927                3745               4.04x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;----&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Geomean         821                2273               2.77x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;or about a 2.77x geomean improvement overall, with a maximum of 4.39x
on one benchmark and most benchmarks around 2.5x-3.5x. Not bad! For
comparison, the native baseline compiler in SpiderMonkey obtains
around a 5x geomean -- so we have some room to grow still, but this is
a solid initial release. And of course, as noted above, by inlining
ICs then optimizing further, we should be able to incorporate
profile-guided feedback (as native JITs do) to obtain higher
performance in the future.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;other-approaches&quot;&gt;Other Approaches?&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s worth noting at this point that a few other JS compilers exist
that attempt to do &lt;em&gt;specialized&lt;&#x2F;em&gt; codegen without runtime type
observations or other profiling&#x2F;warmup. Manuel Serrano&#x27;s Hopc
compiler, described in his &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www-sop.inria.fr&#x2F;members&#x2F;Manuel.Serrano&#x2F;publi&#x2F;serrano-dls18.pdf&quot;&gt;DLS&#x27;18
paper&lt;&#x2F;a&gt;,
works by building &lt;em&gt;one&lt;&#x2F;em&gt; specialized version of each function alongside
the generic version, and uses a set of type-inference rules to pick
the most &lt;em&gt;likely&lt;&#x2F;em&gt; types. This is shown to work quite well in the
paper.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;porffor.dev&#x2F;&quot;&gt;porffor&lt;&#x2F;a&gt; compiler is similarly a fully
AOT JS compiler, and appears to use some inference to emit only the
necessary specialized cases where it can. That project is a
work-in-progress, currently supporting 39% of ECMA-262, but seems
promising.&lt;&#x2F;p&gt;
&lt;p&gt;The main downside to an inference-based approach (as opposed to the
dynamic indirection through ICs in our work) is the limit to how far
this inference can go: for example, SpiderMonkey&#x27;s ICs can specialize
property accesses on object shape, while Hopc does not infer object
shapes and in general such an analysis is hard (similar to global
pointer analysis).&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;d be remiss not to note a practical aspect too: a large body of
code embeds SpiderMonkey and is written against its APIs, and the
engine supports all of the latest JS proposals and is actively
maintained. The cost of reaching parity with this level of support in
a from-scratch AOT compiler was one of the main reasons I opted
instead to build on SpiderMonkey, adapting its ICs to work
ahead-of-time and compiling its bytecode.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;up-next-compiler-backends-for-free&quot;&gt;Up Next: Compiler Backends (For Free)&lt;&#x2F;h2&gt;
&lt;p&gt;Astute readers will note that while I stated that we &lt;em&gt;do&lt;&#x2F;em&gt; a
compilation from the JS source to Wasm bytecode, and from IC bodies in
the corpus to Wasm bytecode, I haven&#x27;t said &lt;em&gt;how&lt;&#x2F;em&gt; we do this
compilation. In my &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt;
on Portable Baseline, I described implementing an &lt;em&gt;interpreter&lt;&#x2F;em&gt; for IC
bodies, and an interpreter for JS bytecode that can invoke these IC
bodies. Developing two compiler backends for these two kinds of
bytecode (CacheIR and JS) is quite the mountain to climb, and one
naturally wonders whether all the effort to design, build, and debug
the interpreters could be reused somehow.  If you do indeed wonder
this, then you&#x27;ll love the upcoming &lt;em&gt;part 3&lt;&#x2F;em&gt; of this series, where I
describe how we can derive these compiler backends &lt;em&gt;automatically&lt;&#x2F;em&gt;
from the interpreters, reducing maintenance burden and complexity
significantly. Stay tuned!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to Luke Wagner, Nick Fitzgerald, and Max Bernstein for reading
and providing feedback on a draft of this post!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;On the Web, a Wasm module can &quot;trampoline&quot; out to JavaScript to
load and instantiate another module with new code, sharing the
same heap, then invoke that new module via a shared
function-reference table. This is technically workable but
somewhat slow and cumbersome, and this mechanism does not exist
on Wasm-only platforms. Note that the primary reason is the
design &lt;em&gt;choice&lt;&#x2F;em&gt; to allow code-loading only via a control plane;
nothing &lt;em&gt;technically&lt;&#x2F;em&gt; stops the platform from providing a direct
Wasm hostcall for the same purpose.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Path Generics in Rust: A Sketch Proposal for Simplicity and Generality</title>
        <published>2024-06-12T00:00:00+00:00</published>
        <updated>2024-06-12T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2024/06/12/rust-path-generics/"/>
        <id>https://cfallin.org/blog/2024/06/12/rust-path-generics/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2024/06/12/rust-path-generics/">&lt;p&gt;The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.rust-lang.org&#x2F;&quot;&gt;Rust&lt;&#x2F;a&gt; programming language is
best-known for its memory-related type system features that encode
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;book&#x2F;ch04-00-understanding-ownership.html&quot;&gt;ownership and
borrowing&lt;&#x2F;a&gt;:
these ensure memory safety (no dangling pointers), and also enforce a
kind of &quot;mutual exclusion&quot; discipline that allows for provably safe
parallelism. It&#x27;s fantastic stuff; but it can also be utterly
maddening when one attempts to twist the borrow checker in a direction
it doesn&#x27;t want to go.&lt;&#x2F;p&gt;
&lt;p&gt;In a recent &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;smallcultfollowing.com&#x2F;babysteps&#x2F;blog&#x2F;2024&#x2F;06&#x2F;02&#x2F;the-borrow-checker-within&#x2F;&quot;&gt;blog
post&lt;&#x2F;a&gt;,
Niko Matsakis described an ambitious (but in my opinion very
achievable) vision for perfecting the &quot;borrow checker within&quot;: a
series of well-considered generalizations and features that make
Rust&#x27;s lifetime and borrowing system more ergonomic and a better fit
for a wider variety of usage patterns.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, my goal is to tie an analogy from the &quot;place-based
lifetimes&quot; in that post and an idea I proposed in 2020 called
&quot;deferred borrows&quot; (&lt;a href=&quot;&#x2F;pubs&#x2F;ecoop2020_defborrow.pdf&quot;&gt;paper&lt;&#x2F;a&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;). I&#x27;ll
describe how I see &lt;em&gt;path generics&lt;&#x2F;em&gt; (one might call them &quot;place
generics&quot; in analogy with place-based lifetimes) as a new idea in Rust
that could allow place-based lifetimes, but also when generalized
sufficiently, allow significant new expressivity in terms of &lt;em&gt;branded
types&lt;&#x2F;em&gt; -- that is, one value in one place (say, a handle) irrevocably
tied to another (say, a container). Hopefully all of this will become
clear soon.&lt;&#x2F;p&gt;
&lt;p&gt;Note also that this is very much &lt;em&gt;not&lt;&#x2F;em&gt; my day job: I work &lt;em&gt;in&lt;&#x2F;em&gt; Rust,
not &lt;em&gt;on&lt;&#x2F;em&gt; Rust -- so take these as the semi-amateur scribblings that
they are, for inspiration or discussion material at best; but I have
no plans at the moment to try to push this further, beyond writing up
the ideas as they exist in my head, with the hope they might be
interesting.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;recap-places-as-lifetimes&quot;&gt;Recap: Places as Lifetimes&lt;&#x2F;h2&gt;
&lt;p&gt;In Rust today, memory is managed primarily by an &quot;ownership tree&quot;
idiom built around RAII and linear types; some types -- structs and
standard-library containers -- can own other types in turn. Unique
ownership that is tied to program scopes with RAII is extremely useful
because, by itself, it guarantees memory safety: we always know when
to free memory, and there is no way to access it later because it is
no longer in scope. This scheme also allows for race-free parallelism
because subpieces of the program heap have completely distinct names
and no aliasing.&lt;&#x2F;p&gt;
&lt;p&gt;However, unique tree ownership is quite cumbersome on its own, hence
borrows: temporary loaning of a subtree (by reference) as its own
first-class value. To ensure this is safe, the compiler checks that we
don&#x27;t hold the pointer for too long -- for example, longer than the
lifetime of the original owning path. Rust reifies this concept at the
syntax level with a &lt;em&gt;lifetime&lt;&#x2F;em&gt; and allows naming explicit lifetimes at
the type level when describing borrows. Because ownership ultimately
traces back to RAII and local bindings (perhaps in &lt;code&gt;main()&lt;&#x2F;code&gt;, but
somewhere up the stack&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;), these lifetimes correspond to static
spans of code in some active stack frame.&lt;&#x2F;p&gt;
&lt;p&gt;This is all very abstract; the language has precise definitions, of
course, but the intuition can be difficult to internalize, especially
when lifetimes arise as lifetime parameters. For example, the function
signature&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; f&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;, &amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;b&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; &amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;) {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;defines a function that works in an abstract context where two
lifetimes -- two sets of code spans -- exist either directly in the
caller or further up the stack, and for all times when &lt;code&gt;&#x27;a&lt;&#x2F;code&gt; is valid,
&lt;code&gt;&#x27;b&lt;&#x2F;code&gt; is too (due to the &quot;outlives&quot; constraint introduced with
&lt;code&gt;:&lt;&#x2F;code&gt;). This is a useful context to establish -- for example, it may let
one store a reference to something in &lt;code&gt;&#x27;b&lt;&#x2F;code&gt; in a field of something in
&lt;code&gt;&#x27;a&lt;&#x2F;code&gt; -- but it&#x27;s &lt;em&gt;very hard&lt;&#x2F;em&gt; for one to grasp if one hasn&#x27;t seen this
concept before.&lt;&#x2F;p&gt;
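&lt;p&gt;A minimal sketch of what that established context permits (the names here are illustrative): the &lt;code&gt;&#x27;b: &#x27;a&lt;&#x2F;code&gt; bound is exactly what lets a &lt;code&gt;&#x27;b&lt;&#x2F;code&gt;-lived reference be written through an &lt;code&gt;&#x27;a&lt;&#x2F;code&gt;-lived slot:&lt;&#x2F;p&gt;

```rust
// The outlives bound 'b: 'a allows storing a &'b reference into a
// location that is only borrowed for the shorter lifetime 'a: the
// borrow checker knows the referent remains valid for all of 'a.
fn store<'a, 'b: 'a>(slot: &'a mut &'b u32, value: &'b u32) {
    *slot = value;
}

fn main() {
    let x = 10u32;
    let y = 20u32;
    let mut slot = &x;    // slot currently borrows x
    store(&mut slot, &y); // legal: y outlives the mutable borrow of slot
    assert_eq!(*slot, 20);
}
```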
&lt;p&gt;At the core of the intuitive difficulty is the fact that this is
another &lt;em&gt;kind&lt;&#x2F;em&gt; of program entity for the programmer to mentally
track. In scope-based resource-management paradigms (such as C++ or
Rust RAII) one is already somewhat accustomed to thinking of an
object&#x27;s lifetime, in some scope, conveying ownership. Lifetimes can
feel like some other semantic layer that is either redundantly
describing that structure, or perhaps coarsening&#x2F;summarizing it into
&quot;categories&quot; that are enough to prove safety to the borrow checker
somehow.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The idea conveyed in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;smallcultfollowing.com&#x2F;babysteps&#x2F;blog&#x2F;2024&#x2F;06&#x2F;02&#x2F;the-borrow-checker-within&#x2F;&quot;&gt;Niko Matsakis&#x27;
post&lt;&#x2F;a&gt;
-- noted as already existing under-the-hood in a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rust-lang&#x2F;polonius&quot;&gt;new borrow
checker&lt;&#x2F;a&gt;, but surfaced as
syntax and type-system semantics in a new proposal -- is a wonderful
simplification: name &lt;em&gt;storage places&lt;&#x2F;em&gt;, which are already roots of the
ownership tree, as lifetimes as well. In essence, a borrow temporarily
transfers ownership; so why not name the location the ownership came
from, to allow checking compatibility of the lifetimes directly?&lt;&#x2F;p&gt;
&lt;p&gt;The blog post gives a simple example where a lifetime parameter is
needed today, but use of a place-name instead is more intuitive:
functions that return borrows to a piece of an argument. Rather than
writing&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; find_element&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-language&quot;&gt; self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; key&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; K&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt; &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a V&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where the &lt;code&gt;&#x27;a&lt;&#x2F;code&gt; exists only to tie &lt;code&gt;self&lt;&#x2F;code&gt; to the return value, we can
instead write&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; find_element&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; key&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; K&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt; &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;self V&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;with the same result. The post then expands further into
self-referential types, where one element of a struct may borrow from
another:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  text&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; String&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  pieces&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Vec&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;text &lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;str&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;which is an excellent improvement not just in ergonomics, as above,
but actual expressivity as well. A variant of this pattern can occur
when &lt;code&gt;text&lt;&#x2F;code&gt; is a local binding and we build a local index into it:
then given a local &lt;code&gt;let text: String = ...;&lt;&#x2F;code&gt; we may later have a type
such as &lt;code&gt;Vec&amp;lt;&amp;amp;&#x27;text str&amp;gt;&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;There is one remaining expressivity gap with the new kind of lifetime:
it must have a binding in-scope to use it as a lifetime. In other
words, one can say&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; process_whole_and_part&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;whole&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; part&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;whole T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;) {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;but if &lt;code&gt;whole&lt;&#x2F;code&gt; is not a parameter in that signature, or equivalently
not a sibling field in a struct with borrow fields, one still needs a
traditional lifetime parameter. As far as I can tell, the proposal
does not propose removing lifetime parameters entirely (nor would one
want to in a backwards-compatible evolution of the language); but
could we close the gap by introducing a little more syntax (thus in
turn creating a simpler and more uniform semantic landscape)?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;replacing-lifetimes-completely-path-parameters&quot;&gt;Replacing Lifetimes Completely: Path Parameters?&lt;&#x2F;h2&gt;
&lt;p&gt;Here&#x27;s the core of my proposal: allow &lt;em&gt;path parameters&lt;&#x2F;em&gt; alongside type
parameters and lifetime parameters for any generic type. This
generalizes the places-as-lifetimes proposal by allowing introductions
of abstract place names, and would look something like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; P&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  borrow&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;P u32&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;... which looks almost exactly like a lifetime parameter. So what has
changed exactly?&lt;&#x2F;p&gt;
&lt;p&gt;The key bit is that when we &lt;em&gt;use&lt;&#x2F;em&gt; this type elsewhere, it is tied to a
specific &lt;em&gt;path&lt;&#x2F;em&gt; (that is, a variable binding) rather than an abstract
lifetime. This is an important difference: it binds two values, the
borrow and the borrow-ee, together more tightly than a lifetime does.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#4&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;For example, where we use &lt;code&gt;S&lt;&#x2F;code&gt;, we might write (with full type
ascriptions here for clarity):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; foo&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; i&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; u32&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; s&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;i&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; borrow&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;i&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; };&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;when the generic struct type is instantiated with the path parameter
&lt;code&gt;i&lt;&#x2F;code&gt;, we get a borrow of type &lt;code&gt;&amp;amp;&#x27;i u32&lt;&#x2F;code&gt;, just as we saw above. So far
so good.&lt;&#x2F;p&gt;
&lt;p&gt;We gain some nice clarity-of-intent when we use these path parameters
in context structs and the like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Ctx&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Data&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  part1&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Data T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  part2&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Data U&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; find_parts&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;data&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;])&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Ctx&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;data&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is fairly nice for self-documentation purposes, and is perhaps
more intuitive than a separate concept of lifetimes, but in idiomatic
Rust today we sometimes already see descriptive lifetime names --
&lt;code&gt;&#x27;ctx&lt;&#x2F;code&gt;, &lt;code&gt;&#x27;data&lt;&#x2F;code&gt;, &lt;code&gt;&#x27;input&lt;&#x2F;code&gt;, and the like. What actual new expressive
powers -- powers to describe invariants to the compiler and get better
safety checks -- does this grant us?&lt;&#x2F;p&gt;
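&lt;p&gt;For concreteness, the descriptive-lifetime idiom mentioned above looks like this in today&#x27;s Rust (a sketch with illustrative types):&lt;&#x2F;p&gt;

```rust
// Today's closest equivalent of Ctx<data>: a named lifetime 'data ties
// both fields to the same borrowed input, but to a *category* of
// borrows, not to one specific binding.
struct Ctx<'data> {
    part1: &'data [u8],
    part2: &'data [u8],
}

fn find_parts<'data>(data: &'data [u8]) -> Ctx<'data> {
    let mid = data.len() / 2;
    Ctx { part1: &data[..mid], part2: &data[mid..] }
}

fn main() {
    let data = [1u8, 2, 3, 4];
    let ctx = find_parts(&data);
    assert_eq!(ctx.part1, &[1, 2]);
    assert_eq!(ctx.part2, &[3, 4]);
}
```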
&lt;p&gt;The main &quot;new power&quot; we obtain is a means of &lt;em&gt;branding&lt;&#x2F;em&gt;, as we alluded
to above: &lt;code&gt;Ctx&amp;lt;data&amp;gt;&lt;&#x2F;code&gt; is truly tied to &lt;code&gt;data&lt;&#x2F;code&gt;, not anything with a
compatible lifetime. For memory safety this may not matter as such,
but the ability to tie a handle to a &quot;parent&quot; object is something that
has been discussed
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;internals.rust-lang.org&#x2F;t&#x2F;static-path-dependent-types-and-deferred-borrows&#x2F;14270&#x2F;20&quot;&gt;1&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;rust&#x2F;comments&#x2F;kq6sz3&#x2F;design_pattern_for_compiletime_tying_of_handles&#x2F;&quot;&gt;2&lt;&#x2F;a&gt;)
and seems generally useful as a matter of expressing API
invariants. Certainly whenever implementing an index-based (ECS-like)
system in Rust -- such as for general graphs in a compiler IR -- it
would be nice to express &lt;code&gt;Id&amp;lt;graph&amp;gt;&lt;&#x2F;code&gt; (where &lt;code&gt;graph&lt;&#x2F;code&gt; is a specific
&lt;code&gt;Graph&lt;&#x2F;code&gt; object) rather than just &lt;code&gt;Id&lt;&#x2F;code&gt;. Especially so when multiple
index spaces exist (instruction indices in two different function
bodies or the like).&lt;&#x2F;p&gt;
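&lt;p&gt;As an aside: today this kind of branding can be approximated with an invariant lifetime, in the style of the &quot;generativity&quot; (GhostCell-like) pattern. A minimal sketch, with hypothetical names not taken from any particular crate:&lt;&#x2F;p&gt;

```rust
use std::marker::PhantomData;

// An invariant lifetime 'brand "brands" a Graph and the Ids it hands out,
// so Ids from one graph cannot be used with another.
struct Graph<'brand> {
    edges: Vec<Vec<usize>>,
    _brand: PhantomData<fn(&'brand ()) -> &'brand ()>, // invariant in 'brand
}

#[derive(Clone, Copy)]
struct Id<'brand> {
    index: usize,
    _brand: PhantomData<fn(&'brand ()) -> &'brand ()>,
}

impl<'brand> Graph<'brand> {
    fn add_node(&mut self) -> Id<'brand> {
        self.edges.push(Vec::new());
        Id { index: self.edges.len() - 1, _brand: PhantomData }
    }

    fn degree(&self, id: Id<'brand>) -> usize {
        self.edges[id.index].len()
    }
}

// The higher-ranked closure mints a fresh, unnameable brand per graph.
fn with_graph<R>(f: impl for<'brand> FnOnce(Graph<'brand>) -> R) -> R {
    f(Graph { edges: Vec::new(), _brand: PhantomData })
}
```

&lt;p&gt;Because &lt;code&gt;&#x27;brand&lt;&#x2F;code&gt; is invariant and fresh on each call, an &lt;code&gt;Id&lt;&#x2F;code&gt; from one graph is a type error when passed to another -- a clunky stand-in for writing &lt;code&gt;Id&amp;lt;graph&amp;gt;&lt;&#x2F;code&gt; directly.&lt;&#x2F;p&gt;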
&lt;p&gt;So to recap: we&#x27;ve seen how taking paths rather than lifetimes in
general parameter lists can give us the same &quot;borrow to an external
thing&quot; capability inside an aggregate type, and also lets us bind
&lt;em&gt;which&lt;&#x2F;em&gt; external thing a little more tightly. It (kind of) reduces the
inventory of cognitive concepts by one, as well -- we no longer have
to think about lifetimes, just about local bindings. The genericity
over a path makes it clear and explicit what Rust lifetimes have always
been -- the lifetime of some object held by some stack frame somewhere
else (up the call stack).&lt;&#x2F;p&gt;
&lt;p&gt;Good so far! But some readers might now wonder: how does this
&lt;em&gt;actually&lt;&#x2F;em&gt; work with respect to the borrow checker&#x27;s tracking? The way
that traditional lifetime parameters &quot;hold a borrow open&quot; on an
external source of data is a little subtle, and depends on the code
that constructs a type: for example,&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; make_struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;data&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; [&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;])&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;causes &lt;code&gt;data&lt;&#x2F;code&gt; to be borrowed for as long as &lt;code&gt;S&lt;&#x2F;code&gt; exists because of the
combination of two constraints: well-formedness of &lt;code&gt;S&lt;&#x2F;code&gt; means that any
lifetime mentioned in &lt;code&gt;S&lt;&#x2F;code&gt; (here &lt;code&gt;&#x27;a&lt;&#x2F;code&gt;) outlives &lt;code&gt;S&lt;&#x2F;code&gt;, and then that
lifetime &lt;code&gt;&#x27;a&lt;&#x2F;code&gt; is tied to a borrow of &lt;code&gt;data&lt;&#x2F;code&gt; that is initially created
by the caller. So, in the caller&#x27;s context, whatever local path we
needed to borrow to get &lt;code&gt;data&lt;&#x2F;code&gt; will be blocked (cannot be borrowed
mutably, dropped, written to, or moved out of) while this &lt;code&gt;S&amp;lt;&#x27;a&amp;gt;&lt;&#x2F;code&gt;
exists.&lt;&#x2F;p&gt;
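&lt;p&gt;Written out in full in today&#x27;s Rust (the body here is just an illustrative sketch of what the elided &lt;code&gt;...&lt;&#x2F;code&gt; might do):&lt;&#x2F;p&gt;

```rust
// S holds borrows of the caller's data for as long as it exists.
struct S<'a> {
    part1: &'a [u8],
    part2: &'a [u8],
}

fn make_struct<'a>(data: &'a [u8]) -> S<'a> {
    // Split the input in half; both halves share the lifetime 'a.
    let (part1, part2) = data.split_at(data.len() / 2);
    S { part1, part2 }
}
```

&lt;p&gt;While the returned &lt;code&gt;S&lt;&#x2F;code&gt; is alive, the borrowed data cannot be mutated, moved, or dropped: dropping the owning &lt;code&gt;Vec&lt;&#x2F;code&gt; before the last use of &lt;code&gt;S&lt;&#x2F;code&gt;, for example, is a compile-time error.&lt;&#x2F;p&gt;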
&lt;p&gt;How do we get equivalent behavior with path parameters? With the
equivalent signature&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; make_struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;data&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;])&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;data&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;in order to enforce the same &quot;&lt;code&gt;data&lt;&#x2F;code&gt; borrowed as long as &lt;code&gt;S&lt;&#x2F;code&gt; is alive&quot;
property that we had from the signature and well-formedness above, we
need some analogous well-formedness rules for path parameters.&lt;&#x2F;p&gt;
&lt;p&gt;The simple way out would be to say that a path parameter always
borrows the path it mentions. Then we can use it for lifetimes, and we
have the same semantics as above. But wait: what kind of borrow,
immutable or mutable? In the lifetime-based signature above, we knew
this because the caller created the borrow then explicitly saw that
the borrow&#x27;s lifetime was captured by &lt;code&gt;S&lt;&#x2F;code&gt;; here &lt;code&gt;S&lt;&#x2F;code&gt; captures the
&lt;em&gt;path&lt;&#x2F;em&gt; but isn&#x27;t necessarily tied to the borrow on &lt;code&gt;data&lt;&#x2F;code&gt;. So we need
something more explicit, somewhere, to denote the long-term borrow.&lt;&#x2F;p&gt;
&lt;p&gt;Here&#x27;s an even more interesting question: could it ever be meaningful
and useful to mention, and bind to, a path, &lt;em&gt;without&lt;&#x2F;em&gt; borrowing it?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;path-parameter-modes&quot;&gt;Path Parameter Modes&lt;&#x2F;h2&gt;
&lt;p&gt;To resolve these questions, we add &lt;em&gt;borrow modes&lt;&#x2F;em&gt; to path parameters:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Data&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  part1&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Data&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; [&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  part2&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Data&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; [&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and then add a rule about well-formed types: the borrow-mode of a path
in the parameter list must subsume its mode in any use. For example,
if used to denote the lifetime of an immutable borrow as above, we
must have &lt;code&gt;path &amp;amp;Data&lt;&#x2F;code&gt;; it would be a compile-time error to have only
&lt;code&gt;path Data&lt;&#x2F;code&gt;.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#5&quot;&gt;5&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The meaning of this path parameter mode is exactly as if the &lt;code&gt;S&lt;&#x2F;code&gt; held
the corresponding borrow for its lifetime. Thus we recover the full
semantics of the struct with the traditional lifetime parameter above,
and likewise we can encode the semantics of a mutable borrow.&lt;&#x2F;p&gt;
&lt;p&gt;This also neatly handles the question of whether paths can overlap or
must be disjoint: e.g., for &lt;code&gt;struct S&amp;lt;path P, path Q&amp;gt;&lt;&#x2F;code&gt;, are &lt;code&gt;P&lt;&#x2F;code&gt; and
&lt;code&gt;Q&lt;&#x2F;code&gt; separate bindings at the instantiation site, so we can have two
fields &lt;code&gt;&amp;amp;&#x27;P mut u32&lt;&#x2F;code&gt; and &lt;code&gt;&amp;amp;&#x27;Q mut u32&lt;&#x2F;code&gt;? If we require path parameter
modes to subsume their uses, then we would need &lt;code&gt;struct S&amp;lt;path &amp;amp;mut P, path &amp;amp;mut Q&amp;gt;&lt;&#x2F;code&gt;, and then at the instantiation site the disjointness
would be enforced. Immutably-borrowed paths need not be disjoint (and
consequently we cannot assume disjointness when typechecking within
their scope).&lt;&#x2F;p&gt;
&lt;p&gt;Now a new idea arises: if we can have immutably-borrowed and
mutably-borrowed modes on path parameters, and these are explicit,
what would it mean to have &lt;em&gt;no&lt;&#x2F;em&gt; borrow on a path?&lt;&#x2F;p&gt;
&lt;p&gt;For example, one might declare&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and then create&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; make_handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-language&quot;&gt;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; foo&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;   let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; = ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;   let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; b&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; = ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;   let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; handle_a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;make_handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;   let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; handle_b&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;b&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; b&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;make_handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and &lt;code&gt;handle_a&lt;&#x2F;code&gt; and &lt;code&gt;handle_b&lt;&#x2F;code&gt; are tied irrevocably to &lt;code&gt;a&lt;&#x2F;code&gt; and &lt;code&gt;b&lt;&#x2F;code&gt;. If
&lt;code&gt;a&lt;&#x2F;code&gt; goes out of scope, then &lt;code&gt;handle_a&lt;&#x2F;code&gt;&#x27;s lifetime must have ended by
that point: that works the same as any lifetime constraint. However,
extremely importantly, we &lt;em&gt;do not borrow &lt;code&gt;a&lt;&#x2F;code&gt;&lt;&#x2F;em&gt; while this handle
exists. &lt;code&gt;a&lt;&#x2F;code&gt; could be passed as a &lt;code&gt;&amp;amp;mut self&lt;&#x2F;code&gt; to various method calls,
we could do arbitrary things to it, and the handle type allows that;
it is &lt;em&gt;only&lt;&#x2F;em&gt; tied to the continued &lt;em&gt;existence&lt;&#x2F;em&gt; of &lt;code&gt;a&lt;&#x2F;code&gt;, the binding.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#6&quot;&gt;6&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;What good is this, then? For one, we can define methods on the handle
that &lt;em&gt;must&lt;&#x2F;em&gt; take the original container or parent object. So we don&#x27;t
hold a borrow open for the duration of the handle, but we do take the
borrow just for a particular access or mutation. This is analogous to
the ubiquitous &quot;index into &lt;code&gt;Vec&lt;&#x2F;code&gt; as reference&quot; pattern in Rust (in
fact, keep reading!) but at the type level:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; do_mutation&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage&quot;&gt;: &amp;amp;mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;(Some might call this a &quot;singleton type&quot;. Please feel free to bikeshed
that syntax!) The idea is that we take a mutable borrow of the path
that we irrevocably tied ourselves to -- but we only take that during
the method call.&lt;&#x2F;p&gt;
&lt;p&gt;But if the typechecker will &lt;em&gt;only&lt;&#x2F;em&gt; let us pass that path to the
method, why require it to be written at all? Hence the next idea,
&lt;em&gt;implicit path-constrained parameters&lt;&#x2F;em&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; do_mutation&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage&quot;&gt;: &amp;amp;mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;](&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;(Please &lt;em&gt;really&lt;&#x2F;em&gt; bikeshed that syntax: is a separate argument list
really the best?) This means that we call the method with a signature
that takes only &lt;code&gt;&amp;self&lt;&#x2F;code&gt; -- in some sense, its type really is (a subtype
of) &lt;code&gt;Fn(&amp;Self)&lt;&#x2F;code&gt; -- but it has captured the &lt;em&gt;path&lt;&#x2F;em&gt; and will borrow it
when called.&lt;&#x2F;p&gt;
&lt;p&gt;Then we can do something like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; use_it&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  let mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; = ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; handle_1&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;lookup_elt&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;1&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; handle_2&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;lookup_elt&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;2&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  handle_1&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;do_mutation&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;();&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; implicitly borrows `&amp;amp;mut a`, just for the call&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  handle_2&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;do_mutation&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;();&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; implicitly borrows `&amp;amp;mut a`, just for the call&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Pretty convenient, no? (If you&#x27;re worried about the dangling-iterator
problem -- what if a mutation invalidates a handle? -- read on.)&lt;&#x2F;p&gt;
&lt;p&gt;There are a few bikeshedding points noted above, and possible
extensions such as type constraints on paths (say that I want &lt;code&gt;path Parent&lt;&#x2F;code&gt; to be a &lt;code&gt;T&lt;&#x2F;code&gt; specifically), but overall I like this design and
believe it has some potential beyond just ergonomics. Before I get to
that though -- under &quot;deferred borrows&quot; below -- one more detour, to
talk about mutable borrows and invalidation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;precise-invalidation-and-encoding-data-structure-properties&quot;&gt;Precise Invalidation and Encoding Data Structure Properties&lt;&#x2F;h2&gt;
&lt;p&gt;Above, we created &quot;handles&quot; that are tied to a parent object; they
hold no borrow or other form of reservation until use. We irrevocably
tied the handles to the &lt;em&gt;binding&lt;&#x2F;em&gt; &lt;code&gt;a&lt;&#x2F;code&gt;, but what stops an intervening
call&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; old&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; std&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;mem&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;::&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;replace&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage&quot;&gt;&amp;amp;mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; new_container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;from replacing whatever internal state the handles referenced?&lt;&#x2F;p&gt;
&lt;p&gt;Conventional, idiomatic Rust with borrow-based APIs avoids this
scenario by holding the borrow on the container&#x2F;parent object for the
entire lifetime of the handle. This essentially freezes the container
in place so that whatever piece we&#x27;re logically referencing with the
handle continues to exist. But by doing so, we give up all the
convenience (and expressive power!) of handles-without-borrows: it is
often useful to collect &quot;fingers&quot; into a data structure, then perform
mutations through them later. For example, we may have several
&lt;code&gt;handle_a&lt;&#x2F;code&gt;s above (an array of them, or a hashmap, perhaps), and wish
to write to a referenced element through some handle, and may not know
which until we&#x27;ve collected several handles.&lt;&#x2F;p&gt;
&lt;p&gt;But why is that desired usage pattern safe? The only reason is that we
know that our later mutations will mutate &lt;em&gt;elements&lt;&#x2F;em&gt;, but not the
shape of the container itself. In essence, idiomatic Rust with
handles-that-hold-borrows blurs this distinction because that is the
best the type system can do. This is where the index-as-reference
pattern can save us again: if we use indices into a &lt;code&gt;Vec&lt;&#x2F;code&gt;, we don&#x27;t
actually borrow until we directly access an element. This works too,
but leads to other sorts of awkwardness. For example, in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;&quot;&gt;regalloc2&lt;&#x2F;a&gt;, which
makes pervasive use of the index-as-reference pattern, I often have to
&quot;re-derive&quot; a true borrow from an index because I perform some other
mutation (to some other disjoint state that I know shouldn&#x27;t matter)
in the meantime.&lt;&#x2F;p&gt;
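&lt;p&gt;For readers who haven&#x27;t run into it, the index-as-reference pattern and the &quot;re-derive&quot; step look roughly like this in plain Rust (hypothetical toy types, not actual regalloc2 code):&lt;&#x2F;p&gt;

```rust
// A "handle" is just a plain index; it holds no borrow on the container.
#[derive(Clone, Copy)]
struct NodeIdx(usize);

struct Graph {
    values: Vec<u32>,
}

impl Graph {
    fn add(&mut self, v: u32) -> NodeIdx {
        self.values.push(v);
        NodeIdx(self.values.len() - 1)
    }
}

fn example() -> u32 {
    let mut g = Graph { values: Vec::new() };
    // Collect several "fingers" into the structure up front...
    let a = g.add(10);
    let b = g.add(20);
    // ...freely mutate in between, since the indices hold no borrows...
    g.add(30);
    // ...then re-derive true borrows from the indices only at each access.
    let addend = g.values[b.0]; // short immutable borrow, released immediately
    g.values[a.0] += addend;    // short mutable borrow, released immediately
    g.values[a.0]
}
```

&lt;p&gt;The cost, as noted above, is that nothing ties a &lt;code&gt;NodeIdx&lt;&#x2F;code&gt; to a particular &lt;code&gt;Graph&lt;&#x2F;code&gt;, and every access re-borrows by hand.&lt;&#x2F;p&gt;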
&lt;p&gt;In principle we should be able to encode in the type system that some
handle relies on the &quot;shape&quot; of the container or parent object
remaining the same, but nothing else. For this, we propose an idiom
that we call &quot;virtual fields&quot; -- encoded as zero-sized unit fields,
perhaps with better syntax later -- that we can hold borrows on,
combined with the use of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;smallcultfollowing.com&#x2F;babysteps&#x2F;&#x2F;blog&#x2F;2021&#x2F;11&#x2F;05&#x2F;view-types&#x2F;&quot;&gt;view
types&lt;&#x2F;a&gt;
(also recapped in Niko&#x27;s &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;smallcultfollowing.com&#x2F;babysteps&#x2F;blog&#x2F;2024&#x2F;06&#x2F;02&#x2F;the-borrow-checker-within&#x2F;&quot;&gt;latest
post&lt;&#x2F;a&gt;)
in path parameter specs and&#x2F;or method bodies.&lt;&#x2F;p&gt;
&lt;p&gt;(To be absolutely clear, the proposal in this subsection is an
&lt;em&gt;idiom&lt;&#x2F;em&gt;, and depends only on the &quot;view types&quot; extension proposed in
those posts.)&lt;&#x2F;p&gt;
&lt;p&gt;For example, to sketch what this might look like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  pub&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; shape&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; (),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; ^^ &amp;quot;virtual field&amp;quot;: a unit-typed field on which we use borrows via&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; view types to encode which methods modify which properties.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  contents&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;shape&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; *really* needs some bikeshedding ^^&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment&quot;&gt;  &#x2F;&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; meaning is: irrevocably bind to a given container; hold an immutable&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; borrow on its `shape` field always; but don&amp;#39;t otherwise hold a borrow&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; on it&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  elt&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage&quot;&gt;: *const&lt;&#x2F;span&gt;&lt;span&gt; Element&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; internally unsafe pointer can be held because shape is constant&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; handle_mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; idx&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; usize&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-language&quot;&gt;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; ^^ on return, `self.shape` is borrowed immutably, but nothing else is&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;shape&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; deref&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;c&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;^&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;shape&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;](&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt; &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Element&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; ^^ when *this* returns, we actually borrow Container, but that borrow only&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; lasts as long as we hold the true borrow; the handle is &amp;quot;just as good&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; (except that it doesn&amp;#39;t freeze the actual element contents)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment&quot;&gt;  &#x2F;&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; The `&amp;amp;^shape` syntax means &amp;quot;immutably borrow everything in `Container`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; except the `shape` field&amp;quot;; this signature is promising that we don&amp;#39;t&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; mutate the container&amp;#39;s shape.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; deref_mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;c&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;^&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;shape&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;](&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage&quot;&gt; -&amp;gt; &amp;amp;mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Element&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that this is quickly encroaching on territory well-trodden by
&lt;code&gt;Pin&lt;&#x2F;code&gt; and pin-projection and self-referential types; this post is
certainly not the first to ever meditate on the difference between
&quot;immutable shape&quot; and deep immutability in Rust. Interesting
hypothesis: perhaps someday &lt;code&gt;Pin&lt;&#x2F;code&gt; could be encoded with careful use of
view-types and virtual fields denoting properties. Even more
interesting hypothesis: containers with invariants not currently
described in the type system today -- such as the fact that moving a
&lt;code&gt;String&lt;&#x2F;code&gt; does not move the pointed-to content, likewise for a &lt;code&gt;Vec&lt;&#x2F;code&gt; --
could loosen their signatures and encode this as well (with careful
thought toward backward-compatibility). This might be a way to encode
the &quot;self-referential struct&quot; example:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  data&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; String&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  parts&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Vec&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;data &lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;str&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where the &lt;code&gt;parts&lt;&#x2F;code&gt; hold &lt;code&gt;data&lt;&#x2F;code&gt;&#x27;s contents borrowed, but not &lt;code&gt;data&lt;&#x2F;code&gt;
itself, so the &lt;code&gt;S&lt;&#x2F;code&gt; is still movable.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ll readily admit I&#x27;ve gotten increasingly handwavy as this section
continues; but I believe there is &lt;em&gt;something&lt;&#x2F;em&gt; here and with enough
careful specification it could form a cohesive, backward-compatible
language overlay that adds expressive power to Rust APIs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;new-pattern-deferred-borrows&quot;&gt;New Pattern: Deferred Borrows&lt;&#x2F;h2&gt;
&lt;p&gt;Finally we&#x27;ve reached (one) end application of all this language
machinery: implementing container types with &lt;em&gt;deferred borrows&lt;&#x2F;em&gt;, as I
had proposed in my &lt;a href=&quot;&#x2F;pubs&#x2F;ecoop2020_defborrow.pdf&quot;&gt;2020 paper&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The key idea is to define a kind of auto-deref trait, like
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;std&#x2F;ops&#x2F;trait.Deref.html&quot;&gt;&lt;code&gt;Deref&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;, but
with a captured path and implicit parameter such that it only
(necessarily) borrows that implicit &quot;context&quot; when converting to a
true borrow. Then we could implement this trait for handles and have
them work as any other smart-pointer type today (such as &lt;code&gt;Rc&lt;&#x2F;code&gt; or
&lt;code&gt;Box&lt;&#x2F;code&gt;) except that they would not unnecessarily hold open a borrow on
the parent container; only its &quot;shape&quot; or whatever invariants are
needed to keep elements alive.&lt;&#x2F;p&gt;
&lt;p&gt;In the paper I proposed several variants of each container and a kind
of API-level typestate to encode what handles do hold borrows on,
hence how efficient they can be. For example, a &lt;code&gt;Vec&lt;&#x2F;code&gt; might become a
&lt;code&gt;FrozenVec&lt;&#x2F;code&gt;, where the length is fixed; then handles can be true
pointers, and the deferred borrows are as efficient as real
borrows. Or it could be an &lt;code&gt;AppendOnlyVec&lt;&#x2F;code&gt;, with indices, incurring an
indirection through the storage base pointer (which may change as
growth causes reallocations) on each conversion to a true borrow but
&lt;em&gt;not&lt;&#x2F;em&gt; a dynamic bounds-check. With the above &quot;virtual fields&quot; idea and
view types, I think we could get away without the typestate-like API,
instead handing out different kinds of handles (&lt;code&gt;PtrHandle&lt;&#x2F;code&gt; vs
&lt;code&gt;IndexHandle&lt;&#x2F;code&gt; vs ...) directly. Perhaps other variants could exist as
well; read the paper for more. (My intent here is not to summarize the
entire paper, but rather to show that its proposals become possible
once one has handles tied to paths and able to implicitly borrow
them.)&lt;&#x2F;p&gt;
&lt;h2 id=&quot;related-approaches&quot;&gt;Related Approaches&lt;&#x2F;h2&gt;
&lt;p&gt;One might reasonably ask: is there a way to &quot;trick&quot; the lifetime
system into giving us branded types today, where one value is
irrevocably tied to another?&lt;&#x2F;p&gt;
&lt;p&gt;Inherently what one needs is &lt;em&gt;generativity&lt;&#x2F;em&gt;, a kind of type-system
feature where each execution or instance of an operation yields a
different type.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#7&quot;&gt;7&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I am aware of at least one instance of a &quot;generativity trick&quot; with
Rust lifetimes, described by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;faultlore.com&#x2F;blah&#x2F;&quot;&gt;Gankra&lt;&#x2F;a&gt;&#x27;s
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;raw.githubusercontent.com&#x2F;Gankra&#x2F;thesis&#x2F;master&#x2F;thesis.pdf&quot;&gt;thesis&lt;&#x2F;a&gt;
in section 6.3, where closures and forced-invariance of lifetimes are
used to create separate lifetimes for separate arrays (in the example)
such that the handle type (indices in the example) can be uniquely
associated with only one array. This is undoubtedly an extremely
impressive hack; nevertheless, better would be for the language to
provide a first-class, readable way for the user to write their intent
(&lt;code&gt;Handle&amp;lt;myvec&amp;gt;&lt;&#x2F;code&gt;) without workaround.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;This blog post has outlined an idea for &quot;path generics&quot;, where path
parameters live alongside (and perhaps someday mostly replace?)
lifetime parameters as a more intuitive means of describing the
lifetime origin of some borrowed data. Along with that, we&#x27;ve seen how
they add expressive power in &quot;branding&quot; one type to be tied
irrevocably to some parent value (path), allowing for typesafe &quot;handle
patterns&quot;, and how a bit of configurability in their interaction with
borrow checking can lead to interesting and useful new APIs that were
strictly inexpressible before.&lt;&#x2F;p&gt;
&lt;p&gt;What do I hope to come of all this? As I noted in a footnote&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;
above, in hindsight I&#x27;ve tried to push ideas in the &quot;deferred borrows
family&quot; in sort of halfhearted ways before; mainly, I have a day-job
that is &quot;in Rust&quot;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#8&quot;&gt;8&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; but is not &quot;the Rust language&quot; and I don&#x27;t feel
I have the energy to push forward a full language proposal on the side
or really go much beyond this post. (See above also re: me not being a
properly trained PL person.) But I &lt;em&gt;do&lt;&#x2F;em&gt; think there may be something
here, to the extent I want to braindump a bit and see what folks
think, especially now that there is some explicit thinking and
discussion about &quot;places as lifetimes&quot;, view types, and other
borrow-system extensions.&lt;&#x2F;p&gt;
&lt;p&gt;So: maybe someone else will also see some value here (and see past the
undoubtedly rough bikesheddable surface design details) and explore
further. Maybe not. Either way the ideas are written down and out
there. Feedback welcome!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;In hindsight, this paper was very much not the right way to get
the idea out. I published this paper just after entering
industry from academia, still not quite sure what I wanted to
do; I didn&#x27;t feel I had the energy to engage with Rust and
actually drive an idea forward, but I did want to write it up
&lt;em&gt;somewhere&lt;&#x2F;em&gt;, so I attempted that rare beast, a single-author
paper written in one&#x27;s free time (with all the corresponding
limits on completeness). This blog post is, in one way of seeing
it, another attempt at &quot;putting it out there&quot;: I almost
certainly don&#x27;t have the bandwidth to spec or implement this but
I&#x27;m curious what folks think and it feels more relevant now that
it is an &quot;incremental generalization&quot; on top of another idea
just proposed.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;For pedagogical reasons let&#x27;s ignore global variables and
&lt;code&gt;&#x27;static&lt;&#x2F;code&gt; as well for now. The former is unsafe for a reason,
and the latter is the &quot;boring pathological case&quot; where data is
valid because it literally lives forever. In principle any
interesting lifetime in Rust is tied to a span of code executing
in some stack frame at some level.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;In day-to-day use of Rust, explicit lifetimes are fairly rare
outside of writing custom containers and perhaps &quot;context
structs&quot; that bundle a set of borrows; but they &lt;em&gt;do&lt;&#x2F;em&gt; occur, and
sometimes efficiency (non-copying) concerns and layering force
nested lifetimes as well. The worst I&#x27;ve personally had to write
was a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;af59c4d568d487b7efbb49d7d31a861e7c3933a6&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;opts.rs#L76-L79&quot;&gt;three-level
nesting&lt;&#x2F;a&gt;
necessary to encode a long-lived data structure, an analysis
context over it, and an iteration context within that analysis
context. It would have been great to have some better
abstraction here! Even I was a bit frustrated, and I am solidly
on Team &quot;Borrow Checker Saves Me Time and Improves My Code In
The Long Run&quot;!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;4&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;Path variables used as lifetimes are semantically &lt;em&gt;different&lt;&#x2F;em&gt;,
and in the way described more powerful, but they also do not
fully subsume separate lifetime variables, as far as I can tell:
one cannot have a path that is the &quot;meet&quot; of two different
paths, e.g. a borrow in a struct that comes from either &lt;code&gt;P&lt;&#x2F;code&gt; &lt;em&gt;or&lt;&#x2F;em&gt;
&lt;code&gt;Q&lt;&#x2F;code&gt; (both of which live long enough). For this reason one may
still want to use conventional lifetimes, or perhaps path
parameters could be extended with a notion of &quot;union paths&quot;
(TBD!).&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;5&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;5&lt;&#x2F;sup&gt;
&lt;p&gt;There are likely some more details in exactly how path
parameters interact with both well-formedness and subtyping that
I haven&#x27;t worked out, mostly because I am not a type theorist (I
only play one in blog posts, and then only for the fun
parts). For example, I suspect one would want a kind of &quot;plus
constraint&quot; analogous to &lt;code&gt;T + &#x27;a&lt;&#x2F;code&gt; for paths too, to say
something like &quot;&lt;code&gt;T&lt;&#x2F;code&gt; and it&#x27;s allowed to borrow &lt;code&gt;P&lt;&#x2F;code&gt;&quot;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;6&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;6&lt;&#x2F;sup&gt;
&lt;p&gt;One might worry that this is insufficient for a really robust
handle type: what if I take a mutable borrow of &lt;code&gt;a&lt;&#x2F;code&gt; and do a
&lt;code&gt;std::mem::replace&lt;&#x2F;code&gt; on it? This proposal&#x27;s answer is that it is
the responsibility of the API author to define signatures such
that this is not possible; but with &quot;virtual fields&quot; (keep
reading) there is a way to enforce this.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;7&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;7&lt;&#x2F;sup&gt;
&lt;p&gt;Note this is a bit subtle: we could mean each operation at a
different static program point -- this is the kind of
generativity that, say, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ocaml.org&#x2F;manual&#x2F;5.2&#x2F;generativefunctors.html&quot;&gt;OCaml generative functor
application&lt;&#x2F;a&gt;
provides -- or we could mean that each &lt;em&gt;dynamic&lt;&#x2F;em&gt; instance of,
say, an object allocation produces a conceptually new type. We
want something like the latter for branding: it shouldn&#x27;t be
possible to allocate a list of objects in a loop, put them all
in a vector of same-typed values and mix up their handles.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;8&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;8&lt;&#x2F;sup&gt;
&lt;p&gt;Cranelift, Wasmtime, and other compilers and runtimes stuff, so
very intensively &quot;in Rust&quot;, but definitely the language itself
is out of scope!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Fast(er) JavaScript on WebAssembly: Portable Baseline Interpreter and Future Plans</title>
        <published>2023-10-11T00:00:00+00:00</published>
        <updated>2023-10-11T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2023/10/11/spidermonkey-pbl/"/>
        <id>https://cfallin.org/blog/2023/10/11/spidermonkey-pbl/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2023/10/11/spidermonkey-pbl/">&lt;p&gt;For the past year, I have been hard at work trying to improve the
performance of the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;spidermonkey.dev&#x2F;&quot;&gt;SpiderMonkey&lt;&#x2F;a&gt;
JavaScript engine when compiled as a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;webassembly.org&#x2F;&quot;&gt;WebAssembly&lt;&#x2F;a&gt; module. For server-side
applications that use WebAssembly (and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;wasi.dev&#x2F;&quot;&gt;WASI&lt;&#x2F;a&gt;, its
&quot;system&quot; layer) as a software distribution and sandboxing technology
with &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;webassembly-the-updated-roadmap-for-developers&quot;&gt;significant exciting
potential&lt;&#x2F;a&gt;,
this is an important enabling technology: it allows existing software
written in JavaScript to be run within the sandboxed environment and
to interact with other Wasm modules.&lt;&#x2F;p&gt;
&lt;p&gt;Running an entire JavaScript engine &lt;em&gt;inside&lt;&#x2F;em&gt; of a Wasm module may seem
like a strange approach at first, but it serves real use-cases. There
are platforms that accept WebAssembly-sandboxed code for security
reasons, as it ensures complete memory isolation between requests
while remaining very fine-grained (hence with lower overheads). In
such an environment, JavaScript code needs to bring its own engine,
because no platform-native JS engine is provided. This approach ensures
a sandbox &lt;em&gt;without trusting the JavaScript engine&#x27;s security&lt;&#x2F;em&gt; --
because the JS engine is just another application on the hardened Wasm
platform -- and carries other benefits too: for example, the JS code
can easily interact with other languages that compile to Wasm, and we
can leverage Wasm&#x27;s determinism and modularity to snapshot execution
and then perform extremely fast cold startup. We have been using this
strategy to great success for a while now: we did the initial port of
SpiderMonkey to WASI in 2020, and we wrote &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;making-javascript-run-fast-on-webassembly&quot;&gt;two years
ago&lt;&#x2F;a&gt;
about how we can leverage Wasm&#x27;s clean modularity and determinism to
use snapshots for fast startup.&lt;&#x2F;p&gt;
&lt;p&gt;This post is an update to that effort. At the end of that prior post,
we hinted at efforts to bring more of the JavaScript performance
engineering magic that browsers have done to the JS-in-Wasm
environment. Today we&#x27;ll see how we&#x27;ve successfully adapted &lt;em&gt;inline
caches&lt;&#x2F;em&gt;, achieving significant speedups (~2x in some cases) without
compromising the security of the interpreter-based strategy. At the
end of this post, I&#x27;ll hint at how we plan to use this as a foundation
for ahead-of-time compilation as well. Exciting times ahead!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Note: the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;docs.google.com&#x2F;document&#x2F;d&#x2F;1XZZnc5xxfOVnemxRrTbZqvCg0PH4B9GyGgXnBhULROA&quot;&gt;design
document&lt;&#x2F;a&gt;
is also available, and SpiderMonkey patches can be found on the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bugzilla.mozilla.org&#x2F;show_bug.cgi?id=1855321&quot;&gt;upstreaming
bug&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;current-state-interpreters-only-beyond-this-point&quot;&gt;Current State: Interpreters Only Beyond This Point&lt;&#x2F;h2&gt;
&lt;p&gt;A distinguishing feature of some platforms is that they &lt;em&gt;do not allow
runtime code generation&lt;&#x2F;em&gt;. For example, a WebAssembly module may
contain functions; these functions are compiled at some point before
the functions are run; but the functions cannot do anything to create
&lt;em&gt;new&lt;&#x2F;em&gt; functions in the same code space, at least without going through
some lower-level and nonstandard system interface.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, JavaScript engines over the past several decades
have &lt;em&gt;embraced&lt;&#x2F;em&gt; runtime code generation. The basic reason for this is
that there are a lot of facts about a JS program that one cannot know
(or not easily, without human intelligence and reasoning and a view of
the whole program) until the program executes. For example: in the
simple one-line function &lt;code&gt;function(x) { return x + x; }&lt;&#x2F;code&gt;, what is
&lt;code&gt;x&lt;&#x2F;code&gt;&#x27;s type? It could be an integer, a floating-point number, a string,
or an object that can be converted to a string, or probably many other
things.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; If one were to try to generate machine code for that
function ahead-of-time, it would look very different from that of,
say, a C function with the same &lt;code&gt;return x + x;&lt;&#x2F;code&gt; body but an
integer-typed &lt;code&gt;x&lt;&#x2F;code&gt;. It would have to contain type-checks, branches, and
implementations for all the different possible cases. With that
dynamic dispatch overhead, and the bloated and difficult-to-optimize
function body that supports all combinations of types, we are unlikely
to see much speedup &lt;em&gt;unless&lt;&#x2F;em&gt; we can know something about the types. A
similar problem arises in other &quot;dynamic&quot; aspects of the language:
when we say &lt;code&gt;obj.x&lt;&#x2F;code&gt;, where in the memory for the object &lt;code&gt;obj&lt;&#x2F;code&gt; is the
field &lt;code&gt;x&lt;&#x2F;code&gt;? When we call a function &lt;code&gt;f(1, 2, 3)&lt;&#x2F;code&gt;, is &lt;code&gt;f&lt;&#x2F;code&gt; another
JavaScript function, a native function in the runtime, or something
even more special that we handle in a different way?&lt;&#x2F;p&gt;
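&lt;p&gt;As a concrete illustration, here is a minimal C++ sketch (invented
types and names, nothing from SpiderMonkey itself) of what a generic
&lt;code&gt;+&lt;&#x2F;code&gt; over tagged values has to do on every call:&lt;&#x2F;p&gt;

```cpp
// Illustrative sketch only: a toy tagged-value type and a generic "+"
// that must branch on operand types at runtime.
enum class Tag { Int32, Double, String };

struct Value {
  Tag tag;
  long long i;    // valid when tag == Tag::Int32
  double d;       // valid when tag == Tag::Double
  const char* s;  // valid when tag == Tag::String (unused here)
};

// A generic add: every call pays for the tag checks and dispatch,
// which is exactly what a specialized fast-path lets us skip.
Value genericAdd(Value a, Value b) {
  if (a.tag == Tag::Int32) {
    if (b.tag == Tag::Int32) {
      return Value{Tag::Int32, a.i + b.i, 0.0, nullptr};
    }
  }
  if (a.tag == Tag::Double) {
    if (b.tag == Tag::Double) {
      return Value{Tag::Double, 0, a.d + b.d, nullptr};
    }
  }
  // ... string concatenation, int/double coercion, valueOf calls, etc.
  return Value{Tag::Double, 0, 0.0, nullptr};  // placeholder fallback
}
```

&lt;p&gt;A compiled version specialized to, say, two &lt;code&gt;int32&lt;&#x2F;code&gt;
operands can skip straight to the integer add.&lt;&#x2F;p&gt;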
&lt;p&gt;The modern JavaScript engine&#x27;s answer to performance, then, is to
collect a lot of information as a program executes and then
dynamically generate machine code that &lt;em&gt;is&lt;&#x2F;em&gt; specialized to what the
program is actually doing, but can fall back to the generic
implementation if conditions change. Because we can&#x27;t generate this
code until the program is already running, we need the platform to
support the ability to add more executable code as we are running:
this is &quot;runtime codegen&quot;, or as it is often known, &quot;JIT
(just-in-time) compilation&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;So we have a conundrum: we run JavaScript inside a Wasm module, we
want performance, but the usual way to get that performance is to
JIT-compile specialized machine-code versions of the JS code after
observing it, and we can&#x27;t do that from within a Wasm module. What are
we to do?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;systematic-fast-paths-inline-caches&quot;&gt;Systematic Fast-Paths: Inline Caches&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s worth understanding &lt;em&gt;how&lt;&#x2F;em&gt; JITs specialize the execution of the
program based on runtime observations. A simple approach might be to
build &quot;ad-hoc&quot; kinds of observations into the interpreter, and use
those as needed in a type-specific &quot;specializing&quot; compiler. For
example, we could record types seen at a &lt;code&gt;+&lt;&#x2F;code&gt; operator at a given point
in the program, and then generate only the cases we&#x27;ve observed when
we compile that operator (perhaps with &quot;fallback code&quot; to call into a
generic implementation if our assumption becomes wrong). However, this
ad-hoc approach does not scale well: every semantic piece (operators,
function calls, object accesses) of the language implementation would
have to become a profiler, an analysis pass, and a profile-guided
compiler.&lt;&#x2F;p&gt;
&lt;p&gt;Instead, JITs specialize with a general &lt;em&gt;dispatch mechanism&lt;&#x2F;em&gt; known as
&quot;inline caches&quot; (ICs), and ICs build straight-line sequences of
&quot;fast-paths&quot; in a uniform &lt;em&gt;program representation&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The usual approach is to define certain points in the original program
at which we have an operator with widely-varying behavior, and place
an inline cache site in the compiled code. The idea of an inline cache
site is that it performs an indirect dispatch to some other &quot;stub&quot;:
these are &quot;pluggable implementations&quot; that replace a generic operator,
like &lt;code&gt;+&lt;&#x2F;code&gt;, with some specific case, like &quot;if both inputs are
&lt;code&gt;int32&lt;&#x2F;code&gt;-typed, do an integer add&quot;. For example, we might compile the
following function body into the polymorphic (type-generic) code on
the left, then generate the specialized fast-paths and attach them as
stubs on the right:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-ics-web.svg&quot; alt=&quot;Figure: Inline-cache stubs in a JavaScript function&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The IC site starts with a link to a generic implementation -- just
like the naive interpreter -- that works for any case. However, after
it executes, it also &lt;em&gt;generates a fast-path for that case&lt;&#x2F;em&gt; and
&quot;attaches&quot; the new stub to the IC site. The stubs form a &quot;chain&quot;, or
singly-linked list, with the generic case at the end. Some examples of
fast-paths that we see in practice are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For an access to a property on an object, we can generate a
fast-path that checks whether the object has a known &quot;shape&quot; --
defined as the set of existing properties and their layout in memory
-- and directly accesses the appropriate memory offset on a
match. This avoids an expensive lookup by property name.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;For any of the polymorphic built-in operators, like &lt;code&gt;+&lt;&#x2F;code&gt;, we can
generate a fast-path that checks types and does the appropriate
primitive action (integer addition, floating-point addition, or
string concatenation, say).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;For a call to a built-in or &quot;well-known&quot; function, we can generate a
fast-path that avoids a function call altogether. For example, if
the user calls &lt;code&gt;String.length&lt;&#x2F;code&gt;, and this has not been overridden
globally (we need to check!) and the input is a string, then the IC
can load the string length directly from the known length-field
location in the object. This replaces a call into the JS runtime&#x27;s
native string-implementation code with just a few IC instructions.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Each stub has a simple format: it checks some conditions, then either
does its action (if it is a matching fast-path) or jumps to the next
stub in the chain.&lt;&#x2F;p&gt;
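&lt;p&gt;In C++-flavored pseudocode (hypothetical types; the real stubs are
purpose-built machine code, not function pointers), the chain walk
looks something like:&lt;&#x2F;p&gt;

```cpp
// Illustrative sketch of an IC site's stub chain. Each stub checks
// its guards; on failure, dispatch falls through to the next stub,
// with a generic implementation (which always succeeds) at the end.
struct AddResult {
  bool handled;     // did this stub's guards pass?
  long long value;  // the result, if handled
};

typedef AddResult (*StubFn)(long long lhs, long long rhs);

struct Stub {
  StubFn fn;
  Stub* next;  // next stub in the chain; null after the generic case
};

long long icDispatch(Stub* chain, long long lhs, long long rhs) {
  for (Stub* s = chain; s != nullptr; s = s->next) {
    AddResult r = s->fn(lhs, rhs);
    if (r.handled) {
      return r.value;
    }
  }
  return 0;  // unreachable if the chain ends in a generic stub
}
```

&lt;p&gt;Attaching a new fast-path is then just prepending a node to this
list, ahead of the generic fallback.&lt;&#x2F;p&gt;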
&lt;p&gt;This collection of stubs, once &quot;warmed up&quot; by program execution, is
useful in at least two ways. First, it represents a knowledge-base of
the program&#x27;s actual behavior. The execution has been &quot;tuned&quot; to have
fast-paths inserted for cases that are actually observed, and will
become faster as a result. That is quite powerful indeed!&lt;&#x2F;p&gt;
&lt;p&gt;Second, an even more interesting opportunity arises (first introduced
in the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2020&#x2F;11&#x2F;warp-improved-js-performance-in-firefox-83&#x2F;&quot;&gt;WarpMonkey&lt;&#x2F;a&gt;
effort from the SpiderMonkey team, to their knowledge a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;2023.splashcon.org&#x2F;details&#x2F;mplr-2023-papers&#x2F;4&#x2F;CacheIR-The-Benefits-of-a-Structured-Representation-for-Inline-Caches&quot;&gt;novel
contribution&lt;&#x2F;a&gt;):
once we have the IC chains, we can use the combination of two parts --
the original program bytecode, and the pluggable stub fast-paths -- to
compile fully specialized code by &lt;em&gt;translating both to one IR and
inlining&lt;&#x2F;em&gt;. This is how we achieve specialized-variant compilation in a
systematic way: we just write out the necessary fast-paths as we need
them, and then we later incorporate them.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; The following figure
illustrates this process:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-ic-inlining-web.svg&quot; alt=&quot;Figure: optimized JS compilation by inlining ICs&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In the SpiderMonkey engine, there are three JIT tiers that make use of ICs:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2019&#x2F;08&#x2F;the-baseline-interpreter-a-faster-js-interpreter-in-firefox-70&#x2F;&quot;&gt;baseline
interpreter&lt;&#x2F;a&gt;&quot;
interprets the JS function body&#x27;s opcodes, but accelerates
individual operations with ICs. The interpreter-based approach means
we have fast startup (because we don&#x27;t need to compile the function
body), while ICs give significant speedups on many operations.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The &quot;baseline compiler&quot; translates the JS function body into machine
code on a 1-for-1 basis (each JS opcode becomes a small snippet of
machine code), and dispatches to the same ICs that the baseline
interpreter does. The main speedup over the baseline interpreter is
that we no longer have the &quot;JS opcode dispatch overhead&quot; (the cost
of fetching JS opcodes and jumping to the right interpreter case),
though we do still have the &quot;IC dispatch overhead&quot; (the cost of
jumping to the right fast-path).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The optimizing compiler, known as
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2020&#x2F;11&#x2F;warp-improved-js-performance-in-firefox-83&#x2F;&quot;&gt;WarpMonkey&lt;&#x2F;a&gt;,
inlines ICs and JS bytecode to perform specialized compilation.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We can summarize the advantages and tradeoffs of these tiers as
follows:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Data Required&lt;&#x2F;th&gt;&lt;th&gt;JS opcode dispatch&lt;&#x2F;th&gt;&lt;th&gt;ICs&lt;&#x2F;th&gt;&lt;th&gt;Optimization scope&lt;&#x2F;th&gt;&lt;th&gt;CacheIR dispatch&lt;&#x2F;th&gt;&lt;th&gt;Codegen at runtime&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Generic (C++) interpreter&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode&lt;&#x2F;td&gt;&lt;td&gt;Interpreter (C++)&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;td&gt;N&#x2F;A&lt;&#x2F;td&gt;&lt;td&gt;No&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Baseline interpreter&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures&lt;&#x2F;td&gt;&lt;td&gt;Interpreter (generated at startup)&lt;&#x2F;td&gt;&lt;td&gt;Dynamic dispatch&lt;&#x2F;td&gt;&lt;td&gt;Special cases within one opcode via IC&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;Yes (IC bodies, interp body)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Baseline compiler&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures&lt;&#x2F;td&gt;&lt;td&gt;Compiled function body (1:1 with bytecode)&lt;&#x2F;td&gt;&lt;td&gt;Dynamic dispatch&lt;&#x2F;td&gt;&lt;td&gt;Special cases within one opcode via IC&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;Yes (IC bodies, function bodies)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Optimizing compiler (WarpMonkey)&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + warmed-up IC chains&lt;&#x2F;td&gt;&lt;td&gt;Compiled function body (optimized)&lt;&#x2F;td&gt;&lt;td&gt;Inlined&lt;&#x2F;td&gt;&lt;td&gt;Across opcodes &#x2F; whole function&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;Yes (optimized function body)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h2 id=&quot;can-we-use-ics-in-a-wasi-program&quot;&gt;Can we Use ICs in a WASI Program?&lt;&#x2F;h2&gt;
&lt;p&gt;Given that we have a means to speed up execution beyond that of a
generic interpreter, namely, inline caches (ICs), and given that
SpiderMonkey supports ICs, surely we can simply make use of this
feature in a build of SpiderMonkey for WASI (i.e., when running inside
of a Wasm module)?&lt;&#x2F;p&gt;
&lt;p&gt;Not so fast! There are two basic problems:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;As designed, the IC stubs can only be run as &lt;em&gt;compiled code&lt;&#x2F;em&gt;. Even
the &quot;baseline interpreter&quot; above will invoke a pointer to an IC stub
of machine code compiled with a dedicated IC-stub compiler.&lt;&#x2F;p&gt;
&lt;p&gt;This works well for SpiderMonkey on a native platform -- the fastest
way to implement a fast-path is to produce a purpose-built sequence
of a handful of machine instructions -- but is not compatible with
WebAssembly&#x27;s inability to add new code at runtime that we noted
above. This is because SpiderMonkey only knows what the fast-paths
should be after it starts executing, which is too late to add code
to the Wasm module.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Less fundamental, but still a roadblock: the &quot;baseline interpreter&quot;
in SpiderMonkey is &lt;em&gt;also&lt;&#x2F;em&gt; JIT-compiled, albeit once at JS engine
startup rather than as code is executing. This is more of an
implementation&#x2F;engineering tradeoff, wherein the SpiderMonkey
authors &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2019&#x2F;08&#x2F;the-baseline-interpreter-a-faster-js-interpreter-in-firefox-70&#x2F;&quot;&gt;realized they could reuse the baseline compiler
backend&lt;&#x2F;a&gt;
to cheaply produce a new tier (a brilliant idea!), but again is not
compatible with the WASI environment.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;You might already be thinking: the above two points are not laws of
nature -- nothing says that we can&#x27;t &lt;em&gt;interpret&lt;&#x2F;em&gt; whatever code we
would have JIT-compiled and executed in native SpiderMonkey. And you
would be right: in fact, that&#x27;s the starting point for the Portable
Baseline Interpreter (PBL)!&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#4&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-baseline-without-jit-portable-baseline&quot;&gt;A Baseline without JIT: Portable Baseline&lt;&#x2F;h2&gt;
&lt;p&gt;Here we can now introduce the &lt;em&gt;Portable Baseline Interpreter&lt;&#x2F;em&gt;, or PBL
for short. PBL is a new &lt;em&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;firefox-source-docs.mozilla.org&#x2F;js&#x2F;index.html#javascript-jits&quot;&gt;execution
tier&lt;&#x2F;a&gt;&lt;&#x2F;em&gt;
that replaces the native &quot;baseline interpreter&quot; described above. Its
key distinguishing feature is that it does not require any runtime
code generation (JIT-compilation). Thus, it is suitable for use in a
Wasm&#x2F;WASI program, or in any other environment where runtime codegen
is prohibited.&lt;&#x2F;p&gt;
&lt;p&gt;The key design principle of PBL is to stick as &lt;em&gt;closely as possible&lt;&#x2F;em&gt;
to the other baseline tiers. In SpiderMonkey, significant shared
machinery exists for the (existing) baseline interpreter and baseline
compiler: there is a defined stack layout and execution state, there
is code that understands how to garbage-collect, introspect, and
unwind this state, and there are mechanisms to track the inline caches
associated with baseline execution. PBL&#x27;s goal at a technical level is
to perform exactly the work that the (native) baseline interpreter
would do, except in portable C++ code rather than runtime-generated
code.&lt;&#x2F;p&gt;
&lt;p&gt;To achieve this goal, the two major tasks were:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Implementing a new interpreter loop over JS opcodes. We cannot use
the generic interpreter tier&#x27;s &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;630d2c45fc127a44756e3cca8cef51c35654a4b4&#x2F;js&#x2F;src&#x2F;vm&#x2F;Interpreter.cpp#2179&quot;&gt;main
loop&lt;&#x2F;a&gt;
(what SpiderMonkey calls the &quot;C++ interpreter&quot;), because the actions
for each opcode in that implementation are &quot;generic&quot; -- they do not
use ICs to specialize on types or other kinds of fast-paths -- and
so are not suitable for our purposes. Likewise, we cannot use the
baseline interpreter&#x27;s main loop because it is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;630d2c45fc127a44756e3cca8cef51c35654a4b4&#x2F;js&#x2F;src&#x2F;jit&#x2F;Ion.cpp#143&quot;&gt;generated at
startup&lt;&#x2F;a&gt;
using the JIT backend, and so is not suitable for use in a context
where we can only run portable C++.&lt;&#x2F;p&gt;
&lt;p&gt;Instead, we need to implement a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;gecko-dev&#x2F;blob&#x2F;981fbf34ea6ee1400136cf94ed04e1105adf8799&#x2F;js&#x2F;src&#x2F;vm&#x2F;PortableBaselineInterpret.cpp#L2533&quot;&gt;new interpreter
loop&lt;&#x2F;a&gt;
whose actions for each opcode invoke ICs where appropriate --
exactly the actions that the baseline interpreter does, but written
in portable code. This is superficially &quot;simple&quot;, but turns out to
require careful attention to many subtle details, because
handwritten JIT-compiled code can control some aspects of execution
much more precisely than C++ ordinarily can. (More on this below!)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Implementing an interpreter for
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;630d2c45fc127a44756e3cca8cef51c35654a4b4&#x2F;js&#x2F;src&#x2F;jit&#x2F;CacheIR.h#27&quot;&gt;CacheIR&lt;&#x2F;a&gt;,
the intermediate representation in which the &quot;fast-path code&quot; for IC
stubs is represented. CacheIR opcodes encode the &quot;guards&quot;, or
preconditions necessary for a fast-path to apply, and the actions to
perform. Many CacheIR opcodes are specialized to particular
data structures or runtime state -- it is a heavily custom IR -- but
this tight fit to SpiderMonkey&#x27;s design is exactly what gives it its
ability to concisely encode many fast-paths.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;In principle, developing an interpreter for an IR that already has two
compilers (to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;630d2c45fc127a44756e3cca8cef51c35654a4b4&#x2F;js&#x2F;src&#x2F;jit&#x2F;BaselineCacheIRCompiler.cpp&quot;&gt;machine
code&lt;&#x2F;a&gt;
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;630d2c45fc127a44756e3cca8cef51c35654a4b4&#x2F;js&#x2F;src&#x2F;jit&#x2F;WarpCacheIRTranspiler.cpp&quot;&gt;optimizing compiler IR
(MIR)&lt;&#x2F;a&gt;)
should be relatively straightforward: we transliterate the actions
that the compiled code is performing into a direct C++
implementation. In a system as complex as a JavaScript engine, though,
nothing is ever quite &quot;simple&quot;. Challenges encountered in implementing
the CacheIR interpreter fall into two general categories: aspects of
execution that cannot be directly replicated in C++ code and so need
to be &quot;emulated&quot; in some way; and achieving practical performance by
keeping the &quot;virtual machine&quot; model lightweight and playing some other
tricks too. We&#x27;ll give a few examples of each kind of challenge below.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;challenge-baseline-stack-layout&quot;&gt;Challenge: Baseline Stack Layout&lt;&#x2F;h3&gt;
&lt;p&gt;The first challenge is &lt;em&gt;emulating the stack&lt;&#x2F;em&gt; as
the JIT-compiled code would have managed it. SpiderMonkey&#x27;s baseline
tiers build a series of stack frames with a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;37d9d6f0a77147292a87ab2d7f5906a62644f455&#x2F;js&#x2F;src&#x2F;jit&#x2F;JitFrames.h#37&quot;&gt;carefully-defined
format&lt;&#x2F;a&gt;
that can be traversed for various purposes: finding (and updating)
garbage-collection roots, handling exceptions and unwinding execution,
producing stack backtraces, providing information to the debugger API
and allowing the debugger to update state, and so on.&lt;&#x2F;p&gt;
&lt;p&gt;The format of a single stack frame consists of a &lt;code&gt;JitFrame&lt;&#x2F;code&gt; that looks
a lot like a &quot;normal&quot; function prologue&#x27;s frame -- return address,
previous frame pointer -- but also includes a &quot;callee token&quot; that the
caller pushes, describing the called function at the JS level, and a
&quot;receiver&quot; (the &lt;code&gt;this&lt;&#x2F;code&gt; value in JavaScript). The &lt;code&gt;BaselineFrame&lt;&#x2F;code&gt; below
that records the JS bytecode virtual machine state in a known format,
so it can be introspected: current bytecode PC, current IC slot, and
so on. Below that, the JS bytecode VM&#x27;s operand&#x2F;value stack is
maintained on the real machine stack. And, just before calling any
other function, a &quot;footer&quot; descriptor is pushed: this denotes the kind
of frame that just finished, so it can be handled appropriately.&lt;&#x2F;p&gt;
&lt;p&gt;This format has a very important property: it has &lt;em&gt;no gaps&lt;&#x2F;em&gt;. It is not
simply a linked list of fixed-size descriptor or header structures. If
it were, we could potentially place &lt;code&gt;BaselineFrame&lt;&#x2F;code&gt; &#x2F; &lt;code&gt;JitFrame&lt;&#x2F;code&gt;
instances on the C++ stack in PBL, and link them together with the
previous-FP fields as normal. But this won&#x27;t work, because every
machine word of the baseline-format stack is accounted for.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-baseline-stack-web.svg&quot; alt=&quot;Figure: baseline stack&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This works fine for JIT-compiled code, because we control the code
that is emitted and can maintain whatever stack-format we define.  But
because the C++ compiler owns and manages the machine stack layout
when we are running in C++ code, PBL is not able to maintain the
actual machine stack in this format.&lt;&#x2F;p&gt;
&lt;p&gt;Thus, we instead define an &lt;em&gt;auxiliary stack&lt;&#x2F;em&gt;, build a series of real
baseline frames on it, and maintain this in parallel to the executing
C++ code&#x27;s actual machine stack. When we enter a new frame at the C++
level, we push a new frame on the auxiliary stack; when we return, we
pop a frame.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#5&quot;&gt;5&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; This auxiliary stack is what the garbage collector,
exception unwinder, debugger, and other components introspect: we
store pointers to its frames in the JIT state, and so on. As far as
the rest of the engine is concerned, it is the real stack. The only
major difference is that all return addresses are &lt;code&gt;nullptr&lt;&#x2F;code&gt;s: we don&#x27;t
need them, because we still manage control flow at the C++ level.&lt;&#x2F;p&gt;
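&lt;p&gt;A minimal sketch of the auxiliary-stack bookkeeping, assuming an
invented word-array layout far simpler than the real frame format:&lt;&#x2F;p&gt;

```cpp
// Illustrative sketch: a separately allocated region holds
// baseline-format frames, pushed and popped in lockstep with C++
// interpreter calls and returns. Note the null (0) return-address
// word: control flow stays at the C++ level.
struct AuxStack {
  long long words[256];
  int sp;  // index of the next free slot, growing downward from 256
};

void pushWord(AuxStack* stk, long long w) {
  stk->sp = stk->sp - 1;
  stk->words[stk->sp] = w;
}

// Entering a frame: push a dummy return address and the previous
// frame pointer, as the frame format expects; return the new FP
// (an index into the word array, in this toy model).
int pushFrame(AuxStack* stk, int prevFP) {
  pushWord(stk, 0);       // return address: always null under PBL
  pushWord(stk, prevFP);  // caller's frame pointer
  return stk->sp;
}

// Leaving a frame: restore the caller's FP and discard the frame.
int popFrame(AuxStack* stk, int fp) {
  int prevFP = (int)stk->words[fp];
  stk->sp = fp + 2;       // pop the two header words
  return prevFP;
}
```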
&lt;h3 id=&quot;challenge-unwinding&quot;&gt;Challenge: Unwinding&lt;&#x2F;h3&gt;
&lt;p&gt;A second issue that arises from the differences between a native
machine model and that of PBL is &lt;em&gt;unwinding&lt;&#x2F;em&gt;. In JIT code, where we
have complete control over emitted instructions, the call stack is
just a convention and we are free to skip over frames and jump to any
code location we please. The exception unwinder uses this to great
effect: when an exception is thrown, the runtime walks the stack and
looks for any appropriate handler. This might be several call-frames
up the stack. When one is found, it sets the stack pointer and frame
pointer to within that frame -- effectively popping all deeper frames
in one action -- and jumps directly to the appropriate program counter
in that handler&#x27;s frame. Unfortunately, this is not possible to do
directly in portable C++ code.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#6&quot;&gt;6&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Instead, starting from the invariant that one C++ frame in the PBL
interpreter function &quot;owns&quot; one (or more as an optimization -- see
below) baseline frames, we implement a &lt;em&gt;linear-time&lt;&#x2F;em&gt; unwinding scheme:
each C++ interpreter invocation remembers its &quot;entry frame&quot;; when
unwinding, after an exception or otherwise, we compare the new
frame-pointer value to this entry frame; if &quot;above&quot; (higher in
address, for a downward-growing stack), we return from the interpreter
function with a special code indicating an unwind is happening. The
caller instance of the PBL interpreter function then performs the same
logic, propagating upward until we reach the correct C++ frame. The
following figure contrasts the native and PBL strategies:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-unwind-web.svg&quot; alt=&quot;Figure: native baseline and PBL unwinding&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Thus, we do not have the same asymptotic &lt;code&gt;O(1)&lt;&#x2F;code&gt; unwind-efficiency
guarantee that native baseline execution does, but we remain
completely portable, able to execute anywhere that standard C++ runs.&lt;&#x2F;p&gt;
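&lt;p&gt;The per-invocation decision is simple; a hedged sketch (invented
names, reduced to a frame-pointer comparison) of the check each
interpreter invocation makes when an unwind propagates:&lt;&#x2F;p&gt;

```cpp
// Illustrative sketch of PBL-style linear-time unwinding: each C++
// interpreter invocation remembers the frame pointer of its entry
// frame; on unwind, it keeps returning a sentinel until the
// invocation owning the target frame is reached.
enum class Exec { Ok, Unwind };

struct UnwindState {
  long long targetFP;  // frame pointer of the handler's frame
};

Exec maybePropagate(long long entryFP, UnwindState st) {
  // With a downward-growing stack, a target frame at a higher address
  // than our entry frame belongs to an outer (caller) invocation.
  if (st.targetFP > entryFP) {
    return Exec::Unwind;  // handler is in a caller; return to it
  }
  return Exec::Ok;        // handler is in a frame we own; resume here
}
```

&lt;p&gt;Each invocation that sees &lt;code&gt;Unwind&lt;&#x2F;code&gt; simply returns,
repeating the check one level up until the owning invocation resumes
execution at the handler.&lt;&#x2F;p&gt;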
&lt;h3 id=&quot;challenge-vm-exits&quot;&gt;Challenge: VM exits&lt;&#x2F;h3&gt;
&lt;p&gt;A third issue that often arose was that of &lt;em&gt;emulating VM exits&lt;&#x2F;em&gt;. On
the native baseline platform, when JIT code is executing, the stack is
&quot;under construction&quot;, in a sense: the innermost frame is not complete
(there is no footer descriptor word) and is not reachable from the VM
data structures. JIT code can call back into the runtime only via a
carefully-controlled &quot;VM exit&quot; mechanism, which pushes a special kind
of &quot;exit frame&quot;, records the end of the series of contiguous JIT
frames (the &quot;JIT activation&quot;), and then invokes C++ code:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# JIT code:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  push arg2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  push arg1                # trampoline&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  call VMHelper  -----&amp;gt;    push exit frame&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                           cx-&amp;gt;exitFP = fp&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                           call VMHelperImpl    -----&amp;gt;  walkStack(cx-&amp;gt;exitFP)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                                        doStuff(cx)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                           pop exit frame       &amp;lt;-----  ret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ...            &amp;lt;-----    ret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;which results in a stack layout that looks like:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-vmexit-stack-web.svg&quot; alt=&quot;Figure: the baseline stack after a proper VM exit&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;While executing within the C++ PBL interpreter function, it is very
tempting to simply call into the rest of the runtime as required. This
results in a stack that looks like the below, and unfortunately breaks
in all sorts of exciting and subtle ways: it may appear to work, but
frames are missing and GC roots are not updated after a moving GC; or
if the dangling exit FP is not null, an entirely bogus set of stack
frames may be traced. Either way, various impossible-to-find bugs
arise.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-vmexit-stack-incomplete-web.svg&quot; alt=&quot;Figure: the baseline stack after calling into the runtime without a proper VM exit&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;PBL thus requires extreme discipline in separating &quot;JIT-code mode&quot; (or
its emulation, in a portable C++ interpreter) and &quot;runtime mode&quot;. To
make this distinction clearer, I designed a type-enforced mechanism
that leverages an important idiom in SpiderMonkey: every function that
might perform a GC or otherwise introspect overall VM state will take
a &lt;code&gt;JSContext&lt;&#x2F;code&gt; parameter. In the PBL interpreter function, we hide the
&lt;code&gt;JSContext&lt;&#x2F;code&gt; (rename the local and set it to &lt;code&gt;nullptr&lt;&#x2F;code&gt; normally). We
then have a helper RAII class that pushes an exit frame and does
everything that a &quot;VM exit&quot; trampoline would do, then behaves as a
&lt;em&gt;restricted-scope local&lt;&#x2F;em&gt; that implicitly converts to the true
&lt;code&gt;JSContext&lt;&#x2F;code&gt;. This looks like the below:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;  CASE&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Value arg0 &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; POP&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section z-punctuation&quot;&gt;().&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;asValue&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; POP() is a macro that uses the `sp` local.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Value arg1 &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; POP&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section z-punctuation&quot;&gt;().&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;asValue&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; here, `sp` is the top-of-stack for our in-progress frame in our&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; in-progress activation. We are &amp;quot;in JIT code&amp;quot; from the engine&amp;#39;s&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; perspective, even though this is still C++.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;    {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;      &#x2F;&#x2F; This macro completes the activation and creates a `cx` local&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;      &#x2F;&#x2F; that gives us the JSContext* for use.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;      PUSH_EXIT_FRAME&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;      if&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; (&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;!&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;DoEngineThings&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;cx&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span&gt; arg0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span&gt; arg1&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;        goto&lt;&#x2F;span&gt;&lt;span&gt; error&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;      }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;    }&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;&#x2F; pops the exit frame.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This idiom works fairly well in practice and statically prevents us
from making most kinds of stack-frame-related mistakes.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;optimization-avoiding-function-call-overhead&quot;&gt;Optimization: Avoiding Function-Call Overhead&lt;&#x2F;h3&gt;
&lt;p&gt;At this point, we have introduced techniques to enable PBL to run
correctly; we now have a functioning JavaScript interpreter that can
invoke ICs. (Take a breath and celebrate!) Unfortunately, I arrived at
this point and found that performance still lagged behind that of the
generic interpreter. How could this be, when ICs directly encode
fast-paths and allow us to short-circuit expensive runtime calls?&lt;&#x2F;p&gt;
&lt;p&gt;The first realization came after profiling both a native build of PBL,
and especially a Wasm build: C++ function calls can be
&lt;em&gt;expensive&lt;&#x2F;em&gt;. The basic PBL design consisted of a JS interpreter that
invoked the IC interpreter for every opcode with an IC -- a majority
of them, in most programs (all numeric operators, property accesses,
function calls, and so on!). Function calls are thus extremely
frequent, and their high cost stems from a few basic reasons:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;When the interpreter function is large and has a lot of context
(live variables), register pressure is high; when the called
function is similar, we effectively have a full &quot;context switch&quot;
(spilling all live register values and refilling the registers with
the callee&#x27;s variables) on every call&#x2F;return.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Splitting logic across multiple functions precludes optimizations
that span the logic of both functions. For example, the IC
interpreter &quot;reified control flow as data&quot; by returning an enum
value that the JS interpreter then switched on. Combining the two
functions would allow us to embed the switch-bodies directly where
the return code is set.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;On many Wasm implementations, including
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;wasmtime.dev&#x2F;&quot;&gt;Wasmtime&lt;&#x2F;a&gt; (my VM of choice and the main
optimization target for our WASI port), function prologues have some
extra cost: the generated code needs to check for stack overflow,
and may need to check for interruption or preemption. This is a part
of the cost of sandboxing that can only be avoided by staying within
a single function frame.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Thus, it is very important to avoid function-call overhead whenever
possible. I optimized this in two ways. First, the IC interpreter is
aggressively inlined into the JS interpreter. This produces one
super-interpreter that can run both kinds of bytecode -- JS bytecodes
and CacheIR -- without extra frame setup at every IC site.&lt;&#x2F;p&gt;
&lt;p&gt;Second, and more important in practice, &lt;em&gt;multiple JS frames&lt;&#x2F;em&gt; are
handled by one C++ frame (interpreter invocation). In a technique
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;37d9d6f0a77147292a87ab2d7f5906a62644f455&#x2F;js&#x2F;src&#x2F;vm&#x2F;Interpreter.cpp#3464-3488&quot;&gt;borrowed from SpiderMonkey&#x27;s generic
interpreter&lt;&#x2F;a&gt;,
when certain conditions are met, we
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;gecko-dev&#x2F;blob&#x2F;c60c6d313c9fbbd50fd64e9b46d87bbf01e3dcc9&#x2F;js&#x2F;src&#x2F;vm&#x2F;PortableBaselineInterpret.cpp#L4104-L4196&quot;&gt;handle&lt;&#x2F;a&gt;
a JS call opcode by pushing a JS frame and dispatching directly to the
callee&#x27;s first opcode without any C++-level call. (This may be an
obvious implementation to anyone who has written an interpreter-based
virtual machine before, but disentangling C++ frames and JS frames is
actually not trivial at all, given the prologue&#x2F;epilogue logic --
hence the required conditions!) This interacts in subtle ways with the
unwinding described above: it means that the mapping from JS to C++
frames is 1-to-many, and thus requires some care. (As a silver lining,
however, the logic for a &quot;return&quot; is substantially similar to that for
&quot;unwind&quot;: we can use the same conditions to know when to return at the
C++ level.)&lt;&#x2F;p&gt;
&lt;h3 id=&quot;optimization-hybrid-ics&quot;&gt;Optimization: Hybrid ICs&lt;&#x2F;h3&gt;
&lt;p&gt;Having implemented all of the above techniques, I was still finding
PBL to have somewhat disappointing performance numbers. Fortunately,
one final insight came: perhaps the tradeoffs related to which
operations are profitable to fast-path &lt;em&gt;change&lt;&#x2F;em&gt;, when the cost of the
fast-path mechanism (an IC) itself changes?&lt;&#x2F;p&gt;
&lt;p&gt;For example: in native baseline execution, every arithmetic operator
uses ICs to dispatch to type-specific behavior. The &lt;code&gt;+&lt;&#x2F;code&gt; operator, our
favorite example, has possible fast-paths for integers, floating-point
numbers, strings, and more. This is profitable in &quot;native baseline&quot;
because the cost of an IC is extremely low: the JIT controls register
allocation so it can effectively do global allocation across the
function body and IC stubs by using special-purpose registers and a
custom calling convention, and it can avoid generating any
prologue&#x2F;epilogue in the IC stubs themselves. As a result, ICs can
literally be a handful of instructions: call, check type tag in
registers 0 and 1, integer add, return. PBL, in contrast, is both
emulating virtual-machine state (rather than using an optimized IC
calling convention), and paying the interpreter-dispatch cost for
every IC opcode.&lt;&#x2F;p&gt;
&lt;p&gt;So I ran a simple experiment: in a native PBL build, I added &lt;code&gt;rdtsc&lt;&#x2F;code&gt;
(CPU timestamp counter)-based timing measurements around execution of each
JS opcode both in the generic interpreter and in PBL&#x27;s interpreter
loop, and binned the results by opcode type. The results were
fascinating: property accesses (e.g., &lt;code&gt;GetProp&lt;&#x2F;code&gt;) were significantly
faster with ICs, for example, but many simpler operators, like &lt;code&gt;Add&lt;&#x2F;code&gt;,
were twice as slow.&lt;&#x2F;p&gt;
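The experiment can be sketched roughly as follows. This is a hypothetical, portable stand-in (a steady clock instead of the raw `rdtsc` instruction, and a helper function rather than the macros an interpreter loop would actually use), just to show the binning structure:

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

// Stand-in for `rdtsc`: on x86 one would read the timestamp counter
// directly; a steady clock keeps this sketch portable.
static uint64_t ReadTicks() {
  return static_cast<uint64_t>(
      std::chrono::steady_clock::now().time_since_epoch().count());
}

// Per-opcode accumulators: total ticks and execution count, so mean cost
// per opcode can be compared between two interpreter configurations.
struct OpcodeStats { uint64_t ticks = 0; uint64_t count = 0; };
static std::map<std::string, OpcodeStats> g_bins;

// Wrap the execution of one opcode; an interpreter loop would invoke
// this (or an equivalent macro) around each case body.
template <typename F>
void TimeOpcode(const std::string& opcode, F&& body) {
  uint64_t start = ReadTicks();
  body();
  uint64_t end = ReadTicks();
  g_bins[opcode].ticks += end - start;
  g_bins[opcode].count += 1;
}
```

Binning by opcode type, as above, is what exposes that some opcodes win with ICs while others lose.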
&lt;p&gt;Given this data, I developed the &quot;hybrid ICs&quot; approach, namely:
use ICs only where they help! For the &lt;code&gt;Add&lt;&#x2F;code&gt; operator, the PBL
interpreter now has &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;gecko-dev&#x2F;blob&#x2F;c60c6d313c9fbbd50fd64e9b46d87bbf01e3dcc9&#x2F;js&#x2F;src&#x2F;vm&#x2F;PortableBaselineInterpret.cpp#L2806-L2841&quot;&gt;specific cases for integer and floating-point
addition&lt;&#x2F;a&gt;,
and otherwise invokes the generic interpreter&#x27;s behavior (&lt;code&gt;AddOperation&lt;&#x2F;code&gt;);
it never invokes the IC chain, but rather skips over it entirely. This
behavior is configurable -- with faster IC mechanisms in the future,
we may be able to use ICs for these opcodes again, so the code for
both strategies remains.&lt;&#x2F;p&gt;
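A rough sketch of the hybrid strategy for `Add` (using a hypothetical simplified tagged value; the real `Value` representation and `AddOperation` are considerably more involved):

```cpp
#include <cassert>
#include <cstdint>
#include <variant>

// Hypothetical simplified stand-in for a tagged JS value.
using Value = std::variant<int32_t, double>;

// Stand-in for the engine's generic AddOperation (which also handles
// strings, objects, valueOf, and so on in the real engine).
static Value GenericAdd(Value a, Value b) {
  double x = std::holds_alternative<int32_t>(a) ? std::get<int32_t>(a)
                                                : std::get<double>(a);
  double y = std::holds_alternative<int32_t>(b) ? std::get<int32_t>(b)
                                                : std::get<double>(b);
  return Value{x + y};
}

// Hybrid strategy: inline the common int32/double fast paths directly in
// the interpreter's Add case, skipping the IC chain entirely, and fall
// back to the generic behavior for everything else.
Value HybridAdd(Value a, Value b) {
  if (std::holds_alternative<int32_t>(a) && std::holds_alternative<int32_t>(b)) {
    int64_t r = int64_t{std::get<int32_t>(a)} + int64_t{std::get<int32_t>(b)};
    if (r >= INT32_MIN && r <= INT32_MAX) {
      return Value{static_cast<int32_t>(r)};  // no overflow: stay an int32
    }
    return Value{static_cast<double>(r)};     // overflow: promote to double
  }
  if (std::holds_alternative<double>(a) && std::holds_alternative<double>(b)) {
    return Value{std::get<double>(a) + std::get<double>(b)};
  }
  return GenericAdd(a, b);  // everything else: generic slow path
}
```

The point is that the common int32 and double cases cost only a couple of tag checks inline, while everything else goes through the generic path, and the IC chain is never consulted.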
&lt;p&gt;The results were striking: PBL was finally showing significant
speedups on almost all benchmarks. The final &quot;hybrid IC&quot; set includes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Property accesses.&lt;&#x2F;em&gt; These are extremely common in most JavaScript
code, and can benefit from fast-path behavior whenever objects
usually have the same &quot;shape&quot;, or set of properties, at a given
point. This is because the engine can encode a fast-path that
directly accesses a particular memory offset in the object in
memory, without looking up a property by name.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Calls.&lt;&#x2F;em&gt; This is somewhat less intuitive: for an ordinary call to
another JavaScript function, there is not much an IC can do -- we
just need to update interpreter state to the callee and start
dispatching. But for calls to built-in functions, as described
above, the benefits can be huge: string and array operations, for
example, transform from an expensive call into the runtime (through
several layers of generic JS function-call logic) into just a few
direct field accesses or other operations on a known object type.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Every other JS opcode is executed with generic logic.&lt;&#x2F;p&gt;
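To illustrate why shape-guarded property accesses remain profitable even in an interpreter, here is a minimal sketch (a hypothetical simplified object model, not SpiderMonkey's): the fast path is one pointer comparison plus an indexed load, versus a by-name map lookup on the generic path.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical simplified object model: a Shape maps property names to
// slot indices and is shared by all objects with the same layout.
struct Shape { std::map<std::string, size_t> slots; };

struct Object {
  const Shape* shape;
  std::vector<int64_t> slot_values;
};

// Generic path: look the property up by name through the shape.
int64_t GetPropGeneric(const Object& obj, const std::string& name) {
  return obj.slot_values[obj.shape->slots.at(name)];
}

// A "stub" caches a previously seen shape and the slot offset for the
// property; the fast path does no name lookup at all.
struct GetPropStub { const Shape* guard_shape; size_t slot; };

int64_t GetPropWithIC(const Object& obj, const std::string& name,
                      GetPropStub& stub) {
  if (obj.shape == stub.guard_shape) {
    return obj.slot_values[stub.slot];  // fast path: shape guard passed
  }
  // Miss: fall back to the generic lookup and (re)fill the stub.
  stub = GetPropStub{obj.shape, obj.shape->slots.at(name)};
  return GetPropGeneric(obj, name);
}
```

When objects at a given site usually share a shape, nearly every access takes the guarded fast path.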
&lt;h2 id=&quot;results&quot;&gt;Results&lt;&#x2F;h2&gt;
&lt;p&gt;Enough description -- how well does it perform?&lt;&#x2F;p&gt;
&lt;p&gt;The best test of any language runtime or platform is a &quot;real-world&quot;
use-case, and PBL has been fortunate to see some early adoption, where
two real applications saw wall-clock CPU time reductions of 42% and
17%, respectively, when executing on a Wasm&#x2F;WASI platform. That is
quite significant and exciting, and is motivating adoption and
further development of PBL.&lt;&#x2F;p&gt;
&lt;p&gt;While developing PBL, I did most of my benchmarking with
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;chromium.github.io&#x2F;octane&#x2F;&quot;&gt;Octane&lt;&#x2F;a&gt;, which is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;v8.dev&#x2F;blog&#x2F;retiring-octane&quot;&gt;deprecated&lt;&#x2F;a&gt; but still useful
when hacking on the core of a JS engine (one just needs to give the
appropriate caveats that benchmark speedups will have an uncertain
correlation to real-world speedups). On Octane, PBL currently sees a
1.26x speedup (that is, throughput is 26% higher; or, equivalently,
there is a runtime reduction of &lt;code&gt;1 - 1&#x2F;1.26&lt;&#x2F;code&gt;, or 21%). That is quite
something as well, for a new engine tier that remains completely
portable as a pure interpreter!&lt;&#x2F;p&gt;
&lt;p&gt;Because of these exciting results, and our future plans below, we have
worked with the SpiderMonkey team themselves to plan &lt;em&gt;upstreaming&lt;&#x2F;em&gt; --
incorporating PBL into the main SpiderMonkey tree. This will ease
maintenance because it will allow PBL to be updated and evolved (i.e.,
kept compiling and running) as SpiderMonkey itself does, will allow us
to use SpiderMonkey without a heavy patch-stack on top, and will make
PBL available for others to use as well. We believe it could be useful
beyond the Wasm&#x2F;WASI world: for example, high-security contexts that
disallow JIT could benefit as well. The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bugzilla.mozilla.org&#x2F;show_bug.cgi?id=1855321&quot;&gt;upstreaming
code-review&lt;&#x2F;a&gt; is
in-progress and we look forward to completing it!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;future-compiled-code&quot;&gt;Future: Compiled Code&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;em&gt;Note: this section describes my own thoughts and plans, but goes
beyond what is currently being upstreamed into SpiderMonkey, and is
not necessarily endorsed yet by upstream. My plan and hope is to
develop the ideas to maturity and, if results hold up, propose
additional upstreaming -- but that is further out.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;PBL has an attractive simplicity as a pure interpreter, and has
surprised us with speedups even under that restriction. However, the
larger question, for me at least, has always been: how can we
&lt;em&gt;compile&lt;&#x2F;em&gt; JS ahead-of-time in a performant way?&lt;&#x2F;p&gt;
&lt;p&gt;Recall that the main restriction of the WebAssembly platform is not
that we can&#x27;t generate code at all; it&#x27;s just that all code, no matter
the producer (the traditional Wasm compiler toolchain or our own JS
tools), needs to be generated before any execution occurs.&lt;&#x2F;p&gt;
&lt;p&gt;SpiderMonkey&#x27;s native baseline tiers hint at a way forward here. PBL
as described above is roughly equivalent to the baseline interpreter
(modulo the &lt;em&gt;way&lt;&#x2F;em&gt; that ICs are executed). Can we (i) produce compiled
code for ICs, and (ii) do the equivalent of the baseline &lt;em&gt;compiler&lt;&#x2F;em&gt;,
generating a specialized Wasm function for every JS function body?&lt;&#x2F;p&gt;
&lt;p&gt;In principle, this should be possible without information from
execution, because the type-specific specialization is handled by the
&lt;em&gt;runtime dispatch&lt;&#x2F;em&gt; inherent in the IC chains. In other words,
types are late-binding, so we retain late-binding control flow to
match.&lt;&#x2F;p&gt;
&lt;p&gt;This still requires us to know what &lt;em&gt;possible&lt;&#x2F;em&gt; ICs we might need, but
here we can play a trick: we can collect many IC bodies ahead of time,
and generate straight-line compiled Wasm functions for these IC
bodies. This is more-or-less the trick we described in our post &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;making-javascript-run-fast-on-webassembly&quot;&gt;two
years
ago&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;But all of this still implies the development of a Wasm &lt;em&gt;compiler&lt;&#x2F;em&gt;
backend. How does PBL help us at all? Isn&#x27;t it a dead-end, if we are
eventually able to compile JS source (which we typically have
available ahead-of-time -- performance-critical &lt;code&gt;eval()&lt;&#x2F;code&gt; usage is
rare) straight to specialized Wasm, with late-bound ICs?&lt;&#x2F;p&gt;
&lt;p&gt;The answer to that lies in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Partial_evaluation&quot;&gt;partial
evaluation&lt;&#x2F;a&gt;. Over
the past year I have developed a tool called
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;weval&quot;&gt;weval&lt;&#x2F;a&gt; that takes an interpreter in
a Wasm module, with a few minor intrinsic-call annotations (to specify
what to specialize, and to denote that memory storing bytecode is
&quot;constant&quot; and can be assumed not to self-modify dynamically), and
generates a Wasm module with specialized functions appended. This
gives us a compiler &quot;for free&quot; once we have an interpreter, and PBL
has been designed to be that interpreter.&lt;&#x2F;p&gt;
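As a conceptual illustration of the partial-evaluation step (a toy interpreter and a hand-written residual function; weval's actual interface and output differ), specializing an interpreter on a constant bytecode sequence removes the dispatch loop entirely:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class Op : uint8_t { Add, Mul, Halt };  // hypothetical tiny ISA

// Generic interpreter: dispatches on each opcode at runtime.
int32_t Interpret(const std::vector<Op>& code, int32_t acc, int32_t operand) {
  for (Op op : code) {
    switch (op) {
      case Op::Add: acc += operand; break;
      case Op::Mul: acc *= operand; break;
      case Op::Halt: return acc;
    }
  }
  return acc;
}

// What a specializer conceptually produces for the constant program
// {Add, Mul, Halt}: the loop and switch are gone; only the residual
// straight-line computation remains. (weval derives such functions
// automatically from the interpreter; this is a hand-written stand-in.)
int32_t Specialized_AddMul(int32_t acc, int32_t operand) {
  acc += operand;  // residue of Op::Add
  acc *= operand;  // residue of Op::Mul
  return acc;      // residue of Op::Halt
}
```

The specialized function computes exactly what the interpreter computes on that bytecode, which is what makes the transform semantics-preserving and independently checkable.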
&lt;p&gt;In particular, the JS opcode and IC opcode interpreters in PBL were
designed carefully to work efficiently with weval, and in a next step
to the project (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;gecko-dev&#x2F;tree&#x2F;pbl-weval&quot;&gt;development
branch&lt;&#x2F;a&gt;), I have
the whole thing working. Whereas pure-interpreter PBL got a 1.26x
speedup on Octane, PBL with weval gets a 1.58x speedup, up to 2.4x or
so, with a bunch of low-hanging fruit remaining that will hopefully
push that number further.&lt;&#x2F;p&gt;
&lt;p&gt;This combination isn&#x27;t quite ready for production use yet, but I
continue to polish it, and we hope sometime early next year it will be
ready, taking us to &quot;conceptual parity&quot; (if not engineering
fine-tuning parity!) with SpiderMonkey&#x27;s native baseline compiler. We
have some more thoughts on going beyond that -- inlining ICs like
WarpMonkey does, hoisting guards, and all the rest -- but more on that
in due time.&lt;&#x2F;p&gt;
&lt;p&gt;Given all of that, one could compare PBL and PBL+weval to
SpiderMonkey&#x27;s existing tiers. Recall our table above:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Data Required&lt;&#x2F;th&gt;&lt;th&gt;JS opcode dispatch&lt;&#x2F;th&gt;&lt;th&gt;ICs&lt;&#x2F;th&gt;&lt;th&gt;Optimization scope&lt;&#x2F;th&gt;&lt;th&gt;CacheIR dispatch&lt;&#x2F;th&gt;&lt;th&gt;Codegen at runtime&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Generic (C++) interpreter&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode&lt;&#x2F;td&gt;&lt;td&gt;Interpreter (C++)&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;td&gt;N&#x2F;A&lt;&#x2F;td&gt;&lt;td&gt;No&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Baseline interpreter&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures&lt;&#x2F;td&gt;&lt;td&gt;Interpreter (generated at startup)&lt;&#x2F;td&gt;&lt;td&gt;Dynamic dispatch&lt;&#x2F;td&gt;&lt;td&gt;Special cases within one opcode via IC&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;Yes (IC bodies, interp body)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Baseline compiler&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures&lt;&#x2F;td&gt;&lt;td&gt;Compiled function body (1:1 with bytecode)&lt;&#x2F;td&gt;&lt;td&gt;Dynamic dispatch&lt;&#x2F;td&gt;&lt;td&gt;Special cases within one opcode via IC&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;Yes (IC bodies, function bodies)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Optimizing compiler (WarpMonkey)&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + warmed-up IC chains&lt;&#x2F;td&gt;&lt;td&gt;Compiled function body (optimized)&lt;&#x2F;td&gt;&lt;td&gt;Inlined&lt;&#x2F;td&gt;&lt;td&gt;Across opcodes &#x2F; whole function&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;Yes (optimized function body)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;To which we could add the row:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Data Required&lt;&#x2F;th&gt;&lt;th&gt;JS opcode dispatch&lt;&#x2F;th&gt;&lt;th&gt;ICs&lt;&#x2F;th&gt;&lt;th&gt;Optimization scope&lt;&#x2F;th&gt;&lt;th&gt;CacheIR dispatch&lt;&#x2F;th&gt;&lt;th&gt;Codegen at runtime&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;PBL (interpreter)&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures&lt;&#x2F;td&gt;&lt;td&gt;Interpreter (C++)&lt;&#x2F;td&gt;&lt;td&gt;Dynamic dispatch&lt;&#x2F;td&gt;&lt;td&gt;Special cases within one opcode via IC&lt;&#x2F;td&gt;&lt;td&gt;Interpreter (C++)&lt;&#x2F;td&gt;&lt;td&gt;No&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;And then, with weval and pre-collected ICs (but no profiling of the JS code!), we could have:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Data Required&lt;&#x2F;th&gt;&lt;th&gt;JS opcode dispatch&lt;&#x2F;th&gt;&lt;th&gt;ICs&lt;&#x2F;th&gt;&lt;th&gt;Optimization scope&lt;&#x2F;th&gt;&lt;th&gt;CacheIR dispatch&lt;&#x2F;th&gt;&lt;th&gt;Codegen at runtime&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;PBL (wevaled)&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures&lt;&#x2F;td&gt;&lt;td&gt;Compiled function body (1:1 with bytecode)&lt;&#x2F;td&gt;&lt;td&gt;Dynamic dispatch&lt;&#x2F;td&gt;&lt;td&gt;Special cases within one opcode via IC&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;No (!!)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;which one will note is identical to the baseline-compiler row above,
except that no runtime codegen is required.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, if we have reliable profiling information, such as from a
profiling run at build time, we could use this profile (just as one
does in a standard C&#x2F;C++ &quot;PGO&quot; or &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Profile-guided_optimization&quot;&gt;profile-guided
optimization&lt;&#x2F;a&gt;&quot;
build) to &lt;em&gt;inline&lt;&#x2F;em&gt; the ICs. Note that this could be done in a way that
is &lt;em&gt;completely agnostic to the underlying interpreter&lt;&#x2F;em&gt;, because IC
invocations are just indirect calls: that is, it is also a
semantics-preserving, independently-verifiable transform. Having done
that, we would then have:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Data Required&lt;&#x2F;th&gt;&lt;th&gt;JS opcode dispatch&lt;&#x2F;th&gt;&lt;th&gt;ICs&lt;&#x2F;th&gt;&lt;th&gt;Optimization scope&lt;&#x2F;th&gt;&lt;th&gt;CacheIR dispatch&lt;&#x2F;th&gt;&lt;th&gt;Codegen at runtime&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;PBL (wevaled + inlined ICs)&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures + warmed-up IC chains&lt;&#x2F;td&gt;&lt;td&gt;Compiled function body (optimized)&lt;&#x2F;td&gt;&lt;td&gt;Inlined&lt;&#x2F;td&gt;&lt;td&gt;Across opcodes &#x2F; whole function&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;No (!!)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;which approximates WarpMonkey. Note that this will require significant
additional engineering -- SpiderMonkey&#x27;s native JITs, after all,
embody engineer-centuries of effort (much of which we leverage by
reusing its well-tuned CacheIR sequences, but much of which we can&#x27;t) --
but is a clear path to allow for optimized JS without runtime code
generation.&lt;&#x2F;p&gt;
&lt;p&gt;The thing that excites me most about this direction is that it is, in
some sense, &quot;deriving a JIT from scratch&quot;: we are writing down the
semantics of the opcodes, and we&#x27;re explicitly extracting fast-paths,
but we&#x27;re using semantics-preserving tools to go beyond that. (Weval&#x27;s
semantics are that it provides a function pointer to a specialized
function that behaves identically to the original.) That allows us to
decouple the correctness aspects of our work from performance, mostly,
and makes life far simpler -- no more insidious JIT bugs, or
divergence between the interpreter and compiler tiers.  More to come!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Many, many thanks to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lukewagner&quot;&gt;Luke Wagner&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;fitzgeraldnick.com&#x2F;&quot;&gt;Nick
Fitzgerald&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;elliottt&quot;&gt;Trevor
Elliott&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;jamey.thesharps.us&#x2F;&quot;&gt;Jamey
Sharp&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;tia.mat.br&#x2F;&quot;&gt;L Pereira&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;michelledaviest.github.io&#x2F;&quot;&gt;Michelle Thalakottur&lt;&#x2F;a&gt;, and others with
whom I&#x27;ve discussed these ideas over the past several years. Thanks to Luke,
Nick, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;linclark&quot;&gt;Lin Clark&lt;&#x2F;a&gt;, and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.mgaudet.ca&#x2F;&quot;&gt;Matt
Gaudet&lt;&#x2F;a&gt; for feedback on this post. Thanks also to
Trevor Elliott and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;JakeChampion&quot;&gt;Jake Champion&lt;&#x2F;a&gt; for help in
getting PBL integrated with other infrastructure, Jamey Sharp for ramping up
efforts to fill out PBL&#x27;s CacheIR opcode support, and the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;spidermonkey.dev&#x2F;&quot;&gt;Mozilla SpiderMonkey
team&lt;&#x2F;a&gt; for graciously hearing our ideas and agreeing
to upstream this work.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;WebAssembly running in a browser could implement runtime code
generation and loading by calling out to external JavaScript. In
essence, it would generate a new Wasm module in memory, then
call JS to instantiate that module into an instance that shares
the current linear memory, and call into it. However, this is
fundamentally a feature of the Web platform and not built-in to
Wasm; and many Wasm platforms, especially those designed with
security among untrusted co-tenants in mind, do not allow this
and strictly enforce ahead-of-time compilation of fixed code
instead.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;Honestly, even after &lt;em&gt;writing a new interpreter tier&lt;&#x2F;em&gt; for
SpiderMonkey, I couldn&#x27;t tell you the answer to this. (I run the
bytecode, I don&#x27;t lower to it!) The language&#x27;s semantics are
something to marvel at, in the edge cases, and this is all the
more reason to centralize on a few well-tested, well-engineered
shared implementations.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;Note that there are some details here omitted for
simplicity. Most importantly, once inlining ICs, we need to
&lt;em&gt;hoist the guards&lt;&#x2F;em&gt;, or the conditions that exist at the
beginning of every IC, so that they are shared in common (or
removed entirely, if we can prove they will always
succeed). Consider a function that operates on floating-point
numbers: every IC will be some form of &quot;check if inputs are
floats, do operation, tag result as float&quot; but instead we could
check that the function arguments are floats &lt;em&gt;once&lt;&#x2F;em&gt;, then
propagate from &quot;produce float&quot; in one IC to &quot;check if float&quot; in
the next. ICs &lt;em&gt;enable&lt;&#x2F;em&gt; this by expressing the necessary
preconditions (checks) and postconditions (produced values and
their types) for each operator, and inlining is necessary as
well because it places everything in one code-path so it is in
scope to be cross-optimized; but guard-hoisting and -elimination
are JIT-compiler-specific optimizations.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;4&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;One can see this idea as a variant of the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;publications.sba-research.org&#x2F;publications&#x2F;dls10.pdf&quot;&gt;interpreter
quickening&lt;&#x2F;a&gt;
idea, in a way: the CacheIR sequences are a shorter or more
efficient implementation of particular behavior that we rewrite
the interpreted program to use (via the pluggable IC sites) as
we learn more about its execution.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;5&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;5&lt;&#x2F;sup&gt;
&lt;p&gt;The correspondence isn&#x27;t actually 1-to-1, unfortunately (that
would have been much simpler!): instead, we sometimes push an
additional frame for VM exits, and we also handle some calls
&quot;inline&quot;, pushing the frame and going right to the top of the
dispatch loop again. The actual invariant is that every
auxiliary stack frame is &quot;owned&quot; by one C++ function invocation,
but there may be several such frames. It is thus a 1-to-many
relationship.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;6&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;6&lt;&#x2F;sup&gt;
&lt;p&gt;Strictly speaking, we could have used &lt;code&gt;setjmp()&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;longjmp()&lt;&#x2F;code&gt; to
implement similar constant-time unwinding. However, this
interacts poorly with C++ destructors, and is also problematic
-- that is, does not exist -- on WebAssembly with
WASI. Eventually the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;WebAssembly&#x2F;exception-handling&#x2F;blob&#x2F;main&#x2F;proposals&#x2F;exception-handling&#x2F;Exceptions.md&quot;&gt;exception handling
proposal&lt;&#x2F;a&gt;
for Wasm may be directly usable for this purpose, but it is not
finalized yet.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Cranelift&#x27;s Instruction Selector DSL, ISLE: Term-Rewriting Made Practical</title>
        <published>2023-01-20T00:00:00+00:00</published>
        <updated>2023-01-20T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2023/01-20/cranelift-isle/"/>
        <id>https://cfallin.org/blog/2023/01-20/cranelift-isle/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2023/01-20/cranelift-isle/">&lt;p&gt;Today I&#x27;m going to be writing about &lt;strong&gt;ISLE&lt;&#x2F;strong&gt;, or the &quot;instruction
selection&#x2F;lowering expressions&quot; domain-specific language
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Domain_specific_language&quot;&gt;DSL&lt;&#x2F;a&gt;), which
over the past year we have designed, improved, and fully adopted in
the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&#x2F;&quot;&gt;Cranelift&lt;&#x2F;a&gt;
compiler
project. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&#x2F;isle&quot;&gt;ISLE&lt;&#x2F;a&gt;
is now used to express both our instruction-lowering patterns for each
of four target architectures and our machine-independent optimizing
rewrites. It allows us to develop these parts of the compiler in an
extremely productive way: we can write the key idea -- that one opcode
or instruction should map to another -- in a concise way, while
maintaining type-safety with an expressive type system, and allowing
us to use the declarative patterns for many different purposes.&lt;&#x2F;p&gt;
&lt;p&gt;The goal of this blog post is to illustrate the requirements and the
design-space that led to ISLE&#x27;s key ideas, and especially its
departures from other &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Rewriting#Term_rewriting_systems&quot;&gt;term-rewriting
systems&lt;&#x2F;a&gt;&quot;
and blending of ideas from backtracking languages like
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Prolog&quot;&gt;Prolog&lt;&#x2F;a&gt;. In particular, we&#x27;ll
talk about how ISLE has a strong type system, with terms of distinct
types (as opposed to one &quot;value&quot; type); how it matches on abstract
&quot;extractors&quot; provided by an embedding environment, which effectively
define a virtual &quot;input term&quot; without ever reifying it; and how it was
explicitly designed to have a simple
&quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Foreign_function_interface&quot;&gt;FFI&lt;&#x2F;a&gt;&quot; with
Rust for easy-to-understand, no-magic interactions with the rest of
the compiler. All of these properties allowed us to chart an
incremental path from a fully handwritten compiler backend to one with
all lowering logic in the DSL (in 27k lines of DSL code), with a
relatively low defect rate over the year-long migration and with
significant correctness improvements along the way.&lt;&#x2F;p&gt;
&lt;p&gt;The ISLE project was also a really interesting moment in my career
personally: it was &lt;em&gt;very much&lt;&#x2F;em&gt; a &quot;research project&quot;, in that it
required synthesis of existing approaches and careful thought about
the domain and requirements, and invention of slightly new takes on
old ideas in order to make the whole thing practical. At the same
time, there was quite a lot of work put into the &lt;em&gt;incrementalist&lt;&#x2F;em&gt; and
&lt;em&gt;pragmatic&lt;&#x2F;em&gt; aspect of the design (something I also &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;blog&#x2F;2022&#x2F;06&#x2F;09&#x2F;cranelift-regalloc2&#x2F;#compatibility-and-migration-path&quot;&gt;talk about in an
earlier post on
regalloc2&lt;&#x2F;a&gt;),
and a lot of care and feeding to a 12-month migration effort, curation
of our understanding of the language and how to use it, and nurturing
of ongoing ideas for improvement. I feel pretty fortunate to have (i)
been given the space to wrestle the problem space down to its
essential kernel and find a working design, and (ii) a cohort of
really great coworkers who ran with it and made it real. (Especially
thanks to my brilliant colleague &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;fitzgeraldnick.com&#x2F;&quot;&gt;Nick Fitzgerald
(@fitzgen)&lt;&#x2F;a&gt; who completed the Cranelift
integration of ISLE and who pioneered the idea of rewrite DSLs in
Cranelift with his
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;dba74024aa412f284871375db292c1bf9079d769&#x2F;cranelift&#x2F;peepmatic&quot;&gt;Peepmatic&lt;&#x2F;a&gt;
project, which inspired many parts of ISLE and primed the project for
this effort.) I wrote more about the benefits we&#x27;ve seen so far from
ISLE, and anticipate to see,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;cranelift-progress-2022#isle-dsl-for-backends-and-mid-end&quot;&gt;here&lt;&#x2F;a&gt;;
suffice it to say that we&#x27;re happy we&#x27;ve gone this way and we look
forward to additional results that this work has enabled.&lt;&#x2F;p&gt;
&lt;p&gt;This post covers material and repeats some arguments I made early in
ISLE&#x27;s
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;rfcs&#x2F;blob&#x2F;cranelift-isel-pre-rfc&#x2F;accepted&#x2F;cranelift-isel-pre-rfc.md&quot;&gt;pre-RFC&lt;&#x2F;a&gt;
and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;cranelift-isel-isle-peepmatic.md&quot;&gt;RFC&lt;&#x2F;a&gt;;
I recommend reading those documents as well for further background, if
desired. The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;isle&#x2F;docs&#x2F;language-reference.md&quot;&gt;ISLE language
reference&lt;&#x2F;a&gt;
is the canonical definition of the DSL.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s first go through some background on Cranelift, on compiler DSLs
in general, and motivate the case for a DSL in Cranelift; then we&#x27;ll
get into the details of ISLE proper.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;context-new-cranelift-backends-and-handwritten-code&quot;&gt;Context: New Cranelift Backends and Handwritten Code&lt;&#x2F;h2&gt;
&lt;p&gt;In the Cranelift project, starting in 2020, we developed a new
framework for the machine backends -- the part of the compiler that
takes the final optimized machine-independent &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Intermediate_representation&quot;&gt;IR (intermediate
representation)&lt;&#x2F;a&gt;
and converts it to instructions for the target instruction set,
allocates registers, lowers control flow, and emits machine code.  (I
described more details in an earlier post series:
&lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&#x2F;&quot;&gt;first&lt;&#x2F;a&gt;,
&lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;01&#x2F;22&#x2F;cranelift-isel-2&#x2F;&quot;&gt;second&lt;&#x2F;a&gt;,
&lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;03&#x2F;15&#x2F;cranelift-isel-3&#x2F;&quot;&gt;third&lt;&#x2F;a&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;Prior to introducing ISLE, we built three backends in this framework:
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;aarch64&quot;&gt;AArch64&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;x64&quot;&gt;x86-64&lt;&#x2F;a&gt;,
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;s390x&quot;&gt;s390x (IBM
Z)&lt;&#x2F;a&gt;. The
general experience was quite positive -- the simplicity that we aimed
to achieve by focusing on a &quot;straightforward handwritten lowering
pass&quot; design allowed us to quickly implement quite complete support
for WebAssembly (core and SIMD) and support other users of Cranelift
(like
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bjorn3&#x2F;rustc_codegen_cranelift&quot;&gt;cg_clif&lt;&#x2F;a&gt;). In
March 2021, we &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;2718&quot;&gt;made the new x86-64 backend the
default&lt;&#x2F;a&gt;, and
at the end of September 2021, we were able to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;3009&quot;&gt;remove the old x86
backend&lt;&#x2F;a&gt; with
its legacy, complex, and generally slower framework.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-need-for-a-dsl&quot;&gt;The Need for a DSL&lt;&#x2F;h2&gt;
&lt;p&gt;However, as with many engineering problems, there is a tradeoff point
in the design space. The simplicity of the &quot;just write the &lt;code&gt;match&lt;&#x2F;code&gt; on
the opcode and emit the instructions&quot;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;cranelift-isel-isle-peepmatic.md&quot;&gt;approach&lt;&#x2F;a&gt;
eventually became a downside: we had an
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;cranelift-isel-isle-peepmatic.md&quot;&gt;increasingly-deeply-nested&lt;&#x2F;a&gt;
lowering function with many conditions matching on types,
sub-expressions, and special cases, and keeping track of it all became
more and more difficult. Ultimately, we found we were running into
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;rfcs&#x2F;blob&#x2F;cranelift-isel-pre-rfc&#x2F;accepted&#x2F;cranelift-isel-pre-rfc.md#downsides-as-complexity-increases&quot;&gt;three major
problems&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;It became very &lt;em&gt;tedious&lt;&#x2F;em&gt; to write what was essentially a longhand
form of a list of lowering patterns. If adding a special lowering
for a combination of IR operators requires understanding control in
handwritten code, and writing out the checks for special conditions
or searches for other combining operators by hand, then we are much
less likely to improve the compiler backends: the incentives
instead point toward keeping the code as simple and as minimalistic
as possible and discouraging change.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;It became difficult to refactor the code at all: the compiler
&quot;lowering API&quot; was ossifying as more and more handwritten backend
code came to depend on its subtle details, and refactors became
very hard or impossible. This was especially apparent with the
regalloc2 work.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;It became more and more difficult to maintain correctness, and
ensure that the backends were generating the code we
expected. While writing code against the lowering API, one had to
keep in mind the subtle correctness invariants for when one can
&quot;combine&quot; instructions, when one can &quot;sink&quot; an operation, and so
on; and the rules for how to use registers and temporaries
properly. Even when generating some kind of correct code, it was
easy to miss a corner of the state space and, say, omit a lowering
for a particular combination of input types, or skip an intended
optimized lowering and use a general one instead, in an accidental
and hard-to-reason-about way due to complex control flow. As we&#x27;ll
see below, a DSL allows us to solve both of these problems with (i)
principled strongly-typed abstractions in the DSL, and (ii)
&quot;overlap checks&quot;, respectively.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;It became clear that generating lowering patterns from some
meta-description would lead to overall clearer and more maintainable
compiler source, and would give us more flexibility if we wanted to
change any details of or optimize the translation, as well. Hence, our
realization that we probably needed a &lt;em&gt;domain-specific language&lt;&#x2F;em&gt; (DSL)
to generate this part of Cranelift.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;dsls-in-compilers-and-term-rewriting&quot;&gt;DSLs in Compilers and Term-Rewriting&lt;&#x2F;h2&gt;
&lt;p&gt;There is a long history of DSLs to specify compilers -- the
&quot;metacompiler&quot; or &quot;compiler-compiler&quot; concept -- going back to at
least the 1970s. The general idea of a DSL-based instruction selection
stage is to declaratively describe a list of &lt;em&gt;patterns&lt;&#x2F;em&gt; --
combinations of operators in the program -- and for each pattern, when
it matches, a series of instructions that can implement that
pattern. This makes it easier to reason about what the compiler is
doing, to modify and improve it, and to apply systematic optimizations
across the backend by changing how the DSL is used to generate the
compiler backend itself.&lt;&#x2F;p&gt;
&lt;p&gt;The &quot;pattern rewriting&quot; approach to a compiler backend can be seen as
a kind of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Rewriting#Term_rewriting_systems&quot;&gt;term-rewriting
system&lt;&#x2F;a&gt;:
that is, a formal framework in which rules operate on a data
representation (in this case, the program to be
compiled). Term-rewriting is an old idea: the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lambda_calculus&quot;&gt;lambda
calculus&lt;&#x2F;a&gt;, one of the
original mathematical models of computation, operates via
term-rewriting. It is general enough to be useful yet also concise and
expressive enough to be very productive when the goal is to transform
structured data.&lt;&#x2F;p&gt;
&lt;p&gt;One reason why term-rewriting is such a good fit for a compiler in
particular is that &quot;terms&quot;, or nodes in an AST, represent values in
the program being compiled; given this, a rewrite rule is an
expression of an &lt;em&gt;equivalence&lt;&#x2F;em&gt;. An &quot;integer addition&quot; operator in a
compiler IR is &lt;em&gt;equivalent&lt;&#x2F;em&gt; to (or produces an equivalent result to)
the integer addition instruction in a given CPU&#x27;s instruction set; so
we can replace one with the other. One might write this rule as
something like &lt;code&gt;(add x y) =&amp;gt; (x86_add x y)&lt;&#x2F;code&gt;, for example. Likewise,
many compiler optimizations can be expressed as rules, in a way that
is familiar to any student of algebra: for example, &lt;code&gt;x + 0 == x&lt;&#x2F;code&gt;, or
in an AST notation, &lt;code&gt;(add x 0) =&amp;gt; x&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
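&lt;p&gt;As a concrete (if toy) illustration -- this is a minimal sketch, not Cranelift&#x27;s actual IR -- the &lt;code&gt;(add x 0) =&amp;gt; x&lt;&#x2F;code&gt; rule can be written as a match on a small AST type:&lt;&#x2F;p&gt;

```rust
// A toy AST and a single rewrite rule applied at the root.
// Illustrative sketch only, not Cranelift's IR.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Const(i64),
    Add(Box<Expr>, Box<Expr>),
}

// (add x 0) => x: return the left operand when the right is zero.
fn simplify_add_zero(e: Expr) -> Expr {
    match e {
        Expr::Add(x, y) if *y == Expr::Const(0) => *x,
        other => other,
    }
}

fn main() {
    let e = Expr::Add(Box::new(Expr::Const(42)), Box::new(Expr::Const(0)));
    assert_eq!(simplify_add_zero(e), Expr::Const(42));
}
```

&lt;p&gt;The rule is an equivalence, so applying it preserves the value being computed while (here) shrinking the term.&lt;&#x2F;p&gt;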
&lt;p&gt;Examples of such pattern-rules abound in production compilers. For
example, in the Go compiler, a set of
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;golang&#x2F;go&#x2F;blob&#x2F;e870de9936a7efa42ac1915ff4ffb16017dbc819&#x2F;src&#x2F;cmd&#x2F;compile&#x2F;internal&#x2F;ssa&#x2F;_gen&#x2F;AMD64.rules&quot;&gt;rules&lt;&#x2F;a&gt;
define how the IR&#x27;s operators are converted into x86-64 instructions:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Lowering arithmetic&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(Add(64|32|16|8) ...) =&amp;gt; (ADD(Q|L|L|L) ...)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; combine add&#x2F;shift into LEAQ&#x2F;LEAL&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(ADD(L|Q) x (SHL(L|Q)const [3] y)) =&amp;gt; (LEA(L|Q)8 x y)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Merge load and op&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;((ADD|SUB|AND|OR|XOR)Q x l:(MOVQload [off] {sym} ptr mem)) &amp;amp;&amp;amp;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    canMergeLoadClobber(v, l, x) &amp;amp;&amp;amp;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    clobber(l) =&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;((ADD|SUB|AND|OR|XOR)Qload x [off] {sym} ptr mem)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Similar kinds of descriptions exist in the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm-mirror&#x2F;llvm&#x2F;blob&#x2F;2c4ca6832fa6b306ee6a7010bfb80a3f2596f824&#x2F;lib&#x2F;Target&#x2F;X86&#x2F;X86InstrArithmetic.td&quot;&gt;LLVM x86
backend&lt;&#x2F;a&gt;
and the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;gcc-mirror&#x2F;gcc&#x2F;blob&#x2F;31ec203247413f150d5244198efd586fc6d2ef5e&#x2F;gcc&#x2F;config&#x2F;i386&#x2F;i386.md&quot;&gt;GCC x86
backend&lt;&#x2F;a&gt;,
using the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;TableGen&#x2F;&quot;&gt;TableGen&lt;&#x2F;a&gt; language (LLVM)
and the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;gcc.gnu.org&#x2F;onlinedocs&#x2F;gcc-4.3.2&#x2F;gccint&#x2F;Machine-Desc.html#Machine-Desc&quot;&gt;Machine Description
DSL&lt;&#x2F;a&gt;
(gcc), respectively.&lt;&#x2F;p&gt;
&lt;p&gt;There is a large and well-explored design-space for auto-generated
compiler backends from such rules. A classical design is the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;BURS&quot;&gt;BURS&lt;&#x2F;a&gt; (bottom-up rewrite system)
technique. I won&#x27;t attempt a deeper introduction here; further
descriptions can be found in e.g. the Dragon Book&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; or Muchnick&#x27;s
textbook&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;. For this post, it suffices to know that these systems
find a &quot;covering&quot; of tree patterns such as the above over an input
expression tree.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;term-rewriting-for-lowering-maybe&quot;&gt;Term Rewriting for Lowering... Maybe?&lt;&#x2F;h2&gt;
&lt;p&gt;Given the above precedent -- several mainstream compilers adopting a
pattern-matching-based scheme, with clear benefits -- it would seem
that our path ahead is well-defined. Why, then, is there so much of
this post remaining? What more could be said?&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that three main problems arise when considering how to
adopt a typical term-rewriting scheme. First, there is a basic
question: do we actually reify the input tree as a tree? For example,
if we have a pattern&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(add x y) =&amp;gt; (x86_add x y)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;meaning that an &lt;code&gt;add&lt;&#x2F;code&gt; operator on two operands is lowered to an x86
&lt;code&gt;ADD&lt;&#x2F;code&gt; instruction, does that imply that our IR literally contains a
node for the &lt;code&gt;add&lt;&#x2F;code&gt;?&lt;&#x2F;p&gt;
&lt;p&gt;That may seem like a silly question to raise, as CLIF, Cranelift&#x27;s IR,
does indeed have an operator for &lt;code&gt;add&lt;&#x2F;code&gt; (actually &lt;code&gt;iadd&lt;&#x2F;code&gt; for &quot;integer
add&quot;) that takes two arguments.&lt;&#x2F;p&gt;
&lt;p&gt;But directly matching on a &quot;tree of operators&quot; implies several
properties of the IR that have deep impact. One of these properties is
that the &quot;value&quot; or &quot;result&quot; of the operator is its unique
identifier. In CLIF, this isn&#x27;t the case: each instruction has its own
identifier (&lt;code&gt;Inst&lt;&#x2F;code&gt;) and each &lt;code&gt;Inst&lt;&#x2F;code&gt; can have any number of &lt;code&gt;Value&lt;&#x2F;code&gt;
results. Another is that it implies that the tree that the matching
process &lt;em&gt;should&lt;&#x2F;em&gt; see is exactly what is in the IR. However, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;ff995d910b64ea7937ccfd982dd431b1487a1ec8&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;lower.rs#L1216-L1236&quot;&gt;quite a
few
considerations&lt;&#x2F;a&gt;
are involved when a backend wants to &quot;merge&quot; the handling of operators
by looking deeper into the tree. None of these impedance mismatches
are fatal to the approach, but they do imply extra work to build the
tree in memory &quot;as matched&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;Second, there is the problem of how to incorporate &lt;em&gt;additional
information&lt;&#x2F;em&gt; beyond the raw tree of operators. One could see this as a
question of &quot;side-tables&quot; or &quot;auxiliary information&quot;, or of supporting
various &quot;queries&quot; on the input tree.&lt;&#x2F;p&gt;
&lt;p&gt;For example, we might want to represent the ability to encode an
integer immediate in certain ISA-specific forms as a term. AArch64 has
several such forms: a &quot;regular&quot; 12-bit immediate, and a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dinfuehr.github.io&#x2F;blog&#x2F;encoding-of-immediate-values-on-aarch64&#x2F;&quot;&gt;rather clever
&quot;logical
immediate&quot;&lt;&#x2F;a&gt;
format designed to efficiently encode common kinds of bitmasks in only
13 bits. We might represent these with terms, and have rules that
translate &lt;code&gt;(iconst K)&lt;&#x2F;code&gt; into &lt;code&gt;(aarch64_imm12 bits)&lt;&#x2F;code&gt; or
&lt;code&gt;(aarch64_logicalimm bits)&lt;&#x2F;code&gt; and subsequent rules that match on these
terms to encode immediate-using instruction forms. The problem then
comes: how do we know which of these intermediate rewrites to do
before we attempt to match any instruction forms? Do we do both, and
represent both forms?&lt;&#x2F;p&gt;
&lt;p&gt;The net effect of this requirement is that the matching pattern for a
rewrite rule starts to look less like a tree of terms and more like a
sequence of custom queries. The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;golang&#x2F;go&#x2F;blob&#x2F;e870de9936a7efa42ac1915ff4ffb16017dbc819&#x2F;src&#x2F;cmd&#x2F;compile&#x2F;internal&#x2F;ssa&#x2F;_gen&#x2F;AMD64.rules&quot;&gt;Go compiler&#x27;s
rules&lt;&#x2F;a&gt;
add predicates to rewrite rules to handle these cases, but this is
awkward and makes the language harder to reason about. It would be
better if we could represent the ISA concepts at the term-rewrite
level as well.&lt;&#x2F;p&gt;
&lt;p&gt;Third, there is the question of how to interact with the rest of the
compiler as we make these queries on the input representation. In the
most straightforward implementation, a rewrite system has knowledge of
the &quot;tree nodes&quot; that terms in the pattern match and that terms in the
rewrite expression produce. But building the glue between the rewrite
system and the IR data structures may be nontrivial, especially if
custom queries (as above) are also involved.&lt;&#x2F;p&gt;
&lt;p&gt;All of this raises the question: is there a better way to think about
the execution semantics of the rewrite rules? What if the DSL were not
involved in ASTs or rewrites in a direct way at all?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;sequential-execution-semantics-and-external-extractors-constructors&quot;&gt;Sequential Execution Semantics and &quot;External Extractors&#x2F;Constructors&quot;&lt;&#x2F;h2&gt;
&lt;p&gt;At this point, I hit upon ISLE&#x27;s first key idea: what if all
interactions with the rest of the compiler were via &quot;virtual&quot; terms,
implemented by Rust functions? In other words, rather than build a
system that matches on literal AST data structures and rewrites or
produces new output ASTs, all of the pattern matching would &quot;bottom
out&quot; in a sort of FFI that would invoke the rest of the compiler, in
handwritten Rust. The DSL itself knows nothing about the rest of the
compiler, or &quot;ASTs&quot;, or any other IR-specific concept. (This is ISLE&#x27;s
main secret: it is not actually an instruction-selection DSL, but
rather (or at least aspirationally) a more general language.)&lt;&#x2F;p&gt;
&lt;h3 id=&quot;sequential-semantics-for-matching&quot;&gt;Sequential Semantics for Matching&lt;&#x2F;h3&gt;
&lt;p&gt;One could imagine a rule like &lt;code&gt;(iadd (imul a b) c) =&amp;gt; (aarch64_madd a b c)&lt;&#x2F;code&gt; to &quot;compile&quot; to a series of &quot;match operations&quot; like the
following invented operations for some matching engine:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$t0, $t1 := match_op $root, iadd&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$t2, $t3 := match_op $t0, imul&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$t4 := create_op aarch64_madd, $t2, $t3, $t1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return $t4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this &quot;matching VM&quot;, we execute &lt;code&gt;match_op&lt;&#x2F;code&gt; operations by trying to
unpack a tree node into its children (arguments), given the expected
operator. Any step in this sequence of match operators might &quot;fail&quot;,
which causes us to try the next rewrite rule instead. If we can match
the &lt;code&gt;iadd&lt;&#x2F;code&gt; from the input tree root, and the &lt;code&gt;imul&lt;&#x2F;code&gt; from its first
argument, then the compiled form of this rule builds the
&lt;code&gt;aarch64_madd&lt;&#x2F;code&gt; (&quot;multiply-add&quot;) term.&lt;&#x2F;p&gt;
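&lt;p&gt;In ordinary Rust, one could sketch this compiled form with fallible destructuring functions and the &lt;code&gt;?&lt;&#x2F;code&gt; operator standing in for match failure. This is a hypothetical sketch -- the node type and helper names are invented, not ISLE&#x27;s actual generated code:&lt;&#x2F;p&gt;

```rust
// A toy node type; matching "bottoms out" in fallible destructurers.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    IAdd(Box<Node>, Box<Node>),
    IMul(Box<Node>, Box<Node>),
    Madd(Box<Node>, Box<Node>, Box<Node>),
    Var(&'static str),
}

// match_op $x, iadd: unpack the node into its children, or fail.
fn match_iadd(n: &Node) -> Option<(&Node, &Node)> {
    if let Node::IAdd(a, b) = n { Some((&**a, &**b)) } else { None }
}

fn match_imul(n: &Node) -> Option<(&Node, &Node)> {
    if let Node::IMul(a, b) = n { Some((&**a, &**b)) } else { None }
}

// (iadd (imul a b) c) => (aarch64_madd a b c); `?` propagates failure
// so the caller can fall through to the next rule.
fn rule_madd(root: &Node) -> Option<Node> {
    let (t0, t1) = match_iadd(root)?; // $t0, $t1 := match_op $root, iadd
    let (t2, t3) = match_imul(t0)?;   // $t2, $t3 := match_op $t0, imul
    Some(Node::Madd(                  // create_op aarch64_madd
        Box::new(t2.clone()),
        Box::new(t3.clone()),
        Box::new(t1.clone()),
    ))
}

fn main() {
    let input = Node::IAdd(
        Box::new(Node::IMul(Box::new(Node::Var("a")), Box::new(Node::Var("b")))),
        Box::new(Node::Var("c")),
    );
    assert!(rule_madd(&input).is_some());
    // A plain add does not match; the engine would try the next rule.
    let plain = Node::IAdd(Box::new(Node::Var("a")), Box::new(Node::Var("b")));
    assert!(rule_madd(&plain).is_none());
}
```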
&lt;h3 id=&quot;programmable-matching&quot;&gt;Programmable Matching&lt;&#x2F;h3&gt;
&lt;p&gt;Rather than a fixed set of operators like &lt;code&gt;match_op&lt;&#x2F;code&gt;, what if we
allowed for environment-defined operators? What if operators like the
&lt;code&gt;aarch64_logicalimm&lt;&#x2F;code&gt; above were &quot;match operators&quot; as well, such that
they &quot;matched&quot; if the given &lt;code&gt;u64&lt;&#x2F;code&gt; could be encoded in the desired form
and failed to match otherwise?&lt;&#x2F;p&gt;
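&lt;p&gt;For instance, a fallible match operator for AArch64&#x27;s 12-bit arithmetic immediate (which the ISA allows to be optionally left-shifted by 12) might look like the following -- a hedged sketch with an invented name, not Cranelift&#x27;s actual helper:&lt;&#x2F;p&gt;

```rust
// An environment-defined match operator: succeeds (returning the
// encoded form) only if `value` fits AArch64's add/sub immediate
// format: a 12-bit value, optionally shifted left by 12.
fn match_imm12(value: u64) -> Option<(u16, bool)> {
    if value < (1 << 12) {
        Some((value as u16, false)) // unshifted imm12
    } else if (value & 0xfff) == 0 && (value >> 12) < (1 << 12) {
        Some(((value >> 12) as u16, true)) // imm12, LSL #12
    } else {
        None // no match: the engine falls through to the next rule
    }
}

fn main() {
    assert_eq!(match_imm12(42), Some((42, false)));
    assert_eq!(match_imm12(0x5000), Some((5, true)));
    assert_eq!(match_imm12(0x1001), None);
}
```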
&lt;p&gt;This is the essence of the &quot;external extractor&quot; idea (and the dual to
it, the &quot;external constructor&quot;) in ISLE. Once we allow user-defined
operators for the left-hand side (&quot;matching pattern&quot;) of a rule, we
actually no longer need any built-in notion of AST matching at all;
this becomes just another thing we can define in the &quot;standard
library&quot; or &quot;prelude&quot; of our DSL!&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The basic idea is to introduce the ability to &lt;em&gt;define&lt;&#x2F;em&gt; a term like
&lt;code&gt;iadd&lt;&#x2F;code&gt; or &lt;code&gt;imul&lt;&#x2F;code&gt; and associate an external Rust function with it. When
appearing on the left-hand side of a rewrite rule, terms match
&quot;outside in&quot;: that is, &lt;code&gt;(iadd (imul a b) c)&lt;&#x2F;code&gt; takes the root of the
input, tries to use &lt;code&gt;iadd&lt;&#x2F;code&gt; to destructure it to two arguments, and
tries to use &lt;code&gt;imul&lt;&#x2F;code&gt; to destructure it further. (This outside-in,
reversed order is the opposite of what one might expect if this were a
tree of function calls, because we are &lt;em&gt;destructuring&lt;&#x2F;em&gt; (extracting)
rather than &lt;em&gt;constructing&lt;&#x2F;em&gt;. We&#x27;ll explore the analogy to functions
more below.)&lt;&#x2F;p&gt;
&lt;p&gt;So, skipping straight to the real ISLE syntax now, we can define the
term like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(decl (iadd Value Value) Value)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and then associate a Rust function with it like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(extern extractor iadd my_iadd_impl)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and this implies the existence of a Rust function that the generated
matching code will invoke:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; my_iadd_impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage z-variable z-language&quot;&gt;&amp;amp;mut self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; input&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Option&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Likewise, we can define a term to be used on the right-hand side and
associate an implementation like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(decl (aarch64_madd Value Value Value) Value)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(extern constructor aarch64_madd my_madd_impl)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;with the Rust function&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; my_madd_impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage z-variable z-language&quot;&gt;&amp;amp;mut self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; b&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; c&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and then use it like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(rule (iadd (imul a b) c)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      (aarch64_madd a b c))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and the generated code will invoke the &lt;em&gt;external extractor&lt;&#x2F;em&gt;
&lt;code&gt;my_iadd_impl&lt;&#x2F;code&gt;; if it returns &lt;code&gt;Some&lt;&#x2F;code&gt; (a match), it will invoke whatever
external extractor is associated with &lt;code&gt;imul&lt;&#x2F;code&gt;; and if that also returns
&lt;code&gt;Some&lt;&#x2F;code&gt;, it will invoke &lt;code&gt;aarch64_madd&lt;&#x2F;code&gt; to &quot;construct&quot; the result. These
Rust functions can do &lt;em&gt;whatever they like&lt;&#x2F;em&gt;: we have abstracted away
the need to actually query a reified AST and mutate it.&lt;&#x2F;p&gt;
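To make this control flow concrete, here is a sketch of the kind of straight-line matching code the ISLE compiler could generate for this rule, run against a toy term store. All of the types and helper names here are illustrative stand-ins, not Cranelift's actual definitions.

```rust
// Toy term store; a Value is an index into `nodes`. Names are illustrative.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Value(usize);

enum Node {
    Leaf,
    IAdd(Value, Value),
    IMul(Value, Value),
    Madd(Value, Value, Value),
}

struct Ctx {
    nodes: Vec<Node>,
}

impl Ctx {
    // External extractor for `iadd`: Some((a, b)) iff `v` is an iadd node.
    fn my_iadd_impl(&mut self, v: Value) -> Option<(Value, Value)> {
        match &self.nodes[v.0] {
            Node::IAdd(a, b) => Some((*a, *b)),
            _ => None,
        }
    }

    // External extractor for `imul`.
    fn my_imul_impl(&mut self, v: Value) -> Option<(Value, Value)> {
        match &self.nodes[v.0] {
            Node::IMul(a, b) => Some((*a, *b)),
            _ => None,
        }
    }

    // External constructor for `aarch64_madd`.
    fn my_madd_impl(&mut self, a: Value, b: Value, c: Value) -> Value {
        self.nodes.push(Node::Madd(a, b, c));
        Value(self.nodes.len() - 1)
    }

    // What the compiled rule looks like: a chain of extractor calls that
    // falls through to `None` on the first failed match.
    fn lower(&mut self, root: Value) -> Option<Value> {
        if let Some((lhs, c)) = self.my_iadd_impl(root) {
            if let Some((a, b)) = self.my_imul_impl(lhs) {
                return Some(self.my_madd_impl(a, b, c));
            }
        }
        None
    }
}
```

Note that the generated code never walks a generic AST: it just calls the extractor functions in sequence and bails out on the first `None`.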
&lt;h3 id=&quot;extractors-and-the-execution-driven-view&quot;&gt;Extractors and the Execution-Driven View&lt;&#x2F;h3&gt;
&lt;p&gt;Another important consequence of this design is that we can define
&lt;em&gt;arbitrary&lt;&#x2F;em&gt; extractors and constructors, and they can have &lt;em&gt;arbitrary&lt;&#x2F;em&gt;
types. (ISLE is strongly-typed, with sum types that lower 1:1 to Rust
enums.) This neatly addresses the &quot;metadata or side-table&quot; question
above: we don&#x27;t need to generate auxiliary nodes in a real AST to
represent information about a value, and we don&#x27;t need to know when to
compute them; such computations can be driven by demand on the
matching side, and don&#x27;t need to reify any actual node.&lt;&#x2F;p&gt;
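As a small illustration of this demand-driven matching, an extractor can compute a derived fact about a value at match time, without any auxiliary AST node ever existing. The context and extractor name below are hypothetical, not Cranelift's:

```rust
use std::collections::HashMap;

// Hypothetical context: known constant values for some SSA values.
struct Ctx {
    constants: HashMap<u32, u64>,
}

impl Ctx {
    // A hypothetical extractor: matches only values known to be constant
    // powers of two, producing the shift amount. The "is a power of two"
    // fact is computed on demand; no metadata node is ever materialized.
    fn shift_amt_from_value(&mut self, v: u32) -> Option<u32> {
        let k = *self.constants.get(&v)?;
        if k.is_power_of_two() {
            Some(k.trailing_zeros())
        } else {
            None
        }
    }
}
```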
&lt;p&gt;Let&#x27;s take a step back and understand what we have done here. We have
taken a rule that expresses a tree rewrite -- like &lt;code&gt;(iadd (imul a b) c) =&amp;gt; (madd a b c)&lt;&#x2F;code&gt; -- and allowed for the terms in the left-hand side
and right-hand side to compile to Rust function calls, with
well-defined semantics. The DSL itself deals only with
pattern-matching; ASTs and compiler IRs are wholly outside of its
understanding. Nevertheless, if we ascribe formal semantics to &lt;code&gt;iadd&lt;&#x2F;code&gt;,
&lt;code&gt;imul&lt;&#x2F;code&gt; and &lt;code&gt;madd&lt;&#x2F;code&gt;, we can still reason about the rewrite at the
term-rewriting system level, and this is essential to formal
verification efforts for our compiler backends (see below!). We have
thus allowed for integration with existing, handwritten Rust code
while raising the abstraction level to allow for more declarative
reasoning.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#4&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s worth dwelling for a moment on the shift from an explicit AST
data structure, traversed and built by a rule-matching engine, to
calls to external extractors and constructors. As a consequence of
this shift, the term nodes corresponding to extractor matches need not
ever actually exist in memory. The ISLE rewrite flow thus works
something like the &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;wiki.c2.com&#x2F;?VisitorPattern&quot;&gt;visitor
pattern&lt;&#x2F;a&gt;&quot;: it introduces a level
of abstraction that decouples the consumption and production of data
from its representation, allowing more flexibility.&lt;&#x2F;p&gt;
&lt;p&gt;The execution-driven view of term rewriting gives us our rewrite
procedure for free, as well: rather than an engine that takes some
specification of patterns and rewrites, we compile rules to sequential
code that invokes extractors and constructors. &quot;Rewriting&quot; a top-level
term is equivalent to invoking a Rust function. If we define a term,
&lt;code&gt;lower&lt;&#x2F;code&gt;, corresponding to instruction lowering, then &lt;code&gt;(lower (iadd ...))&lt;&#x2F;code&gt; is a term that will be rewritten to whatever ISA-specific terms
the rules specify. This rewriting is done by a &lt;em&gt;Rust function&lt;&#x2F;em&gt; that
&lt;em&gt;implements&lt;&#x2F;em&gt; &lt;code&gt;lower&lt;&#x2F;code&gt;; we invoke it with the term&#x27;s arguments, and the
body will match on extractors as needed, then invoke constructors to
build the rewritten expression.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;explicit-types-and-implicit-conversions-in-isle&quot;&gt;Explicit Types (and Implicit Conversions) in ISLE&lt;&#x2F;h2&gt;
&lt;p&gt;The second key differentiator in ISLE as compared to most other
term-rewriting systems is its &lt;em&gt;strong type system&lt;&#x2F;em&gt;. It might not be
too surprising that a DSL that compiles to Rust would incorporate a
type system that mirrors Rust&#x27;s type system to some degree, e.g. with
sum types (enum variants). But this is actually a bit of a departure
from conventional compiler-backend rule systems, and allows
significant expressivity and safe-encapsulation wins, as we&#x27;ll see
below.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;why-types&quot;&gt;Why Types?&lt;&#x2F;h3&gt;
&lt;p&gt;A conventional rewrite system operates on an AST whose nodes all
represent values in the program. In other words, every term has the
same type (at the DSL level): we can replace &lt;code&gt;iadd&lt;&#x2F;code&gt; with &lt;code&gt;x86_add&lt;&#x2F;code&gt;
because both have type &lt;code&gt;Value&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;While this works fine for simple substitutions, it quickly breaks down
when various ISA complexities are considered. For example, how do we
model an addressing mode? We might wish to have a node &lt;code&gt;x86_add&lt;&#x2F;code&gt; that
accepts a &quot;register or memory&quot; operand, as x86 allows; and if a memory
operand, the memory address can have one of several different forms
(&quot;addressing modes&quot;): &lt;code&gt;[reg]&lt;&#x2F;code&gt;, &lt;code&gt;[reg + offset]&lt;&#x2F;code&gt;, &lt;code&gt;[reg + reg + offset]&lt;&#x2F;code&gt;, &lt;code&gt;[reg + scale*reg + offset]&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We could impose some ad-hoc structure on the AST in order to model
this: for example, an &lt;code&gt;x86_add&lt;&#x2F;code&gt; with an &lt;code&gt;x86_load&lt;&#x2F;code&gt; in its second
argument (or alternately, a separate opcode &lt;code&gt;x86_add_from_memory&lt;&#x2F;code&gt;)
could represent this case. Then we would need to have rules for the
address expression: if another &lt;code&gt;x86_add&lt;&#x2F;code&gt; node (but only with register
arguments!), we could absorb that into the instruction&#x27;s addressing
mode.&lt;&#x2F;p&gt;
&lt;p&gt;Ad-hoc structure like this is fragile, though, especially when
transformed by optimization passes. Moreover, as a general guideline,
the more we can put program invariants into the type system (at any
level), the more likely we are to be able to maintain the structure
across refactors or unexpected interactions.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;types-in-isle&quot;&gt;Types in ISLE&lt;&#x2F;h3&gt;
&lt;p&gt;ISLE thus allows terms to have types that resemble function types. One
can define a term&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(decl lower_amode (Value) AMode)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;that takes one argument, a &lt;code&gt;Value&lt;&#x2F;code&gt;, and produces an &lt;code&gt;AMode&lt;&#x2F;code&gt;. This type
can then be defined as&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt;type&lt;&#x2F;span&gt;&lt;span&gt; AMode (enum (Reg (reg Reg))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  (RegReg (ra Reg)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                          (rb Reg))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  (RegOffset (base Reg)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                             (offset u32))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and so on. The &lt;code&gt;AMode&lt;&#x2F;code&gt; type compiles to an enum in Rust with the specified
enum variants, making interop with Rust code (in external extractors
and constructors) via these rich types straightforward. In our machine
backends, machine instructions are defined as enums and constructed
directly in the ISLE.&lt;&#x2F;p&gt;
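For concreteness, the `AMode` declaration above corresponds to a Rust enum along these lines (with `Reg` simplified to a plain index here; the real register type is richer):

```rust
// Stand-in for the real register type.
type Reg = u8;

// Roughly what the ISLE `(type AMode (enum ...))` declaration lowers to.
#[derive(Debug, PartialEq)]
enum AMode {
    Reg { reg: Reg },
    RegReg { ra: Reg, rb: Reg },
    RegOffset { base: Reg, offset: u32 },
}
```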
&lt;h3 id=&quot;typed-terms-as-functions&quot;&gt;Typed Terms as Functions&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we have seen how to give a &quot;signature&quot; to a term, it&#x27;s worth
discussing how one can see terms -- both extractors and constructors
-- as functions, albeit in opposite directions. In particular, with a term&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(decl F (A B C) R)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;that has arguments (or AST child nodes) of types &lt;code&gt;A, B, C&lt;&#x2F;code&gt;, with a
type &lt;code&gt;R&lt;&#x2F;code&gt; for the term itself, one can see:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;F&lt;&#x2F;code&gt; as a function &lt;em&gt;from&lt;&#x2F;em&gt; &lt;code&gt;A, B, C&lt;&#x2F;code&gt; &lt;em&gt;to&lt;&#x2F;em&gt; &lt;code&gt;R&lt;&#x2F;code&gt;, when used as a
constructor; or&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;F&lt;&#x2F;code&gt; as a function &lt;em&gt;from&lt;&#x2F;em&gt; &lt;code&gt;R&lt;&#x2F;code&gt; &lt;em&gt;to&lt;&#x2F;em&gt; &lt;code&gt;A, B, C&lt;&#x2F;code&gt;, when used as an
extractor.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;In other words, given a tree of terms in a pattern (left-hand side),
where terms are interpreted as extractors, we can see each term as a
function invocation from the &quot;outer&quot; type (the thing being
destructured) to the &quot;inner&quot; types (the pieces that are the result of
the destructuring). Conversely, given a tree of terms in an expression
(right-hand side), we can see each term as a function from the &quot;inner&quot;
types (the arguments of the new thing being constructed) to the
&quot;outer&quot; type (the return value). This is another way of seeing the
&quot;execution-driven&quot; view of ISLE semantics described above.&lt;&#x2F;p&gt;
&lt;p&gt;Note that the extractor form of &lt;code&gt;F&lt;&#x2F;code&gt; above is ordinarily a &lt;em&gt;partial&lt;&#x2F;em&gt;
function -- that is, it is allowed to have &lt;em&gt;no&lt;&#x2F;em&gt; mapping for a particular
value. This is how we formally think about the &quot;doesn&#x27;t match&quot; case
when searching for a particular kind of node in an AST, or any other
matcher on the left-hand side of a pattern. In contrast, the constructor
is normally &lt;em&gt;total&lt;&#x2F;em&gt; -- cannot fail -- unless explicitly declared to be
partial. (Partial constructors are useful for &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;isle-extended-patterns.md&quot;&gt;&lt;code&gt;if-let&lt;&#x2F;code&gt;
clauses&lt;&#x2F;a&gt;.)&lt;&#x2F;p&gt;
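With placeholder types, the two readings of `(decl F (A B C) R)` can be written out as Rust signatures: the constructor is a total function, and the extractor is a partial one.

```rust
// Placeholder argument types for the term (decl F (A B C) R).
type A = u32;
type B = bool;
type C = char;

// Values of type R may or may not have been built by F, so extraction can fail.
#[derive(Debug, PartialEq)]
enum R {
    F(A, B, C),
    Other,
}

// Constructor reading: a total function from (A, B, C) to R.
fn construct_f(a: A, b: B, c: C) -> R {
    R::F(a, b, c)
}

// Extractor reading: a partial function from R back to (A, B, C).
fn extract_f(r: &R) -> Option<(A, B, C)> {
    match r {
        R::F(a, b, c) => Some((*a, *b, *c)),
        R::Other => None,
    }
}
```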
&lt;h3 id=&quot;types-for-invariants&quot;&gt;Types for Invariants&lt;&#x2F;h3&gt;
&lt;p&gt;Support for arbitrary types lets us much more richly capture
invariants as well, by encapsulating values (on the input side) or
machine instructions (on the output side) so that they can only be
used or combined in legal ways. For example:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Many instruction-set architectures have a &quot;flags&quot; register that is
set by certain operations with bits that correspond to conditions
(result was zero, result was negative, etc.) and used by conditional
branches and conditional moves. This is &quot;global&quot; or &quot;ambient&quot; state
and one has to be careful to use the flags after computing them,
without overwriting them in the meantime. To ensure exact
correspondence of a particular flag-producer and flag-consumer,
certain instruction constructors create &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;0c615365c645776c1abf42be89edc1d5292163c8&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;prelude_lower.isle#L319-L358&quot;&gt;&lt;code&gt;ProducesFlags&lt;&#x2F;code&gt; and
&lt;code&gt;ConsumesFlags&lt;&#x2F;code&gt;
values&lt;&#x2F;a&gt;
rather than raw instructions. These can then be emitted &lt;em&gt;together&lt;&#x2F;em&gt;,
with no clobber in the middle, with the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;0c615365c645776c1abf42be89edc1d5292163c8&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;prelude_lower.isle#L389-L399&quot;&gt;&lt;code&gt;with_flags&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
constructor.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;There is a distinction between an IR-level &lt;code&gt;Value&lt;&#x2F;code&gt; and a
machine-level value in a register, which we denote with &lt;code&gt;Reg&lt;&#x2F;code&gt;. When
a lowering rule requires an input to be in a register, it can use
the &lt;code&gt;put_in_reg&lt;&#x2F;code&gt; constructor, which takes a &lt;code&gt;Value&lt;&#x2F;code&gt; and produces
(rewrites to) a &lt;code&gt;Reg&lt;&#x2F;code&gt;. This provides a way for us to do bookkeeping
(note that the value was used, and ensure that we codegen its
producer), but also allows us to distinguish &lt;em&gt;how&lt;&#x2F;em&gt; to place the
value in the register: one may wish to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;0c615365c645776c1abf42be89edc1d5292163c8&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;aarch64&#x2F;inst.isle#L2780-L2812&quot;&gt;sign- or
zero-extend&lt;&#x2F;a&gt;
the value.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;There is a distinction between an IR-level &lt;code&gt;Value&lt;&#x2F;code&gt; and the
instruction that produces it. Not every &lt;code&gt;Value&lt;&#x2F;code&gt; is defined by an
instruction; some are &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Static_single-assignment_form#Block_arguments&quot;&gt;block
parameters&lt;&#x2F;a&gt;. Furthermore,
at a given point in the lowering process, we may not be &lt;em&gt;allowed&lt;&#x2F;em&gt; to
see the producer of a value, if we cannot &quot;sink&quot; its effect (merge
it into an instruction generated at the current point). Thus,
instructions have &lt;code&gt;Value&lt;&#x2F;code&gt;s as operands, but one goes from a &lt;code&gt;Value&lt;&#x2F;code&gt;
to an &lt;code&gt;Inst&lt;&#x2F;code&gt; (instruction ID) with
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;0c615365c645776c1abf42be89edc1d5292163c8&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;prelude_lower.isle#L220-L222&quot;&gt;&lt;code&gt;def_inst&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which may or may not match depending on whether there is an &lt;code&gt;Inst&lt;&#x2F;code&gt;
and we can see&#x2F;merge it.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
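The flags case in the first bullet can be sketched in a few lines of Rust; the types and names below are simplified stand-ins for the linked `ProducesFlags`/`ConsumesFlags` machinery, not the real definitions:

```rust
// Simplified stand-ins for Cranelift's flag-safety wrapper types.
struct Inst(&'static str);
struct ProducesFlags { inst: Inst }
struct ConsumesFlags { inst: Inst }

// Because rules build ProducesFlags/ConsumesFlags values instead of emitting
// raw instructions, the only way to get both into the instruction stream is a
// combinator like this, which pins them back-to-back with nothing in between
// that could clobber the flags.
fn with_flags(p: ProducesFlags, c: ConsumesFlags, out: &mut Vec<&'static str>) {
    out.push(p.inst.0);
    out.push(c.inst.0);
}
```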
&lt;h3 id=&quot;implicit-conversions&quot;&gt;Implicit Conversions&lt;&#x2F;h3&gt;
&lt;p&gt;Type-safe abstractions allow for well-defined and safe interfaces, but
can lead to verbose code. After several months of experience with
ISLE, we were finding that we wrote rules like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(rule (lower (iadd (def_inst (imul x y)) z))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (madd&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   (put_in_reg x)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   (put_in_reg y)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   (put_in_reg z)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;when we would prefer to write more natural rules like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(rule (lower (iadd (imul x y) z))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      (madd x y z))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;as in our original term-rewriting examples above. At first it seemed
we had to choose one or the other: the shorter form would require
abandoning some of the type distinctions we were making. But in
actuality, there is some redundancy. Consider: when typechecking the
left-hand side pattern, we know that &lt;code&gt;iadd&lt;&#x2F;code&gt;&#x27;s arguments (which the
extractor will produce, if it can destructure the &lt;code&gt;Inst&lt;&#x2F;code&gt; type) have
type &lt;code&gt;Value&lt;&#x2F;code&gt;, but the inner &lt;code&gt;imul&lt;&#x2F;code&gt; expects an &lt;code&gt;Inst&lt;&#x2F;code&gt;. In our prelude
we have one canonical term that converts from one to the other:
&lt;code&gt;def_inst&lt;&#x2F;code&gt;. Likewise, in the right-hand side, bindings &lt;code&gt;x&lt;&#x2F;code&gt;, &lt;code&gt;y&lt;&#x2F;code&gt; and
&lt;code&gt;z&lt;&#x2F;code&gt; have type &lt;code&gt;Value&lt;&#x2F;code&gt;, but the &lt;code&gt;madd&lt;&#x2F;code&gt; constructor that builds a
machine instruction requires &lt;code&gt;Reg&lt;&#x2F;code&gt; types (which are virtual registers
in pre-regalloc machine code). We likewise have one term,
&lt;code&gt;put_in_reg&lt;&#x2F;code&gt;, that can do this conversion. If there is only &lt;em&gt;one&lt;&#x2F;em&gt;
canonical way, or usual way, to make the conversion, and if the types
on &lt;em&gt;both&lt;&#x2F;em&gt; sides of the conversion are already known, why can&#x27;t we
rectify the types by inserting necessary conversions automatically?&lt;&#x2F;p&gt;
&lt;p&gt;ISLE thus has one final trick up its sleeve to improve ease-of-use:
implicit conversions. By specifying converter terms for pairs of types
like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(convert Inst Value def_inst)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(convert Value Reg put_in_reg)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;the typechecking pass can expand the pattern and rewrite-expression
ASTs as necessary. This makes writing lowering rules much more natural
with less boilerplate, and we have a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;0c615365c645776c1abf42be89edc1d5292163c8&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;prelude_lower.isle#L712-L723&quot;&gt;fairly rich set of implicit
conversions&lt;&#x2F;a&gt;
defined in our prelude to facilitate this.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;putting-it-all-together-ast-patterns-in-isle&quot;&gt;Putting it All Together: AST Patterns in ISLE&lt;&#x2F;h2&gt;
&lt;p&gt;Now that we&#x27;ve taken a tour of the various features of ISLE, let&#x27;s
review what we have built. We have started with a desire to express
high-level rewrite rules that equate AST nodes -- such as &lt;code&gt;(iadd (imul x y) z)&lt;&#x2F;code&gt; and &lt;code&gt;(madd x y z)&lt;&#x2F;code&gt; -- and to have a system that performs such
rewrites, in a way that interoperates with the existing compiler
infrastructure and has predictable and comprehensible behavior.&lt;&#x2F;p&gt;
&lt;p&gt;ISLE allows high-level patterns to be compiled to straightforward Rust
pattern-matching code. A strong type system with sum types (enums)
ensures that terms in patterns and rewrites match the expected schema,
and allows for expressing high-level invariants. Implicit conversions
leverage these types to remove redundancy in the patterns, allowing
for more natural high-level forms while retaining the useful
type-level distinctions. The ability to arbitrarily define
&quot;extractors&quot; to use in patterns allows us to build up a rich
pattern-language in our prelude, matching trees of operators, values,
and pieces of the input program with various properties in a
programmable way. And the well-defined mapping to Rust and an
execution scheme that maps terms directly to function calls, rather
than an incrementally-rewritten AST, allows for code as efficient as
what one would write by hand.&lt;&#x2F;p&gt;
&lt;p&gt;We have thus gone from &lt;code&gt;(iadd (imul x y) z) =&amp;gt; (madd x y z)&lt;&#x2F;code&gt; to
something like:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Match the input &lt;code&gt;Inst&lt;&#x2F;code&gt; against the &lt;code&gt;iadd&lt;&#x2F;code&gt; enum variant, getting
argument &lt;code&gt;Value&lt;&#x2F;code&gt;s if so;&lt;&#x2F;li&gt;
&lt;li&gt;Get the &lt;code&gt;Inst&lt;&#x2F;code&gt; that produced the first &lt;code&gt;Value&lt;&#x2F;code&gt;, if we&#x27;re allowed to
merge it, via &lt;code&gt;def_inst&lt;&#x2F;code&gt;;&lt;&#x2F;li&gt;
&lt;li&gt;Match that &lt;code&gt;Inst&lt;&#x2F;code&gt; against the &lt;code&gt;imul&lt;&#x2F;code&gt; enum variant, getting its
argument &lt;code&gt;Value&lt;&#x2F;code&gt;s if so;&lt;&#x2F;li&gt;
&lt;li&gt;Put all three remaining &lt;code&gt;Value&lt;&#x2F;code&gt;s in registers with calls to
&lt;code&gt;put_in_reg&lt;&#x2F;code&gt;; and&lt;&#x2F;li&gt;
&lt;li&gt;Emit an &lt;code&gt;madd&lt;&#x2F;code&gt; instruction with these registers&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;all &lt;em&gt;via bindings defined in ISLE itself&lt;&#x2F;em&gt; and &lt;em&gt;without any knowledge
of &quot;instruction lowering&quot; in the ISLE DSL compiler&lt;&#x2F;em&gt;. As concrete
evidence that the last point is valuable, we have been able to use
ISLE for CLIF-to-CLIF rewrite rules in our new mid-end optimization
framework as well, simply by defining a new prelude (see below!).&lt;&#x2F;p&gt;
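The bullet steps above can be sketched as the kind of straight-line Rust that the ISLE compiler emits. The toy IR, the "mergeable definition" side-table, and every name below are illustrative stand-ins:

```rust
use std::collections::HashMap;

type Value = u32;
type Reg = u32;

#[derive(Clone, Copy)]
enum Inst {
    IAdd(Value, Value),
    IMul(Value, Value),
}

#[derive(Debug, PartialEq)]
enum MachInst {
    Madd(Reg, Reg, Reg),
}

struct LowerCtx {
    // Producing instruction for a value, present only if we may merge it.
    mergeable_def: HashMap<Value, Inst>,
    regs: HashMap<Value, Reg>,
    next_reg: Reg,
}

impl LowerCtx {
    fn def_inst(&self, v: Value) -> Option<Inst> {
        self.mergeable_def.get(&v).copied()
    }

    // Assign (or reuse) a virtual register for a value.
    fn put_in_reg(&mut self, v: Value) -> Reg {
        if let Some(&r) = self.regs.get(&v) {
            return r;
        }
        let r = self.next_reg;
        self.next_reg += 1;
        self.regs.insert(v, r);
        r
    }

    // The five bullet steps, in order.
    fn lower(&mut self, inst: Inst) -> Option<MachInst> {
        if let Inst::IAdd(lhs, z) = inst {                       // match iadd
            if let Some(Inst::IMul(x, y)) = self.def_inst(lhs) { // def_inst, then imul
                let rx = self.put_in_reg(x);                     // values -> registers
                let ry = self.put_in_reg(y);
                let rz = self.put_in_reg(z);
                return Some(MachInst::Madd(rx, ry, rz));         // emit madd
            }
        }
        None
    }
}
```

If the producer of the first operand is not visible (a block parameter, or an unmergeable effectful instruction), `def_inst` returns `None` and the rule simply does not fire.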
&lt;h2 id=&quot;ongoing-and-future-benefits-of-declarative-rules&quot;&gt;Ongoing and Future Benefits of Declarative Rules&lt;&#x2F;h2&gt;
&lt;p&gt;The next most exciting thing about writing lowering rules
declaratively -- after the expressivity and productivity win while
developing the compiler itself -- is that being able to reason about
the rules &lt;em&gt;as data&lt;&#x2F;em&gt; lets us analyze them in various ways and check
that they satisfy desirable properties.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;correctness-and-formal-verification&quot;&gt;Correctness and Formal Verification&lt;&#x2F;h3&gt;
&lt;p&gt;As an example, during the development of our rulesets, we found that
it was sometimes unintuitive which rule would fire first. We have a
priority mechanism to allow this to be controlled in an arbitrary way,
but the default heuristics (roughly, &quot;more specific rule first&quot;) were
sometimes counter-intuitive.&lt;&#x2F;p&gt;
&lt;p&gt;We thus invented the idea of an &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;4906&quot;&gt;overlap
checker&lt;&#x2F;a&gt;,
initially implemented by my brilliant colleague &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.linkedin.com&#x2F;in&#x2F;elliotttrevor&#x2F;&quot;&gt;Trevor
Elliott&lt;&#x2F;a&gt; and subsequently
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5195&quot;&gt;redesigned with a new internal representation and
algorithm&lt;&#x2F;a&gt; by
my other brilliant colleague &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;jamey.thesharps.us&#x2F;&quot;&gt;Jamey
Sharp&lt;&#x2F;a&gt;. The key idea is to define &quot;rule
overlap&quot; such that two rules overlap if a given input to the
pattern-matching could cause either rule to fire. In these (and only
these) cases, priority and&#x2F;or default ordering heuristics determine
which rule actually does fire. Then we decided that in such cases, we
would &lt;em&gt;require&lt;&#x2F;em&gt; the ISLE author to use the priority mechanism to
explicitly choose one of the rules. In other words, no two overlapping
rules can have the same priority. Through a series of PRs to fix up
our existing rules, we were able to actually find several cases where
rules were fully &quot;shadowed&quot;, or unreachable because some other more
general rule was always firing first. We turned on enforcing mode for
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5011&quot;&gt;non-overlap among same-priority
rules&lt;&#x2F;a&gt; and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5322&quot;&gt;non-shadowing of rules by higher-priority
rules&lt;&#x2F;a&gt; after
fixing several cases, and as a result, we now have more confidence in
the correctness of our rulesets.&lt;&#x2F;p&gt;
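As a tiny illustration of overlap (with invented term names, not real Cranelift rules): both of the following rules can fire on an `iadd` whose second operand is a zero constant, so under the checker the author must disambiguate, for example with an explicit priority on the more specific rule:

```lisp
;; Both rules match (iadd x (iconst 0)), so they overlap; by default,
;; heuristics would pick the more specific one.
(rule (lower (iadd x y))
      (x86_add x y))
(rule (lower (iadd x (iconst 0)))
      (x86_mov x))

;; With non-overlap enforced at equal priority, the tie must be broken
;; explicitly, e.g. by raising the specific rule's priority:
(rule 1 (lower (iadd x (iconst 0)))
      (x86_mov x))
```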
&lt;p&gt;On a broader scale, writing rules as equivalences from one AST to
another lets us verify that the two sides are, well, actually
equivalent! There is an ongoing collaboration with some
formal-verification researchers who are adding &lt;em&gt;annotations&lt;&#x2F;em&gt; to
external extractors and constructors that describe their semantics
(mostly in terms of an SMT checker&#x27;s theory-of-bitvectors
primitives). Given these &quot;specs&quot;, one could lower each ISLE rule to
SMT clauses rather than executable Rust code, and search for cases
where it is incorrect. I won&#x27;t steal any thunder here -- the work is
really exciting (also still in progress) and the researchers will
present it in due time -- but it&#x27;s an example of what declarative DSLs
allow.&lt;&#x2F;p&gt;
&lt;p&gt;The ISLE-to-Rust compiler (metacompiler) is also a fairly complex tool
in its own right, and has had bugs in the past. What makes us
confident that we are generating code that correctly implements the
rules -- even if the rules themselves are verified? To answer that
question, we &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5435#pullrequestreview-1228022447&quot;&gt;have
considered&lt;&#x2F;a&gt;
how to verify &lt;em&gt;the translation of the ISLE rules&lt;&#x2F;em&gt;. Our current plan is
to modify the ISLE compiler to generate both the production backend,
with intelligently-scheduled matching operations, and a &quot;naive&quot;
version that runs through rules sequentially in priority order. If the
latter picks the same rule as the former, then we know we have a
faithful implementation of the left-hand-side matching (and the
translation of the right-hand-side expression to constructor calls is
straightforward in comparison, so we trust it already). Then we can
trust that our verified-correct rules are being applied as written.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;optimizing-the-instruction-selector&quot;&gt;Optimizing the Instruction Selector&lt;&#x2F;h3&gt;
&lt;p&gt;Next, my colleague Jamey is working on &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5435&quot;&gt;a new ISLE metacompiler
backend&lt;&#x2F;a&gt; that
lowers ISLE rules to a planned sequence of matching ops more
efficiently. The ability to change the way that compiler-backend code
matches the IR was a theoretical benefit of a DSL-based approach,
recognized and evaluated as we weighed the pros and cons of ISLE, but
admittedly was a bit speculative -- no one knew if we would actually
find a better way to generate code from the rules than the initial
ISLE compiler and its &quot;trie&quot;-based approach. I am very excited that
Jamey actually &lt;em&gt;did&lt;&#x2F;em&gt; manage to do this (and we can hopefully look
forward to a future blog post in which he describes this in more
detail!).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;mid-end-optimizations&quot;&gt;Mid-end Optimizations&lt;&#x2F;h3&gt;
&lt;p&gt;Finally, as mentioned above, we were able to find a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;opts&#x2F;algebraic.isle&quot;&gt;second&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;opts&#x2F;cprop.isle&quot;&gt;use&lt;&#x2F;a&gt;
for ISLE as part of the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;cranelift-isel-isle-peepmatic.md&quot;&gt;egraph-based mid-end
optimizer&lt;&#x2F;a&gt;
work, which we &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5587&quot;&gt;just enabled by
default&lt;&#x2F;a&gt;. (I
hope to write a blog post about this soon too!) This was excellent and
very satisfying validation to me personally that ISLE is more general
than just Cranelift backends: we were able to write a new prelude (and
actually share a bunch of extractors too) so that rules could specify
IR-to-IR rewrites, in addition to IR-to-machine-instruction
lowerings. This will allow us to iterate on and improve the compiler&#x27;s
suite of optimizations more easily in the future, and it will also
have follow-on benefits in terms of shared infrastructure:
verification tools that we build for backend lowering rules can also
be adapted to verify mid-end rules. In addition, there are potentially
other ways that putting all of the compiler&#x27;s core program-transform
logic into the same DSL will allow us to blur the lines, combine
stages, or move logic around in ways we can&#x27;t yet anticipate today;
but it seems like a worthwhile investment. In any case, ISLE proved to
be a quite useful tool in developing pattern-matching Rust code with
less boilerplate and tedium than before!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;ISLE has been great fun to design, build, and use; while we have learned a lot
and made several language adjustments and extensions over the past year, I
think that there is general consensus that it has made the compiler backends
easier to work on. I&#x27;m excited to see how the ongoing work (verification, new
ISLE codegen strategy) turns out, and how the language evolves in general. And
as noted above, ISLE&#x27;s secret is that it is actually more general than
instruction selection, or Cranelift: if you find another way to use it, I&#x27;d be
very interested in hearing about it!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to Jamey Sharp and Nick Fitzgerald for reading and providing
very helpful feedback on a draft of this blog post, and to bjorn3 and
Adrian Sampson for feedback and typo fixes after publication.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;A Aho, M S Lam, R Sethi, J Ullman. Compilers: Principles,
Techniques, and Tools. 2006.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;S Muchnick. Advanced Compiler Design and Implementation. 1997.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;ISLE &lt;em&gt;does&lt;&#x2F;em&gt; have special knowledge about Rust enums, and the
ability to match on them efficiently with &lt;code&gt;match&lt;&#x2F;code&gt; expressions in
the generated Rust code, because to miss this optimization would
be very costly. But in principle it could have been built
without this, involving only Rust function calls and control
flow around them.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;4&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;There is a parallel to the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Prolog&quot;&gt;Prolog&lt;&#x2F;a&gt; language here, in
that it also allows for high-level, declarative expression of
rule-matching with backtracking while also having a well-defined
sequential execution semantics with FFI to an imperative
world. In fact Prolog was a central inspiration for ISLE&#x27;s
design. The key differences are that (i) ISLE does not have
full backtracking -- once a left-hand side matches, we cannot
backtrack, as the right-hand sides are infallible -- and (ii)
there is no unification, and all dataflow in a term is
unidirectional, from input (value to be destructured) to output
(arguments). We used to have &quot;argument polarity&quot;, which was
closer to unification in that it allowed a configurable (but
fixed) input&#x2F;output direction for each argument to an
extractor. We discarded this feature, however, in favor of a
more general &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;isle-extended-patterns.md&quot;&gt;&lt;code&gt;if-let&lt;&#x2F;code&gt;
clause&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Cranelift, Part 4: A New Register Allocator</title>
        <published>2022-06-09T00:00:00+00:00</published>
        <updated>2022-06-09T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2022/06/09/cranelift-regalloc2/"/>
        <id>https://cfallin.org/blog/2022/06/09/cranelift-regalloc2/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2022/06/09/cranelift-regalloc2/">&lt;p&gt;This post is the fourth part of a three-part series&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; describing
work that I have been doing to improve the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&quot;&gt;Cranelift&lt;&#x2F;a&gt;
compiler. In this post, I&#x27;ll describe the work that went into
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&quot;&gt;regalloc2&lt;&#x2F;a&gt;, a new
register allocator I developed over the past year. The allocator
started as an effort to port &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;jit&#x2F;BacktrackingAllocator.cpp&quot;&gt;IonMonkey&#x27;s register
allocator&lt;&#x2F;a&gt;
to Rust, in a standalone form usable by Cranelift (&quot;how hard could it
be?&quot;), but quickly evolved during a focused optimization effort to
have its own unique design and implementation aspects.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Register_allocation&quot;&gt;Register
allocation&lt;&#x2F;a&gt; is a
classically hard
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;NP-hardness&quot;&gt;NP-hard!&lt;&#x2F;a&gt;) problem, and a
good solution is mainly a question of concocting a suitable
combination of heuristics and engineering high-performance data
structures and algorithms such that &lt;em&gt;most&lt;&#x2F;em&gt; cases are &lt;em&gt;good enough&lt;&#x2F;em&gt;
with &lt;em&gt;few enough&lt;&#x2F;em&gt; exceptions. As I&#x27;ve found, this rabbithole goes
infinitely deep and there is always more to improve, but for now we&#x27;re
in a fairly good place.&lt;&#x2F;p&gt;
&lt;p&gt;We &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;3989&quot;&gt;recently switched over to
regalloc2&lt;&#x2F;a&gt;,
and Cranelift 0.84 and the concurrently released Wasmtime 0.37 use it
by default. Some
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;3942&quot;&gt;measurements&lt;&#x2F;a&gt;
show that it generally improves overall compiler speed (which was and
is dominated by regalloc time) by 20%-ish, and generated code
performance improves on register pressure-impacted benchmarks up to
10-20% in Wasmtime. In Cranelift&#x27;s use as a backend for rustc via
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bjorn3&#x2F;rustc_codegen_cranelift&quot;&gt;rustc_codegen_cranelift&lt;&#x2F;a&gt;,
runtime performance improved by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;3989#issuecomment-1092110720&quot;&gt;up to
7%&lt;&#x2F;a&gt;. The
allocator seems to have generally fewer compile-time outliers than our
previous allocator, which in many cases is a more important property
than 10-20% improvements. Overall, it seems to be a reasonable
performance win with few downsides. Of course, this work benefits
hugely from the lessons learned in developing that prior allocator,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;&quot;&gt;regalloc.rs&lt;&#x2F;a&gt;, which
was work primarily done by Julian Seward and Benjamin Bouvier; I
learned enormous amounts talking to them and watching their work on
regalloc.rs in 2020, and this work stands on their shoulders as well
as IonMonkey&#x27;s.&lt;&#x2F;p&gt;
&lt;p&gt;This post will make a whirlwind tour through several topics. After
reviewing the register allocation problem and why it is important, we
will learn about regalloc2&#x27;s approach: its abstractions, its key
features, and how its passes work. We&#x27;ll then spend a good amount of
time on &quot;lessons learned&quot;: how we attained reasonable performance; how
we managed to make anything work at all in reasonable development
time; how we migrated a large existing compiler codebase to new
foundational types and invariants; and some perspective on ongoing
tuning and refinements.&lt;&#x2F;p&gt;
&lt;p&gt;A &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;doc&#x2F;DESIGN.md&quot;&gt;design
document&lt;&#x2F;a&gt;
for the allocator exists as well, and this blogpost is meant to be
complementary: we&#x27;ll walk through some interesting bits of the
architecture, but anyone hoping to actually grok the thing in its
entirety (and please talk to me if this is you!) is advised to dig
into the design doc and the source for the full story.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, a fair warning: this post has become a bit of a book chapter;
if you&#x27;re looking for a tl;dr, you can skip to the
&lt;a href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;blog&#x2F;2022&#x2F;06&#x2F;09&#x2F;cranelift-regalloc2&#x2F;#four-lessons&quot;&gt;Lessons&lt;&#x2F;a&gt; section or the &lt;a href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;blog&#x2F;2022&#x2F;06&#x2F;09&#x2F;cranelift-regalloc2&#x2F;#conclusions&quot;&gt;Conclusions&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;register-allocation-recap&quot;&gt;Register Allocation: Recap&lt;&#x2F;h2&gt;
&lt;p&gt;First, let&#x27;s recap what register allocation is and why it&#x27;s
important.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The basic problem at hand is to assign &lt;em&gt;storage locations&lt;&#x2F;em&gt; to &lt;em&gt;program
dataflow&lt;&#x2F;em&gt;. We can imagine our compiler input as a graph of operators,
each of which consumes some values and produces others (let&#x27;s ignore
control flow for the moment):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;cranelift-regalloc2-fig1-web.svg&quot; alt=&quot;Figure: A dataflow graph&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Some compilers directly represent the program in this way (called a
&quot;sea of nodes&quot; IR) but most, including Cranelift, &lt;em&gt;linearize&lt;&#x2F;em&gt; the
operators into a particular program order. And in fact, by the time
that the register allocator does its work, the &quot;operators&quot; are really
machine instructions, or something very close to them, so we will call
them that: we have a sequence of instructions, and &lt;em&gt;program points&lt;&#x2F;em&gt;
before and after each one. Even in this new view, we still have the
dataflow connectivity that we did above; now, each edge corresponds to
a particular &lt;em&gt;range of program points&lt;&#x2F;em&gt;:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;cranelift-regalloc2-fig2-web.svg&quot; alt=&quot;Figure: Dataflow graph with liveranges&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We call each of these dataflow edges, representing a value that must
flow from one instruction to another, a &lt;em&gt;liverange&lt;&#x2F;em&gt;. We say that
virtual registers -- the names we give the values before regalloc --
have a set of liveranges.&lt;&#x2F;p&gt;
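&lt;p&gt;As a rough illustration (a toy sketch of my own, not Cranelift&#x27;s actual code), liveranges for a straight-line program can be computed in one forward pass, extending each virtual register&#x27;s range at every use:&lt;&#x2F;p&gt;

```python
def liveranges(insts):
    """insts: list of (defs, uses) tuples, in program order.
    Returns {vreg: (def_point, last_use_point)}."""
    ranges = {}
    for point, (defs, uses) in enumerate(insts):
        for v in uses:
            start, _ = ranges[v]        # must have been defined earlier
            ranges[v] = (start, point)  # extend the range to this use
        for v in defs:
            ranges[v] = (point, point)  # a range begins at its def
    return ranges

prog = [
    (["v0"], []),           # v0 := ...
    (["v1"], ["v0"]),       # v1 := op v0
    (["v2"], ["v0", "v1"])  # v2 := op v0, v1
]
print(liveranges(prog))
# v0 lives over points 0..2, v1 over 1..2, v2 is defined at 2
```

With control flow this becomes a fixpoint dataflow computation rather than a single pass, but the core idea -- ranges of program points per virtual register -- is the same.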
&lt;p&gt;With control flow, liveranges might be discontiguous from a
linear-instruction-order point of view, because of jumps; for example:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;cranelift-regalloc2-fig3-web.svg&quot; alt=&quot;Figure: Control flow with discontiguous liveranges&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Each instruction requires its inputs to be in registers and produces
its outputs in registers, usually.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; So, the job of the register
allocator is to choose one of a finite set of machine registers to
convey each of the liveranges from its definition(s) to all of its
use(s).&lt;&#x2F;p&gt;
&lt;p&gt;Why is this hard? In brief, because we may not have enough
registers. We thus enter a world of &lt;em&gt;compromise&lt;&#x2F;em&gt;: if more values are
&quot;alive&quot; (need to be kept around for later use) than the number of
registers that the CPU has, then we have to put some of them somewhere
else, and bring them back into registers only when we actually need to
use them. That &quot;somewhere else&quot; is usually memory in the function&#x27;s
stack frame that the compiler reserves (a &quot;stackslot&quot;), and the
process of saving values away to free up registers is called
&quot;spilling&quot;.&lt;&#x2F;p&gt;
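&lt;p&gt;To make &quot;more values alive than registers&quot; concrete, here is a small sketch (names and framing mine): a sweep over liverange endpoints computes the peak number of simultaneously-live values, and if that peak exceeds the machine&#x27;s register count, some spill is unavoidable:&lt;&#x2F;p&gt;

```python
def max_pressure(ranges):
    """ranges: list of (start, end) liveranges over program points.
    Returns the maximum number of values live at any single point."""
    events = []
    for start, end in ranges:
        events.append((start, 1))     # value becomes live
        events.append((end + 1, -1))  # value dies after its last use
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

ranges = [(0, 5), (1, 3), (2, 4), (3, 6)]
print(max_pressure(ranges))      # 4 values overlap at point 3
print(max_pressure(ranges) > 3)  # True: with 3 registers, a spill is forced
```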
&lt;p&gt;One more concept before we go further: we may want to choose to place
a liverange in &lt;em&gt;different&lt;&#x2F;em&gt; places throughout its lifetime, depending
on the needs of the instruction sequence at certain points. For
example, if a value is produced at the top of a function, then dormant
(but live) for a while, and then used frequently in a tight loop at
the bottom of the function, we don&#x27;t &lt;em&gt;really&lt;&#x2F;em&gt; want to spill it, and
reload it from memory every loop iteration. In other words, given this
program:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v0 := ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v1 := ...    &#x2F;&#x2F; lots of intermediate defs that use all regs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v2 := ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    vN := ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;loop:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v100 := ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v101 := add v0, v100&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    store v101, ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    jmp loop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;we ideally do not want to assign a stack slot to the value &lt;code&gt;v0&lt;&#x2F;code&gt; and
then produce machine code like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    add rax, rbx      ;; `v0` stored in `rax`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mov [rsp+16], rax ;; spill `v0` to a stack slot&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;loop:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mov rax, [rsp+16] ;; load `v0` from stack on every iteration -- expensive!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    add rcx, rax&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mov [...], rcx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    jmp loop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;but if we only choose &lt;em&gt;a location&lt;&#x2F;em&gt; per liverange, we either choose a
register, or a stackslot -- no middle ground. Intuitively, it seems
like we should be able to put the value in a different place while it
is &quot;dormant&quot; (spill it to the stack, most likely), then pick an
optimal location during the tight loop. To do so, we need to refer to
the two parts of the liverange separately, and assign each one a
separate location. This is called &lt;em&gt;liverange splitting&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;If we split liveranges, we can then do something like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    add rax, rbx      ;; `v0` stored in `rax`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mov [rsp+16], rax ;; spill `v0` to a stack slot&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mov rdx, [rsp+16] ;; move `v0` from stackslot to a new liverange in `rdx`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;loop:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    add rcx, rdx      ;; no load within loop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mov [...], rcx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    jmp loop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This seems quite powerful and useful -- so why doesn&#x27;t every register
allocator do this? In brief, because it &lt;em&gt;makes the problem much much
harder&lt;&#x2F;em&gt;. When we have a fixed number of liveranges, we have a fixed
amount of work, and we assign a register per liverange, probably in
some priority order. And then we are done.&lt;&#x2F;p&gt;
&lt;p&gt;But as soon as we allow for splitting, we can &lt;em&gt;increase&lt;&#x2F;em&gt; the amount of
work we have almost arbitrarily: we could split every liverange into
many tiny pieces, greatly multiplying the cost of register
allocation. A well-placed split reduces the constraints in the problem
we&#x27;re solving, making it easier, but too many splits just increase
work and also the likelihood that we will unnecessarily move values
around.&lt;&#x2F;p&gt;
&lt;p&gt;Splitting is thus the kind of problem that requires finely-tuned
heuristics. To be concrete, consider the example above: we showed a
split outside of the tight inner loop. But a naive splitting
implementation might just split before the use, putting a move from
stack to register inside the inner loop. Some sort of cost model is
necessary to put splits in &quot;cheap&quot; places.&lt;&#x2F;p&gt;
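&lt;p&gt;A toy version of such a cost model (illustrative only -- regalloc2&#x27;s real heuristics are more involved) might weight each candidate split point by the loop depth at that point, so that the resulting move lands outside hot loops:&lt;&#x2F;p&gt;

```python
def best_split_point(candidates, loop_depth):
    """candidates: program points where a split is legal;
    loop_depth: map from program point to loop-nesting depth there."""
    # A move at depth d runs roughly 10**d times as often, so prefer
    # the shallowest (cheapest) legal point.
    return min(candidates, key=lambda p: 10 ** loop_depth[p])

# Points 0-3 sit before the loop (depth 0); points 4-7 are inside it.
depth = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
print(best_split_point([3, 5], depth))  # 3: the split lands outside the loop
```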
&lt;p&gt;With all of that, hopefully you have some feel for the problem: we
compute liveranges, we might split them, and then we choose where to
put them. That&#x27;s (almost) it -- modulo many tiny details.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;regalloc2-s-design&quot;&gt;regalloc2&#x27;s Design&lt;&#x2F;h2&gt;
&lt;p&gt;At a high level, regalloc2 is a &lt;em&gt;backtracking&lt;&#x2F;em&gt; register allocator that
computes precise liveranges, performs &lt;em&gt;merging&lt;&#x2F;em&gt; according to some
heuristics into &quot;bundles&quot;, and then runs a main loop that assigns
locations to bundles, sometimes &lt;em&gt;splitting&lt;&#x2F;em&gt; them to make the
allocation problem easier (or possible at all). Once every part of
every liverange has a location, it inserts move instructions to
connect all of the pieces.&lt;&#x2F;p&gt;
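&lt;p&gt;In Python pseudocode, that high-level loop might look something like the following (a heavily simplified sketch with names of my own choosing, not regalloc2&#x27;s actual code; real eviction and splitting are replaced here by a spill-to-stack fallback):&lt;&#x2F;p&gt;

```python
import heapq

def allocate(bundles, num_regs):
    """bundles: list of (spill_weight, start, end, name); higher-weight
    bundles are processed first. Returns {name: reg index or "stack"}."""
    # Max-heap by weight, via negated keys.
    queue = [(-w, s, e, n) for (w, s, e, n) in bundles]
    heapq.heapify(queue)
    assigned = {r: [] for r in range(num_regs)}  # reg index -> [(start, end)]
    out = {}
    while queue:
        _w, s, e, n = heapq.heappop(queue)
        for r in range(num_regs):
            # Two ranges overlap unless one ends before the other starts.
            if not any(not (s2 > e or s > e2) for (s2, e2) in assigned[r]):
                assigned[r].append((s, e))
                out[n] = r
                break
        else:
            # regalloc2 would evict a lower-weight bundle or split this
            # one and requeue the halves; we just spill to the stack.
            out[n] = "stack"
    return out

bundles = [(10, 0, 5, "v0"), (5, 1, 3, "v1"), (1, 2, 4, "v2")]
print(allocate(bundles, 2))  # {'v0': 0, 'v1': 1, 'v2': 'stack'}
```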
&lt;p&gt;Let&#x27;s break that down a bit:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;regalloc2 starts with &lt;em&gt;precise liveranges&lt;&#x2F;em&gt;. These are computed
according to the input to the allocator, which is a program that
refers to &lt;em&gt;virtual registers&lt;&#x2F;em&gt; and may be in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Static_single_assignment_form&quot;&gt;SSA&lt;&#x2F;a&gt;
form (one definition per register) or non-SSA (multiple definitions
per register).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;It then &lt;em&gt;merges&lt;&#x2F;em&gt; these liveranges into larger-than-liverange
&quot;bundles&quot;. If done correctly, this reduces work (fewer liverange
bundles to process) and also gives better code (when merged, it is
guaranteed that the related pieces will not need a move instruction
to join them).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;It then builds a priority queue of bundles, and processes them until
every bundle has a location. (In simplified terms, regalloc2&#x27;s
entire job is to &quot;assign locations to bundles&quot;.) This processing may
involve &lt;em&gt;undoing&lt;&#x2F;em&gt;, or &lt;em&gt;backtracking&lt;&#x2F;em&gt;, earlier assignments, and may
also involve &lt;em&gt;splitting&lt;&#x2F;em&gt; bundles into smaller bundles when separate
pieces could attain better allocations.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We&#x27;ll explain each of these design aspects in turn below.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;input-instructions-with-operands&quot;&gt;Input: Instructions with Operands&lt;&#x2F;h3&gt;
&lt;p&gt;First, let&#x27;s talk about the &lt;em&gt;input&lt;&#x2F;em&gt; to the register allocator. To
understand how regalloc2 works, we first need to understand how it
sees the world. (Said another way, before we solve the problem, let&#x27;s
define it fully!)&lt;&#x2F;p&gt;
&lt;p&gt;regalloc2 processes a &quot;program&quot; that consists of instructions that
refer to &lt;em&gt;virtual registers&lt;&#x2F;em&gt; rather than real machine registers. These
instructions are arranged in a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_graph&quot;&gt;control-flow
graph&lt;&#x2F;a&gt; of basic
blocks.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#4&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The most important principle regarding the allocator&#x27;s view of the
program is: the &lt;em&gt;meaning&lt;&#x2F;em&gt; of instructions is mostly
irrelevant. Instead, the allocator cares mainly about how a particular
instruction &lt;em&gt;accesses program values&lt;&#x2F;em&gt; as registers: &lt;em&gt;which&lt;&#x2F;em&gt; values and
&lt;em&gt;how&lt;&#x2F;em&gt; (read or written), and with &lt;em&gt;what constraints&lt;&#x2F;em&gt; on
location. Let&#x27;s look more at the implications of this principle.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;constraints&quot;&gt;Constraints&lt;&#x2F;h4&gt;
&lt;p&gt;The allocator views the input program as a sequence of instructions
that &lt;em&gt;use&lt;&#x2F;em&gt; and &lt;em&gt;define&lt;&#x2F;em&gt; virtual registers. Every access to a register
managed by the regalloc (an &quot;allocatable register&quot;) must be via a
virtual register.&lt;&#x2F;p&gt;
&lt;p&gt;Already we see a divergence from common ISAs like x86: there are
instructions that implicitly access certain registers. (One form of
the x86 integer multiply instruction always places its output in &lt;code&gt;rax&lt;&#x2F;code&gt;
and &lt;code&gt;rdx&lt;&#x2F;code&gt;, for example.) Since these registers are not mentioned by
the instruction explicitly, one might initially think that there is no
need to create regalloc operands or use virtual registers for
them. But these registers (e.g. &lt;code&gt;rax&lt;&#x2F;code&gt; and &lt;code&gt;rdx&lt;&#x2F;code&gt;) can also be used by
explicit inputs and outputs to instructions; so the regalloc at least
needs to know that the registers will be clobbered, and at some later
point presumably the results will be read and the registers become
free again.&lt;&#x2F;p&gt;
&lt;p&gt;We solve this problem by allowing &lt;em&gt;constraints&lt;&#x2F;em&gt; on operands. An
instruction that always reads or writes a specific physical register
still names a virtual-register operand. The only difference from an
ordinary instruction that can use any register is that this operand is
&lt;em&gt;constrained to a particular register&lt;&#x2F;em&gt;. This lets the allocator
uniformly reason about virtual registers allocated to physical
registers as the basic way that space is reserved; the constraint
becomes only a detail of the allocation process.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s take an x86 instruction &lt;code&gt;mul&lt;&#x2F;code&gt; (integer multiply) as an example
to see how this works. Ordinarily, one would write the following in
assembly:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; Multiplicand is implicitly in rax.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mul rcx  ; multiply by rcx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; 128-bit wide result is implicitly placed in rdx (high 64 bits)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; and rax (low 64 bits).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The instruction &lt;code&gt;mul rcx&lt;&#x2F;code&gt; does not tell the whole story from
regalloc2&#x27;s point of view, so we would instead present an instruction
like so to the register allocator, with constraints annotating
uses&#x2F;definitions of virtual registers:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; Put inputs in v0 and v1.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mul v0 [use, fixed rax], v1 [use, any reg], v2 [def, fixed rax], v3 [def, fixed rdx]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; Use results in v2 and v3.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The allocator will &quot;do the right thing&quot; by either inserting moves or
generating inputs directly into, and using outputs directly from, the
appropriate registers. The advantage of this scheme is that aside from
the constraints, it makes &lt;code&gt;mul&lt;&#x2F;code&gt; behave like any other instruction: it
isolates complexity in one place and presents a more uniform,
easier-to-use abstraction for the rest of the compiler.&lt;&#x2F;p&gt;
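&lt;p&gt;One way to picture this (a hypothetical encoding of my own, not regalloc2&#x27;s actual API) is as a list of (virtual register, kind, constraint) operands, plus a checker that a final assignment satisfies every constraint:&lt;&#x2F;p&gt;

```python
ANY = "any"  # the operand may live in any allocatable register

def check(operands, assignment):
    """operands: list of (vreg, kind, constraint) where constraint is
    ANY or the name of a fixed physical register."""
    for vreg, kind, constraint in operands:
        if constraint != ANY and assignment[vreg] != constraint:
            return False
    return True

# x86 mul, as presented to the allocator: v0 and v2 fixed to rax, v3 to rdx.
mul = [("v0", "use", "rax"), ("v1", "use", ANY),
       ("v2", "def", "rax"), ("v3", "def", "rdx")]
print(check(mul, {"v0": "rax", "v1": "rcx", "v2": "rax", "v3": "rdx"}))  # True
print(check(mul, {"v0": "rbx", "v1": "rcx", "v2": "rax", "v3": "rdx"}))  # False
```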
&lt;h4 id=&quot;modify-operands-and-reused-input-constraints&quot;&gt;&quot;Modify&quot; Operands and Reused-Input Constraints&lt;&#x2F;h4&gt;
&lt;p&gt;The next difference we might observe between a real ISA like x86 and a
compiler&#x27;s view of the world is: an operator in a compiler IR usually
produces its result as a &lt;em&gt;completely new value&lt;&#x2F;em&gt;, but real machine
instructions often &lt;em&gt;modify existing values&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For example, on x86, arithmetic operations are written in &lt;em&gt;two-operand
form&lt;&#x2F;em&gt;. They look like &lt;code&gt;add rax, rbx&lt;&#x2F;code&gt;, which means: compute the sum of
&lt;code&gt;rax&lt;&#x2F;code&gt; and &lt;code&gt;rbx&lt;&#x2F;code&gt;, and store the result in &lt;code&gt;rax&lt;&#x2F;code&gt;, overwriting that input
value.&lt;&#x2F;p&gt;
&lt;p&gt;The register allocator reasons about segments of value dataflow from
definitions (defs) to uses; but the use of &lt;code&gt;rax&lt;&#x2F;code&gt; in this example seems
to be neither. Or rather, it is both: it consumes a value in &lt;code&gt;rax&lt;&#x2F;code&gt;,
and it produces a value in &lt;code&gt;rax&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;But we can&#x27;t decompose it into a separate use and def either, because
then the allocator might choose different locations for each. The
encoding of &lt;code&gt;add rax, rbx&lt;&#x2F;code&gt; only has slots for two register names: the
input in &lt;code&gt;rax&lt;&#x2F;code&gt; and output in &lt;code&gt;rax&lt;&#x2F;code&gt; must be in the same register!&lt;&#x2F;p&gt;
&lt;p&gt;We solve this by introducing a new kind of constraint: the &quot;reused
input register&quot; constraint.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#5&quot;&gt;5&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; At the regalloc level, we say that the
&lt;code&gt;add&lt;&#x2F;code&gt; above has &lt;em&gt;three&lt;&#x2F;em&gt; operands: two inputs (uses) and an output (a
def). It exactly corresponds to the compiler IR-level operator, with
nicely separated values in different virtual registers. But, we
constrain the output by indicating that it &lt;em&gt;must be placed in the same
register as the input&lt;&#x2F;em&gt;. We can assert that this is the case when we
get final assignments from the regalloc, then emit that register
number into the &quot;first source and also destination&quot; slot of the
instruction.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#6&quot;&gt;6&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;So, instead of &lt;code&gt;add rax, rbx&lt;&#x2F;code&gt; (or &lt;code&gt;add v0, v1&lt;&#x2F;code&gt; with &lt;code&gt;v0&lt;&#x2F;code&gt; a &quot;modify&quot;
operand), we can present the following 3-operand instruction to the
register allocator:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    add v0 [use, any reg], v1 [use, any reg], v2 [def, reuse-input(0)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This corresponds more closely to what the compiler IR describes,
namely a new value for the result of the add and non-destructive uses
of both operands. All of the complexity of saving the destructive
source if needed is pushed to the allocator itself.&lt;&#x2F;p&gt;
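&lt;p&gt;A sketch of the emission side (format and names illustrative): because the allocator guarantees that the output landed in the same register as the reused input, the backend can assert that fact and emit the destructive two-operand form directly:&lt;&#x2F;p&gt;

```python
def emit_two_operand(op, dst, src0, src1):
    # reuse-input(0): the def was constrained to the same register as
    # the first use, so dst and src0 must coincide by construction.
    assert dst == src0, "allocator violated reuse-input constraint"
    return f"{op} {dst}, {src1}"

print(emit_two_operand("add", "rax", "rax", "rbx"))  # add rax, rbx
```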
&lt;h4 id=&quot;program-points-early-and-late-operands&quot;&gt;Program Points, &quot;Early&quot; and &quot;Late&quot; Operands&lt;&#x2F;h4&gt;
&lt;p&gt;Finally, we need to go a bit deeper on what exactly it means to
allocate a register &quot;at&quot; an instruction. To see why there may be some
subtlety, let&#x27;s consider an example. Take the instruction &lt;code&gt;movzx&lt;&#x2F;code&gt; on
x86: this instruction does a 16-to-64-bit zero-extend, with one input
and one output. In pre-regalloc form with virtual registers, we could
write &lt;code&gt;movzx v1, v0&lt;&#x2F;code&gt;, reading an input in &lt;code&gt;v0&lt;&#x2F;code&gt; and putting the output
in &lt;code&gt;v1&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;An intuitive understanding of liveranges and the allocation problem
might lead us to reason: both &lt;code&gt;v0&lt;&#x2F;code&gt; and &lt;code&gt;v1&lt;&#x2F;code&gt; are &quot;live&quot; at this
instruction, so they overlap, and have to be placed in different
registers.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                          v0    v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                           :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v1 := movzx v0         |     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But an experienced assembly programmer, knowing that &lt;code&gt;v0&lt;&#x2F;code&gt; is not used
again after this instruction, might reuse its register for the
output. So for example, if it were initially in &lt;code&gt;r13&lt;&#x2F;code&gt;, one might write
&lt;code&gt;movzx r13, r13w&lt;&#x2F;code&gt; (the &lt;code&gt;r13w&lt;&#x2F;code&gt; is x86&#x27;s archaic way of writing &quot;the 16
bit version of &lt;code&gt;r13&lt;&#x2F;code&gt;&quot;).&lt;&#x2F;p&gt;
&lt;p&gt;But isn&#x27;t this an invalid assignment, because we have put two
liveranges in the same register &lt;code&gt;r13&lt;&#x2F;code&gt; when they are both &quot;live&quot; at
this particular instruction?&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that this will work fine, for a subtle reason: generally
instructions read all of their inputs, &lt;em&gt;then&lt;&#x2F;em&gt; write all of their
outputs. In other words, there is a sort of two-phase semantics to
most instructions. So we could say that the input &lt;code&gt;v0&lt;&#x2F;code&gt; is live up to,
and including, the &quot;read&quot; or &quot;early&quot; phase of this instruction, and
the output &lt;code&gt;v1&lt;&#x2F;code&gt; is live starting at the &quot;write&quot; or &quot;late&quot; phase of
this instruction. These two liveranges don&#x27;t conflict at all! So the
above figure showing liveranges overlapping becomes:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                          v0    v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                           :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   EARLY   |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v1 := movzx v0               &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   LATE         |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;regalloc2 (along with most other register allocators) thus has a
notion of &quot;when&quot; an operand occurs -- the &quot;operand position&quot; -- and it
calls these two points in an instruction &lt;code&gt;Early&lt;&#x2F;code&gt; and &lt;code&gt;Late&lt;&#x2F;code&gt;. Along
with this, throughout the allocator, we name program points (distinct
points at which allocations are made) as &lt;code&gt;Before&lt;&#x2F;code&gt; or &lt;code&gt;After&lt;&#x2F;code&gt; a given
instruction.&lt;&#x2F;p&gt;
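As a concrete sketch of this numbering: assuming instructions are indexed densely from 0, the two per-instruction points can be packed into a single integer so that points are totally ordered. The names here are illustrative, not regalloc2's exact API.

```rust
// Illustrative program-point encoding: each instruction index gets a
// Before (even) and After (odd) point, packed into one integer.
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct ProgPoint(u32);

impl ProgPoint {
    fn before(inst: u32) -> Self { ProgPoint(inst << 1) }
    fn after(inst: u32) -> Self { ProgPoint((inst << 1) | 1) }
    fn inst(self) -> u32 { self.0 >> 1 }
}

fn main() {
    // Before(5) < After(5) < Before(6): an input live only until Before(5)
    // never overlaps an output defined at After(5).
    assert!(ProgPoint::before(5) < ProgPoint::after(5));
    assert!(ProgPoint::after(5) < ProgPoint::before(6));
    assert_eq!(ProgPoint::after(7).inst(), 7);
}
```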
&lt;p&gt;One final bit of subtlety: when a single instruction from a regalloc
point of view actually emits multiple instructions at the machine
level, sometimes the usual phasing of reads and writes breaks
down. For example, maybe a pseudoinstruction becomes a sequence that
starts to write outputs before it has read all of its inputs. In such
a case, reusing one of the inputs (which is no longer live at &lt;code&gt;Late&lt;&#x2F;code&gt;)
as an output register could be catastrophic. For this reason,
regalloc2 &lt;em&gt;decouples&lt;&#x2F;em&gt; an operand&#x27;s position from its kind (use or
def). One could have an &quot;early def&quot; or a &quot;late use&quot;. Temporary
registers are also possible: these are live during both the early and late
points of an instruction, so the allocator must assign them a register
distinct from every input and output; they can then serve as scratch space
in the multi-instruction sequences emitted from one pseudoinstruction.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;regalloc2-s-view-of-operands&quot;&gt;regalloc2&#x27;s View of Operands&lt;&#x2F;h4&gt;
&lt;p&gt;To summarize, each instruction can have a sequence of &quot;operands&quot;, each
of which:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Names a &lt;em&gt;virtual register&lt;&#x2F;em&gt; that corresponds to a value in the
original program;&lt;&#x2F;li&gt;
&lt;li&gt;Indicates whether this value is read (&quot;used&quot;) or written
(&quot;defined&quot;);&lt;&#x2F;li&gt;
&lt;li&gt;Indicates when during the instruction execution the value is
accessed (&quot;early&quot;, before the instruction executes; or &quot;late&quot;, after
it does);&lt;&#x2F;li&gt;
&lt;li&gt;Indicates where the value should be placed: in a machine register of
a certain kind, or a specific machine register, or in the same
register that another operand took, or in a slot in the stack frame.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
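This summary can be sketched as a small set of Rust types. This is a hypothetical simplified model for illustration; regalloc2's real `Operand` packs the same information into a bit-packed `u32`.

```rust
// Illustrative (not regalloc2's actual) operand model.
#[derive(Copy, Clone, Debug, PartialEq)]
struct VReg(u32);

#[derive(Copy, Clone, Debug, PartialEq)]
enum Kind { Use, Def }

#[derive(Copy, Clone, Debug, PartialEq)]
enum Pos { Early, Late }

#[derive(Copy, Clone, Debug, PartialEq)]
enum Constraint {
    AnyReg,            // any register of the appropriate class
    FixedReg(u8),      // a specific physical register
    ReuseInput(usize), // the same register operand i received
    Stack,             // a slot in the stack frame
}

#[derive(Copy, Clone, Debug, PartialEq)]
struct Operand {
    vreg: VReg,
    kind: Kind,
    pos: Pos,
    constraint: Constraint,
}

fn main() {
    // The 3-operand add from earlier: two early uses in any register, and
    // a late def constrained to reuse input 0's register.
    let add = [
        Operand { vreg: VReg(0), kind: Kind::Use, pos: Pos::Early, constraint: Constraint::AnyReg },
        Operand { vreg: VReg(1), kind: Kind::Use, pos: Pos::Early, constraint: Constraint::AnyReg },
        Operand { vreg: VReg(2), kind: Kind::Def, pos: Pos::Late, constraint: Constraint::ReuseInput(0) },
    ];
    assert_eq!(add[2].constraint, Constraint::ReuseInput(0));
}
```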
&lt;h3 id=&quot;stage-1-live-ranges&quot;&gt;Step 1: Live Ranges&lt;&#x2F;h3&gt;
&lt;p&gt;We&#x27;ve described what the register allocator expects as its input. Now
let&#x27;s talk about how the input is processed into an &quot;allocation
problem&quot; that can be solved by the main algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;The input is described as a graph of blocks of instructions, with
operands; but most of the allocator works in terms of &lt;em&gt;liveranges&lt;&#x2F;em&gt; and
&lt;em&gt;bundles of liveranges&lt;&#x2F;em&gt; instead.&lt;&#x2F;p&gt;
&lt;p&gt;In brief, a &lt;em&gt;liverange&lt;&#x2F;em&gt; (originally &quot;live range&quot;, but we say it so
often it has become one word!) is a span of program points -- that is,
a range of &quot;before&quot; and &quot;after&quot; points on instructions -- that
connects a program value in a virtual register from a definition to
one or more uses. A liverange represents one unit of needed storage,
either as a register or a slot in the stackframe.&lt;&#x2F;p&gt;
&lt;p&gt;The basic strategy of regalloc2 is to reduce the input into liveranges
as soon as possible, and then operate mostly on liveranges,
translating back to program terms (inserted moves and assigned
registers per instruction) only at the very end of the process. This
lets us reason about a simpler &quot;core problem&quot; that is actually quite
concisely specified:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A liverange is a span of program points, which can be numbered
consecutively;&lt;&#x2F;li&gt;
&lt;li&gt;A liverange has constraints at certain points that arise from
program uses&#x2F;defs;&lt;&#x2F;li&gt;
&lt;li&gt;We must assign locations to liveranges, such that:
&lt;ul&gt;
&lt;li&gt;At any point, at most one liverange lives in a given location;&lt;&#x2F;li&gt;
&lt;li&gt;At all points, a liverange&#x27;s constraints are satisfied;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;We are allowed to split a liverange into pieces and assign different
locations to each piece. However, moves between pieces have a cost,
and we must minimize cost.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;And that&#x27;s it! No need to reason about ISA specifics, or the way that
regalloc2 generates moves, or anything else. We&#x27;ll worry about
generating moves to &quot;reify&quot; (make real) the assignments later. For
now, we just need to slot the liveranges into locations and avoid
conflicts.&lt;&#x2F;p&gt;
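A minimal model of this core problem, assuming half-open spans over consecutively numbered program points, might look like:

```rust
// Half-open spans of program points; two liveranges may share a location
// exactly when their spans do not overlap.
#[derive(Copy, Clone, Debug)]
struct LiveRange { from: u32, to: u32 } // covers [from, to)

fn overlaps(a: LiveRange, b: LiveRange) -> bool {
    a.from < b.to && b.from < a.to
}

fn main() {
    let v0 = LiveRange { from: 0, to: 10 };
    let v1 = LiveRange { from: 10, to: 20 }; // begins exactly where v0 ends
    let v2 = LiveRange { from: 5, to: 15 };
    assert!(!overlaps(v0, v1)); // may be assigned the same register
    assert!(overlaps(v0, v2));  // must get distinct locations
}
```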
&lt;h4 id=&quot;computing-liveness&quot;&gt;Computing Liveness&lt;&#x2F;h4&gt;
&lt;p&gt;To build up our set of liveranges, we first need to compute
&lt;em&gt;liveness&lt;&#x2F;em&gt;. This is a property of any particular virtual register at a
program point indicating that it has a value that will eventually be
used.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Live_variable_analysis&quot;&gt;Liveness
analysis&lt;&#x2F;a&gt; is an
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Data-flow_analysis&quot;&gt;iterative dataflow
analysis&lt;&#x2F;a&gt; that is
computed in the backward direction: any use of a virtual register
propagates liveness backward (&quot;upward&quot; in the program), and a
definition of that virtual register&#x27;s value ends the liveness (when
scanning upward), because the old value (from above) is no longer
relevant.&lt;&#x2F;p&gt;
&lt;p&gt;Thus the first thing that regalloc2 does with the input program is to
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;33611a68b90e40869bba52934449315a8f4e5477&#x2F;src&#x2F;ion&#x2F;liveranges.rs#L318-L319&quot;&gt;run a worklist algorithm to compute precise
liveness&lt;&#x2F;a&gt;. This
produces a bitset&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#7&quot;&gt;7&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; that, at each basic block entry and exit, gives
us the set of live virtual registers.&lt;&#x2F;p&gt;
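The worklist computation can be sketched as the classic backward fixpoint, live-in = uses + (live-out - defs), where live-out is the union of the successors' live-in sets. This sketch uses `HashSet`s and a simple iterate-until-stable loop for clarity where regalloc2 uses compact bitsets and a true worklist; the `Block` type is illustrative.

```rust
use std::collections::HashSet;

// Illustrative block summary: `uses` are vregs read before any local def,
// `defs` are vregs written, `succs` are successor block indices.
struct Block { uses: HashSet<u32>, defs: HashSet<u32>, succs: Vec<usize> }

// Backward iterative liveness. The live-in sets only ever grow, so
// iterating to a fixpoint terminates.
fn liveness(blocks: &[Block]) -> Vec<HashSet<u32>> {
    let mut live_in: Vec<HashSet<u32>> = vec![HashSet::new(); blocks.len()];
    let mut changed = true;
    while changed {
        changed = false;
        for b in (0..blocks.len()).rev() {
            // live-out is the union of the successors' live-in sets.
            let mut live: HashSet<u32> = HashSet::new();
            for &s in &blocks[b].succs {
                live.extend(live_in[s].iter().copied());
            }
            // Transfer function: in = uses + (out - defs).
            for d in &blocks[b].defs { live.remove(d); }
            live.extend(blocks[b].uses.iter().copied());
            if live != live_in[b] {
                live_in[b] = live;
                changed = true;
            }
        }
    }
    live_in
}

fn main() {
    // Block 0 defines v0 and falls through to block 1, which uses it.
    let blocks = vec![
        Block { uses: HashSet::new(), defs: HashSet::from([0]), succs: vec![1] },
        Block { uses: HashSet::from([0]), defs: HashSet::new(), succs: vec![] },
    ];
    let live_in = liveness(&blocks);
    assert!(live_in[1].contains(&0));  // v0 is live into block 1...
    assert!(!live_in[0].contains(&0)); // ...but not into block 0, where it is defined.
}
```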
&lt;p&gt;Once we know which registers are live into and out of each basic
block, we can &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;33611a68b90e40869bba52934449315a8f4e5477&#x2F;src&#x2F;ion&#x2F;liveranges.rs#L413-L419&quot;&gt;perform block-local
processing&lt;&#x2F;a&gt;
to compute actual liveranges with each use of the register properly
noted. This is another backward scan, but this time we &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;33611a68b90e40869bba52934449315a8f4e5477&#x2F;src&#x2F;ion&#x2F;liveranges.rs#L176-L195&quot;&gt;build the data
structures&lt;&#x2F;a&gt;
we&#x27;ll use for the rest of the allocation process.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;normalization-and-saving-fixups-for-later&quot;&gt;Normalization, and Saving Fixups for Later&lt;&#x2F;h4&gt;
&lt;p&gt;We mentioned above that one way to see the liverange-building step is
as a simplification of the problem to its core essence, in order to
more easily solve it. &quot;Ranges that may overlap&quot; is certainly simpler
than &quot;instructions that access registers with certain
semantics&quot;. However, even the constraints on the liveranges can be
made simpler in several ways.&lt;&#x2F;p&gt;
&lt;p&gt;A good example of a complex set of constraints is the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    inst v0 [use, fixed r0], v0 [use, fixed r1], v1 [def, any reg]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is an instruction that has two inputs, and takes the inputs in
fixed physical registers &lt;code&gt;r0&lt;&#x2F;code&gt; and &lt;code&gt;r1&lt;&#x2F;code&gt;. This is completely reasonable
and such instructions exist in real ISAs (see, e.g., x86&#x27;s integer
divide instruction, with inputs in &lt;code&gt;rdx&lt;&#x2F;code&gt; and &lt;code&gt;rax&lt;&#x2F;code&gt;, or a call with an
ABI that passes arguments in fixed registers). If the two inputs
happen to be given the same program value, here virtual register &lt;code&gt;v0&lt;&#x2F;code&gt;,
then we have created an impossible constraint: we require &lt;code&gt;v0&lt;&#x2F;code&gt; to be
in &lt;em&gt;both&lt;&#x2F;em&gt; &lt;code&gt;r0&lt;&#x2F;code&gt; and &lt;code&gt;r1&lt;&#x2F;code&gt; at the same time.&lt;&#x2F;p&gt;
&lt;p&gt;As we have formulated the problem, a liverange is in only one place at
a time; and in fact this is a very useful simplifying invariant, and a
simpler model than &quot;there are N copies of the virtual register at
once&quot; (which one(s) are up-to-date, if we allow multiple defs?).&lt;&#x2F;p&gt;
&lt;p&gt;We can &quot;simplify to a previously solved problem&quot; in this case with a
neat trick: we keep a side-list of &quot;fixup moves&quot; to add back in, after
we complete allocation, and we insert such a fixup move from &lt;code&gt;r0&lt;&#x2F;code&gt; to
&lt;code&gt;r1&lt;&#x2F;code&gt; just before this instruction. Then we &lt;em&gt;delete the constraint&lt;&#x2F;em&gt; on
the second operand that uses &lt;code&gt;v0&lt;&#x2F;code&gt;. The rest of the allocation will
proceed as if &lt;code&gt;v0&lt;&#x2F;code&gt; were only required in &lt;code&gt;r0&lt;&#x2F;code&gt;; it will end up in that
location; and the fixup move will copy it to &lt;code&gt;r1&lt;&#x2F;code&gt; as well.&lt;&#x2F;p&gt;
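The fixup-move trick can be sketched as follows. The types and the exact way the second constraint is relaxed are illustrative, not regalloc2's real representation.

```rust
// Illustrative model: the same vreg required in two different fixed
// registers is an impossible constraint; record a fixup move r_a -> r_b
// and relax the second operand's constraint instead.
#[derive(Copy, Clone, PartialEq, Debug)]
enum Constraint { AnyReg, FixedReg(u8) }

#[derive(Copy, Clone, Debug)]
struct Operand { vreg: u32, constraint: Constraint }

// Returns (from_reg, to_reg) fixup moves to insert just before the
// instruction, after rewriting the conflicting constraints in place.
fn normalize(ops: &mut [Operand]) -> Vec<(u8, u8)> {
    let mut fixups = Vec::new();
    for i in 0..ops.len() {
        for j in (i + 1)..ops.len() {
            if ops[i].vreg != ops[j].vreg { continue; }
            if let (Constraint::FixedReg(a), Constraint::FixedReg(b)) =
                (ops[i].constraint, ops[j].constraint)
            {
                if a != b {
                    fixups.push((a, b));
                    // The rest of allocation sees only the first constraint.
                    ops[j].constraint = Constraint::AnyReg;
                }
            }
        }
    }
    fixups
}

fn main() {
    // inst v0 [use, fixed r0], v0 [use, fixed r1], v1 [def, any reg]
    let mut ops = vec![
        Operand { vreg: 0, constraint: Constraint::FixedReg(0) },
        Operand { vreg: 0, constraint: Constraint::FixedReg(1) },
        Operand { vreg: 1, constraint: Constraint::AnyReg },
    ];
    let fixups = normalize(&mut ops);
    assert_eq!(fixups, vec![(0, 1)]); // copy r0 -> r1 before the instruction
    assert_eq!(ops[1].constraint, Constraint::AnyReg);
}
```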
&lt;p&gt;We perform a similar rewrite for reused-input constraints. These seem
as if they would be fairly fundamental to the core allocation loop,
because they tie one decision to another; now we have to deal with
dependent allocation decisions. But we can do a simpler thing: we
&lt;em&gt;edit the liveranges&lt;&#x2F;em&gt; so that (i) the output that reuses the input has
a liverange that starts at the &lt;em&gt;early&lt;&#x2F;em&gt; (input) phase, and (ii) the
input has a liverange that ends just before the instruction, not
overlapping. (In other words, we shift both back by one program
point.) Then we insert a fixup move from input to output. The figure
below illustrates this rewrite.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      INITIAL&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         v0   v1    v2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         :    :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         |    |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  EARLY use  use&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      add v2, v0, v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  LATE             def reuse(0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                     :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                     &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      REWRITTEN&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         v0   v1    v2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         :    :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         |    |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         |    |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  LATE   |    |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                              |    (implicit copy: v2 := v0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  EARLY      use   def&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      add v2, v0, v1                 |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  LATE               |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                     :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
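The rewrite in the figure amounts to a pair of liverange edits, shown here using a 2*inst (Before/Early) and 2*inst + 1 (After/Late) point numbering; all names are illustrative.

```rust
// Half-open liveranges over a 2*inst (Early) / 2*inst + 1 (Late) point
// numbering. The reused-input rewrite shifts the def back to the Early
// point and truncates the input to end just before the instruction; a
// fixup move from input to output is recorded separately.
#[derive(Debug, PartialEq)]
struct LiveRange { from: u32, to: u32 } // covers [from, to)

fn rewrite_reuse(input: &mut LiveRange, output: &mut LiveRange, inst: u32) {
    let early = 2 * inst;
    output.from = early; // def now also covers the instruction's Early point
    input.to = early;    // input now ends just before the instruction
}

fn main() {
    // v0 is originally live through the Late point of instruction 5, where
    // the reused-input def of v2 begins.
    let mut v0 = LiveRange { from: 0, to: 11 };
    let mut v2 = LiveRange { from: 11, to: 20 };
    rewrite_reuse(&mut v0, &mut v2, 5);
    assert_eq!(v0.to, 10);
    assert_eq!(v2.from, 10);
    assert!(v0.to <= v2.from); // the two ranges no longer overlap
}
```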
&lt;p&gt;One may object that this pessimizes all reused-input allocations --
haven&#x27;t we removed all knowledge of the constraint, so we will almost
always get different registers at input and output, and cause many new
moves to be inserted? The answer to this issue comes in the &lt;em&gt;bundle
merging&lt;&#x2F;em&gt;, which we discuss below (basically, we try to rejoin the two
parts if no overlap would result).&lt;&#x2F;p&gt;
&lt;p&gt;In general, this is a powerful technique: whenever some complexity
arises from a constraint or feature, it is best if the complexity can
be kept as close to the &lt;em&gt;outer boundary&lt;&#x2F;em&gt; of the system as
possible. Rewrites or lowerings into a simpler &quot;core form&quot; are common
in compilers, and it so happens that considering regalloc constraints
in this light is useful too.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#8&quot;&gt;8&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;step-2-bundles-and-merging&quot;&gt;Step 2: Bundles and Merging&lt;&#x2F;h3&gt;
&lt;p&gt;Once we have created a list of liveranges with constraints, we could
in theory begin to assign locations right away, finding available
locations that fulfill constraints and splitting where necessary to do
so. However, such an approach would almost certainly run more slowly,
and produce worse code, than most state-of-the-art allocators
today. Why is that?&lt;&#x2F;p&gt;
&lt;p&gt;A key observation about liveranges in real programs is that there are
&lt;em&gt;clusters of related liveranges connected by moves&lt;&#x2F;em&gt;. Several examples
are the liveranges on either side of an SSA block parameter (or
phi-node), or on either side of a move instruction, or the input and
reused-register-constrained output of an instruction.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#9&quot;&gt;9&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; These
liveranges often would benefit if they were in the same register: in
all three cases, it would mean one fewer move instruction in the final
program.&lt;&#x2F;p&gt;
&lt;p&gt;Processing such related liveranges together, as one unit of
allocation, would guarantee that they would be assigned the same
location. (If impossible, the merged liveranges could always be split
again.) Attaining this result some other way would require reasoning
about &quot;affinity&quot; for locations between related liveranges, which is a
much more complex question.&lt;&#x2F;p&gt;
&lt;p&gt;Furthermore, processing multiple liveranges together brings all the
usual efficiency benefits of batching: the more progress we can make
with a single decision, the faster the register allocator runs.&lt;&#x2F;p&gt;
&lt;p&gt;We thus define a &quot;bundle&quot; of liveranges as the unit of
allocation. After computing liveranges in the initial input program
scan, we merge liveranges into bundles according to a few simple
heuristics: across SSA block parameters, across move instructions, and
from inputs to outputs of instructions with &quot;reused-input&quot; constraints.&lt;&#x2F;p&gt;
&lt;p&gt;The one key invariant is: all liveranges in a bundle &lt;em&gt;must not
overlap&lt;&#x2F;em&gt;. We greedily grow a bundle with the above heuristics, testing
at each step whether another liverange can join.&lt;&#x2F;p&gt;
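A sketch of that merge test, assuming each bundle keeps its half-open liveranges sorted by start point:

```rust
// Two bundles may merge only if interleaving their (sorted, half-open)
// liveranges produces no overlap; a merge-walk over the two sorted lists
// checks this in linear time.
fn can_merge(a: &[(u32, u32)], b: &[(u32, u32)]) -> bool {
    let (mut i, mut j) = (0, 0);
    let mut last_end = 0;
    while i < a.len() || j < b.len() {
        // Take whichever list's next range starts first.
        let next = if j == b.len() || (i < a.len() && a[i].0 <= b[j].0) {
            let r = a[i]; i += 1; r
        } else {
            let r = b[j]; j += 1; r
        };
        if next.0 < last_end {
            return false; // overlaps a previously taken range
        }
        last_end = last_end.max(next.1);
    }
    true
}

fn main() {
    // Ranges that interlock exactly (e.g., on either side of a move) can
    // merge; genuinely overlapping ranges cannot.
    assert!(can_merge(&[(0, 5), (10, 15)], &[(5, 10)]));
    assert!(!can_merge(&[(0, 5)], &[(3, 8)]));
}
```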
&lt;p&gt;Beyond this point in the allocation process, we will reason about
bundles: we enqueue them in the priority workqueue, we process them
one at a time and assign locations or split. At the end of the
process, we&#x27;ll scan the liveranges in the bundle and assign each the
location that the bundle received.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    CORE ALLOCATION PROBLEM:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         bundle0              bundle1               bundle2          bundle3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           |                    |                                       |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           |                    |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                |                     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                |                     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           |                                                            |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ==&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           bundle0: r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           bundle1: r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           bundle2: r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           bundle3: r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;step-3-assignment-loop-and-splitting-heuristics&quot;&gt;Step 3: Assignment Loop and Splitting Heuristics&lt;&#x2F;h3&gt;
&lt;p&gt;The heart of the allocator is the main loop that &lt;em&gt;allocates locations
to bundles&lt;&#x2F;em&gt;. This is at least conceptually simple: pull a bundle off
the queue, &quot;probe&quot; potential locations one at a time to see whether the
bundle fits (i.e., has no overlap with spans for which that location is
already reserved), and assign it the first place it fits. But there is
significant complexity in the details, as always.&lt;&#x2F;p&gt;
&lt;p&gt;The key data structures are: (i) an &quot;allocation map&quot; for each physical
register, kept as a BTree for fast lookups, that indicates whether the
register is free or occupied at any program point and the liverange
that occupies it; and (ii) a queue of bundles to process. (The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;doc&#x2F;DESIGN.md&quot;&gt;design
document&lt;&#x2F;a&gt;
describes several others, such as the second-chance allocation queue
and the structures used for stackslots, which we skip here for
simplicity.)&lt;&#x2F;p&gt;
&lt;p&gt;The core part of the allocator&#x27;s processing occurs here: we pull one
bundle at a time from the queue and attempt to place it in one of the
registers (again we&#x27;re ignoring stackslot constraints for simplicity).&lt;&#x2F;p&gt;
&lt;p&gt;For each bundle, we can perform one of the following actions:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If we find a register with no overlapping allocations already in
place, we can allocate the bundle to the register; then we&#x27;re done!
This is the best case.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Otherwise, we can pick a register occupied by bundles with a lower
&quot;spill cost&quot; (computed as a sum of heuristic values for each
use of a liverange in a bundle) and &lt;em&gt;evict&lt;&#x2F;em&gt; those already-allocated
bundles, punting them back to the queue, then put our present bundle
in this register instead. We do this only if the present bundle has
a higher spill cost.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;If this is also not an option, we can split our present bundle into
pieces and try again. Heuristically, we find it works well to split
at the first conflict point; in other words, allocate as much as
would have fit in any register, and then put the remainder back in
the queue.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   TO ALLOCATE:                  GIVEN:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       bundle0                     r0     r1    r2       r3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                          |b1          |b4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                          |            |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                                       |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |b2          |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |            |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                                 |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                               |b3     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                               |       |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    OPTION 1: Take a free register (r3)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        - Possible if no overlap. Easiest option!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    OPTION 2: Evict, if bundle0&amp;#39;s spill cost is higher than evicted bundles&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              and if no completely free register exists:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       bundle1  bundle2            r0     r1    r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                          |b0          |b4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                          |            |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |            |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 |                               |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 |                               |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                                 |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |b0  |b3     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |    |       |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (b1 and b2 are re-enqueued)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    OPTION 3: Split!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                      &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                       &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                   r0     r1    r2 &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                              --&amp;gt;   |b1   |b0    |b4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |     |      |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                          |      |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |b2          |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |            |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                                 |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                              --&amp;gt;   |b0  |b3     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |    |       |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
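The decision structure of this loop can be sketched as follows. This is a heavily simplified model: real probing consults a per-register BTree, victims are ranked more carefully, and the split point is clamped into the bundle's own span; all of the types here are illustrative.

```rust
// Illustrative bundle: a spill weight and sorted half-open ranges.
struct Bundle { weight: f32, ranges: Vec<(u32, u32)> }

// Illustrative allocation map per physical register:
// (from, to, occupying bundle id, that bundle's weight).
type RegMap = Vec<(u32, u32, usize, f32)>;

#[derive(Debug, PartialEq)]
enum Decision {
    Assign(usize),            // register index
    Evict(usize, Vec<usize>), // register index, victim bundle ids
    Split(u32),               // first conflict point
}

fn overlaps(a: (u32, u32), b: (u32, u32)) -> bool { a.0 < b.1 && b.0 < a.1 }

fn decide(bundle: &Bundle, regs: &[RegMap]) -> Decision {
    let mut best_evict: Option<(usize, f32, Vec<usize>)> = None;
    let mut first_conflict = u32::MAX;
    for (preg, map) in regs.iter().enumerate() {
        let mut victims = Vec::new();
        let mut max_weight = 0.0f32;
        for &(from, to, id, w) in map {
            if bundle.ranges.iter().any(|&r| overlaps(r, (from, to))) {
                victims.push(id);
                max_weight = max_weight.max(w);
                first_conflict = first_conflict.min(from);
            }
        }
        if victims.is_empty() {
            return Decision::Assign(preg); // Option 1: a conflict-free register
        }
        victims.sort();
        victims.dedup();
        if best_evict.as_ref().map_or(true, |&(_, w, _)| max_weight < w) {
            best_evict = Some((preg, max_weight, victims));
        }
    }
    // Option 2: evict cheaper bundles, only if our weight strictly dominates.
    if let Some((preg, w, victims)) = best_evict {
        if bundle.weight > w {
            return Decision::Evict(preg, victims);
        }
    }
    // Option 3: split at the first conflict and re-enqueue the pieces.
    Decision::Split(first_conflict)
}

fn main() {
    let bundle = Bundle { weight: 2.0, ranges: vec![(0, 10)] };
    let regs = vec![vec![(5, 8, 7, 1.0)], vec![]];
    assert_eq!(decide(&bundle, &regs), Decision::Assign(1));
    let regs = vec![vec![(5, 8, 7, 1.0)]];
    assert_eq!(decide(&bundle, &regs), Decision::Evict(0, vec![7]));
    let cheap = Bundle { weight: 0.5, ranges: vec![(0, 10)] };
    assert_eq!(decide(&cheap, &regs), Decision::Split(5));
}
```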
&lt;p&gt;The presence of &lt;em&gt;eviction&lt;&#x2F;em&gt; as an option is what makes regalloc2 a
&lt;em&gt;backtracking&lt;&#x2F;em&gt; allocator. It&#x27;s not obvious that the allocator will
ever finish its job if it is allowed to undo work. In fact &lt;em&gt;many&lt;&#x2F;em&gt;
bundles may be evicted in order to place just &lt;em&gt;one&lt;&#x2F;em&gt; bundle --
isn&#x27;t this backward progress?&lt;&#x2F;p&gt;
&lt;p&gt;The key to maintaining forward progress is that we &lt;em&gt;only evict bundles
of lower spill weight&lt;&#x2F;em&gt;, together with the fact that &lt;em&gt;spill weight
monotonically decreases when splitting&lt;&#x2F;em&gt;. Eventually, if bad luck
continues far enough, a bundle will be split into individual pieces
around each use, and these can always be allocated because (if the
input program does not have fundamentally conflicting constraints on
one instruction) these single-use bundles have the lowest possible
spill weight.&lt;&#x2F;p&gt;
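A tiny illustration of why splitting can only decrease spill weight, assuming (as a simplification of the real heuristic) that a bundle's weight is the sum of nonnegative per-use weights:

```rust
// Splitting partitions a bundle's uses between the pieces, so no piece
// can weigh more than the whole: weight is monotone under splitting.
fn spill_weight(use_weights: &[f32]) -> f32 {
    use_weights.iter().sum()
}

fn main() {
    let uses = [3.0, 1.0, 2.5, 0.5];
    let total = spill_weight(&uses);
    let (a, b) = uses.split_at(2); // split the bundle between two uses
    assert!(spill_weight(a) <= total);
    assert!(spill_weight(b) <= total);
    // Nothing is gained or lost by the split itself:
    assert!((spill_weight(a) + spill_weight(b) - total).abs() < 1e-6);
}
```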
&lt;h3 id=&quot;step-4-move-handling&quot;&gt;Step 4: Move Handling&lt;&#x2F;h3&gt;
&lt;p&gt;Finally, once we have a series of locations assigned to each bundle,
we have &quot;solved the problem&quot;, but... we still need to convey our
solution back to the real world, where a compiler is waiting for us to
provide a series of move, load, and store instructions to place values
into the right spots.&lt;&#x2F;p&gt;
&lt;p&gt;We split the overall problem into two pieces for the usual simplicity
reasons: first, we allow ourselves to cut liveranges into as many
pieces as needed, and put each piece in a different place, at a single
instruction granularity. We assume that we can edit the program
somehow to connect these pieces back up. That allowed the above
liverange&#x2F;bundle processing to become a tractable problem for a solver
core to handle. Now, we need to connect those liverange fragments. This
is the second half of the problem: generating moves.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;all-in-one-liverange-connectors-program-moves-and-edge-moves&quot;&gt;All-in-One: Liverange Connectors, Program Moves, and Edge Moves&lt;&#x2F;h4&gt;
&lt;p&gt;The abstract model for the input to this stage of the allocator is
that between each pair of instructions, we perform some &lt;em&gt;arbitrary
permutation&lt;&#x2F;em&gt; of liveranges in locations. One way to see this
permutation is as a &lt;em&gt;parallel move&lt;&#x2F;em&gt;: a data-movement action that reads
values in all of their old locations (inputs of the permutation), then
in parallel, writes the values to all of their new locations (outputs
of the permutation).&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        EARLY&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    inst1      r2, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        LATE&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          { r4 := r0 }              &amp;lt;--- regalloc-inserted moves&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        EARLY&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    inst2      r0, r2, r3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        LATE&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          { r6 := r5, r5 := r6 }    &amp;lt;--- multiple moves in parallel!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                         (arbitrary permutations)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        EARLY&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    inst3      r5, r4, r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        LATE&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is why we make a distinction between the &quot;After&quot; point of
instruction &lt;em&gt;i&lt;&#x2F;em&gt; and the &quot;Before&quot; point of instruction &lt;em&gt;i+1&lt;&#x2F;em&gt;, though a
traditional compiler textbook would tell you that there is only one
program point between a pair of instructions. We have two, and between
these two program points lies the parallel move.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#10&quot;&gt;10&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The process for generating these moves is: we scan liveranges, finding
points at which they have been split into pieces where the value must
flow from one piece to the next. We also account for CFG edges and
block parameters at this point, as well as for move instructions in
the input program. Once we have accumulated the set of moves that must
happen, in parallel, at a given priority at a given location, we
resolve these into a sequence of individual move&#x2F;load&#x2F;store
instructions using the algorithm we describe in the next section.&lt;&#x2F;p&gt;
&lt;p&gt;One thing to note about this design is that we are handling &lt;em&gt;all&lt;&#x2F;em&gt;
value movement in the program with a single resolution mechanism:
regalloc-induced movement but also movement that was present in the
original program. This is valuable because it allows the moves to be
handled more efficiently. In contrast, we have observed issues in the
past in allocators that lower moves in stages -- e.g., SSA block
parameters to moves prior to regalloc, then regalloc-induced moves
during regalloc -- where chains of moves occur because each level of
abstraction is not aware of what other levels below or above it are
doing.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;parallel-move-resolution&quot;&gt;Parallel Move Resolution&lt;&#x2F;h4&gt;
&lt;p&gt;The actual problem of resolving a permutation such as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    { r0 := r1 ; r1 := r2 ; r2 := r0 }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;into a sequence of moves&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    scratch := r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    r0 := r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    r1 := r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    r2 := scratch&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;is a well-studied one, and is known as the &quot;parallel moves
problem&quot;. The crux of the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;moves.rs&quot;&gt;solution&lt;&#x2F;a&gt;
is to understand the permutation as a kind of dependency graph, and
sort its moves so that we pull an old value out of a given register
before overwriting it. When we encounter a cycle, we can use a scratch
register as above.&lt;&#x2F;p&gt;
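&lt;p&gt;As a concrete (and much-simplified) sketch of this idea -- not
regalloc2&#x27;s actual &lt;code&gt;moves.rs&lt;&#x2F;code&gt;, and with illustrative
names -- the resolver below emits any move whose destination is no
longer read by a pending move, and when none exists (a cycle), saves
one about-to-be-clobbered value to a scratch location and redirects its
readers:&lt;&#x2F;p&gt;

```rust
// A minimal sketch (not regalloc2's actual moves.rs) of sequentializing
// a parallel move. `Loc` and `resolve_parallel_moves` are illustrative.

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Loc {
    Reg(u8),
    Scratch,
}

/// Turn a parallel move `{dst := src, ...}` (at most one writer per
/// destination) into an ordered sequence, breaking cycles with `Scratch`.
fn resolve_parallel_moves(mut pending: Vec<(Loc, Loc)>) -> Vec<(Loc, Loc)> {
    let mut out = Vec::new();
    while !pending.is_empty() {
        // Emit any move whose destination no other pending move still reads.
        if let Some(i) = (0..pending.len()).find(|&i| {
            pending
                .iter()
                .enumerate()
                .all(|(j, &(_, src))| j == i || src != pending[i].0)
        }) {
            out.push(pending.remove(i));
        } else {
            // Every destination is still read: we are in a simple cycle.
            // Save one about-to-be-clobbered value, then redirect readers.
            let (dst, _) = pending[0];
            out.push((Loc::Scratch, dst));
            for m in pending.iter_mut() {
                if m.1 == dst {
                    m.1 = Loc::Scratch;
                }
            }
        }
    }
    out
}
```

&lt;p&gt;Simulating the emitted sequence against the original parallel
semantics is an easy way to sanity-check a resolver like this.&lt;&#x2F;p&gt;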
&lt;p&gt;One might think that something like &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Tarjan%27s_strongly_connected_components_algorithm&quot;&gt;Tarjan&#x27;s
algorithm&lt;&#x2F;a&gt;
for finding strongly-connected components is needed, but in fact there
is a nice property of the problem that greatly simplifies it. Because
any valid permutation has at most one writer for any given register,
we can &lt;em&gt;only have simple cycles&lt;&#x2F;em&gt; of moves, with other uses of old
values in the cycle handled before realizing the cyclic move. Some
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;moves.rs#L92-L103&quot;&gt;more
description&lt;&#x2F;a&gt;
is available in our implementation. In fact, this is such a nice
observation that we later discovered &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hal.inria.fr&#x2F;inria-00289709&#x2F;document&quot;&gt;a
paper&lt;&#x2F;a&gt; by Rideau et
al. that names the resulting dependency graphs &quot;windmills&quot; for their
shape (see figure below -- there can be a simple cycle in the middle,
and only acyclic outward moves from cycle elements in a tree of
outward shifts) and, delightfully, describes more or less the same
algorithm to &quot;tilt at windmills&quot; and resolve the moves.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;cranelift-regalloc2-fig4.svg&quot; alt=&quot;Figure: &amp;quot;Windmills&amp;quot; in a register movement graph&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;scratch-registers-and-cycles&quot;&gt;Scratch Registers and Cycles&lt;&#x2F;h4&gt;
&lt;p&gt;The above algorithm works, but has one serious drawback: it requires a
scratch register whenever we have a cyclic move. The simplest approach
to this requirement is to set aside one register permanently (or
actually, one per &quot;register class&quot;: e.g., an integer register and a
float&#x2F;vector register). Especially on ISAs with relatively few
registers, like x86-64 with 16 each of integer and float registers,
this can impact performance by increasing register pressure and
forcing more spills.&lt;&#x2F;p&gt;
&lt;p&gt;We thus came up with a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;51&quot;&gt;scheme&lt;&#x2F;a&gt; to
allow use of all registers but still find a scratch when needed for a
cyclic move. The approach begins with an idea &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;de15f9c109f9c474d00faf8032f559c236067c06&#x2F;js&#x2F;src&#x2F;jit&#x2F;BacktrackingAllocator.cpp#2582&quot;&gt;borrowed from
IonMonkey&lt;&#x2F;a&gt;,
namely to look for a free register to use as a scratch by actually
probing the allocation maps. This often works: the need for a cyclic
move doesn&#x27;t necessarily imply that we will have high register
pressure, and so there are often plenty of free registers available.&lt;&#x2F;p&gt;
&lt;p&gt;What if that doesn&#x27;t work, though? In the above PR, we take another
seemingly-simplistic approach: we use a stackslot as the scratch
instead! This means that we will resolve the cyclic move into a
sequence including stores and loads, but this is fine, because we&#x27;re
already in a situation where all registers are full and we need to
spill &lt;em&gt;something&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We&#x27;re not quite done, though: there is another very important use of
the scratch register in a simplistic design, namely to resolve
memory-to-memory moves! This arises because our move resolution
handles both registers and stackslots in a uniform way, so some cycle
elements may be stackslots (memory locations). Using a stackslot as a
scratch above just compounds the problem. So we translate, in a
separate second phase, memory-to-memory moves into a &lt;em&gt;pair&lt;&#x2F;em&gt; of a load
(from memory into scratch) and a store (from scratch into memory).&lt;&#x2F;p&gt;
&lt;p&gt;So to recap, we may find a cyclic move permutation to be necessary,
and no registers to be free to use as scratch; so we use a stackslot
instead. But some of the original move cycle may have been between
stackslots, so we need &lt;em&gt;another&lt;&#x2F;em&gt; scratch to make these
stackslot-to-stackslot moves possible. But we&#x27;re already out of
scratch registers!&lt;&#x2F;p&gt;
&lt;p&gt;The solution to this last issue is that we can do a last-ditch
emergency spill of &lt;em&gt;any&lt;&#x2F;em&gt; register, just for the duration of one
move. So we pick a &quot;victim&quot; register of the right kind (integer or
float), spill it to a second stackslot, use this victim register for a
memory-to-memory move (a load and store pair), then reload the victim.&lt;&#x2F;p&gt;
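&lt;p&gt;The worst-case expansion can be sketched as follows (a sketch of
the strategy with hypothetical names, not regalloc2&#x27;s code): one
stackslot-to-stackslot move becomes four instructions centered on a
borrowed &quot;victim&quot; register.&lt;&#x2F;p&gt;

```rust
// Hypothetical sketch of the last-ditch fallback: with no free register,
// a stackslot-to-stackslot move expands into four steps around a borrowed
// "victim" register. `Alloc` and the function name are illustrative.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Alloc {
    Reg(u8),   // physical register number
    Slot(u32), // spillslot index
}

/// Expand `Slot(dst) := Slot(src)` using `victim` as a borrowed register,
/// preserved in `save_slot` for the duration of the move.
fn expand_mem_to_mem(dst: u32, src: u32, victim: u8, save_slot: u32) -> Vec<(Alloc, Alloc)> {
    vec![
        (Alloc::Slot(save_slot), Alloc::Reg(victim)), // spill the victim
        (Alloc::Reg(victim), Alloc::Slot(src)),       // load the moved value
        (Alloc::Slot(dst), Alloc::Reg(victim)),       // store it to its home
        (Alloc::Reg(victim), Alloc::Slot(save_slot)), // reload the victim
    ]
}
```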
&lt;p&gt;This cascading series of solutions, each a little more complex but a
little rarer, is an example of a complexity-for-performance
tradeoff. Overall, it is far better to allow the program to use all
registers; this will reduce spills. And most parallel moves are &lt;em&gt;not&lt;&#x2F;em&gt;
cyclic, so scratch registers are rarely needed anyway. And when a
cyclic move &lt;em&gt;is&lt;&#x2F;em&gt; needed, we often have a free register, because this
condition is mostly orthogonal to high register pressure. It is only
when all of the bad cases line up -- cycle, no free registers, and
memory-to-memory moves -- that we reach for the highest-cost approach
(decomposing one move into four), and so the most important aspect of
this fallback is not that it is fast but that it is correct and can
handle all cases.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;everything-else&quot;&gt;Everything Else&lt;&#x2F;h3&gt;
&lt;p&gt;This has been a not-so-whirlwind tour of the allocator pipeline in
regalloc2, but despite my longwindedness, we had to skip many details!
For example, the way in which stackslots are allocated for spilled
values, the way in which split pieces of a single original bundle
share a single spill location (&quot;spill bundles&quot;), the way in which we
clean up after move insertion with Redundant Move Elimination (a sort
of abstract interpretation that tracks symbolic locations of values),
and more, are skipped here but are all described in the design
document. One could truly write a book on the engineering of a
register allocator, but the above will have to suffice; now, we must
move on and draw some lessons!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;four-lessons&quot;&gt;Four Lessons&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;performance&quot;&gt;Performance&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;cache-locality-and-scans&quot;&gt;Cache Locality and Scans&lt;&#x2F;h4&gt;
&lt;p&gt;One enduring theme in the regalloc2 architecture is &lt;em&gt;data structure
design for performance&lt;&#x2F;em&gt;. As I began the project by transliterating
IonMonkey code, building Rust equivalents to the data structures in
the original C++, I found several things:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The original data structures were heavily &lt;em&gt;pointer-linked&lt;&#x2F;em&gt;. For
example, liveranges within bundles and uses within liveranges were
kept as linked lists, to allow for fast insertion and removal in the
middle, and splicing. A linked list is the classical CS answer to
these requirements.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;There were quite a few linear-time queries of these data
structures. For example, when generating moves between liveranges of
a virtual register, a scan would traverse the linked list of these
liveranges, observe the range covering one end of a control-flow
transition, and do a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;7751fef9eeb3db0a07ae4680daa2a62bd8f49882&#x2F;js&#x2F;src&#x2F;jit&#x2F;BacktrackingAllocator.cpp#2196&quot;&gt;&lt;em&gt;linear-time
scan&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;
(through the linked list) for the liverange at the other end!&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These two design trends combine to make CPU caches exceptionally
unhappy. First there is the algorithmic inefficiency, then there is
the cache-unfriendly demand access to random liveranges, each of which
is a pointer-chasing scan.&lt;&#x2F;p&gt;
&lt;p&gt;regalloc2 adopts two general themes that work against these problems:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The overall data structure design consists of &lt;em&gt;contiguous-in-memory
inline structs&lt;&#x2F;em&gt; rather than linked lists. For example, the list of
liveranges in a bundle is a &lt;code&gt;SmallVec&amp;lt;[LiveRangeListEntry; 4]&amp;gt;&lt;&#x2F;code&gt;,
i.e. a list with up to four entries inline and otherwise
heap-allocated, and the entry struct contains the program-point
range inline. Combining this more compact layout with certain
&lt;em&gt;invariants&lt;&#x2F;em&gt; -- usually, some sort of sorted-order invariant --
allows for efficient lookups and list merges even without
linked-list splicing.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;At a higher level, regalloc2 tries to &lt;em&gt;avoid random lookups as much
as possible&lt;&#x2F;em&gt;. Sometimes this is unavoidable, but where it is not, a
linear scan that produces some output as it goes is much more
cache-friendly.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;It is worth examining the particular technique we use to resolve moves
across control-flow edges. This requires looking up where a virtual
register is allocated at either end of the edge -- two arbitrary
points in the linear sequence of instructions. The problem is solved
in IonMonkey (as we linked above) by scanning over ranges to find
basic block ends and then doing a linear-time linked-list traversal to
find the &quot;other end&quot;, for overall quadratic time.&lt;&#x2F;p&gt;
&lt;p&gt;Instead we scan the liveranges for a virtual register once and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;ion&#x2F;moves.rs#L126-L164&quot;&gt;produce &quot;half-moves&quot; into a
&lt;code&gt;Vec&lt;&#x2F;code&gt;.&lt;&#x2F;a&gt;
These &quot;half-moves&quot; are records of either the &quot;source&quot; side of a move,
at the origin point of a CFG edge, or the &quot;destination&quot; side of a
move, at the destination point of a CFG edge. After our single scan,
we sort the list of half-moves by a key (the vreg and destination
block) so that the source and destination(s) appear together. We can
then scan &lt;em&gt;this&lt;&#x2F;em&gt; list once and generate all moves in bulk.&lt;&#x2F;p&gt;
&lt;p&gt;If that sounds something like
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;MapReduce&quot;&gt;MapReduce&lt;&#x2F;a&gt;, that is not an
accident: the technique of leveraging a sort with a well-chosen key
was invented to allow for efficient parallel computation, and here
allows the two &quot;ends&quot; of the move to be processed independently.&lt;&#x2F;p&gt;
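&lt;p&gt;A simplified sketch of the technique (the names here are
hypothetical, not regalloc2&#x27;s): one record type carries either end of
a move, the derived ordering makes &lt;code&gt;Source&lt;&#x2F;code&gt; sort before
&lt;code&gt;Dest&lt;&#x2F;code&gt; within each (block, vreg) key, and a single pass
over the sorted list pairs the ends up:&lt;&#x2F;p&gt;

```rust
// Illustrative sketch of the "half-move" technique; names are hypothetical.
// One linear scan emits Source/Dest records; a sort brings each pair together.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Side {
    Source, // declared first so it sorts before Dest within a key
    Dest,
}

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct HalfMove {
    to_block: u32, // destination block of the CFG edge
    vreg: u32,     // virtual register flowing across the edge
    side: Side,    // which end of the move this record describes
    alloc: u32,    // the allocation (register/slot) at this end
}

/// Sort half-moves so each (block, vreg) source meets its destination(s),
/// then emit one full (to, from) move per destination. Assumes well-formed
/// input: every key group contains a Source record.
fn resolve_half_moves(mut halves: Vec<HalfMove>) -> Vec<(u32, u32)> {
    halves.sort(); // key order: (to_block, vreg, side), Source first
    let mut moves = Vec::new();
    let mut src = None;
    for h in &halves {
        match h.side {
            Side::Source => src = Some(h.alloc),
            Side::Dest => moves.push((h.alloc, src.expect("source precedes dests"))),
        }
    }
    moves
}
```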
&lt;p&gt;This technique provides better algorithmic efficiency, much better
cache residency (we have two steps that boil down to &quot;scan input list
linearly and produce output list linearly&quot;), and leans on the
standard-library implementation of &lt;code&gt;sort()&lt;&#x2F;code&gt;, which is likely to be
faster than anything we can come up with. Profiling of regalloc2 runs
shows sometimes up to 10% or so of runtime spent in &lt;code&gt;sort()&lt;&#x2F;code&gt;, but this
is far better than the alternative, in which we do a random
pointer-chasing lookup at every step.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;compact-data&quot;&gt;Compact Data&lt;&#x2F;h4&gt;
&lt;p&gt;Another lesson learned over and over during regalloc2 optimization is
this: data compactness matters! A single &lt;code&gt;struct&lt;&#x2F;code&gt; growing from 16 to
24 bytes could lead to significant slowdowns if a large input leads to
allocation and traversals over an array of 10,000 such structs. Every
improvement in memory footprint is a reduction in cache misses.&lt;&#x2F;p&gt;
&lt;p&gt;We play many games with bitpacking to achieve this. For example,
regalloc2 puts its &lt;code&gt;Operand&lt;&#x2F;code&gt; in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;lib.rs#L407-L422&quot;&gt;32
bits&lt;&#x2F;a&gt;,
and this includes a virtual register number, a constraint, a physical
register number to possibly go with that constraint, the position
(early&#x2F;late), kind (def&#x2F;use), and register class of the operand. Some
of this optimization requires compromise: as a result of our encoding
scheme, for example, we can allow only 2M (2&lt;sup&gt;21&lt;&#x2F;sup&gt;) virtual
registers per function body. But in practice most applications will have
other limits that take effect before this matters. (And in any case,
many compilers play these same sorts of tricks, so megabytes-large
function bodies are problematic in all sorts of ways.) And we sometimes
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;13&quot;&gt;find ways to pack a few more
bits&lt;&#x2F;a&gt; (more such
PRs are always welcome!).&lt;&#x2F;p&gt;
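&lt;p&gt;To illustrate the flavor of this bitpacking (with a made-up layout,
not regalloc2&#x27;s exact bit assignment), a 21-bit vreg index and
several small enum fields fit comfortably in a
&lt;code&gt;u32&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;

```rust
// An illustrative (not regalloc2's exact) 32-bit operand packing:
// bits 0-20 vreg index, 21-22 register class, 23 kind (def/use),
// 24 position (early/late), 25-29 constraint kind.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct PackedOperand(u32);

impl PackedOperand {
    fn new(vreg: u32, class: u32, kind: u32, pos: u32, constraint: u32) -> Self {
        // Each field must fit its allotted bits.
        assert!(vreg < (1 << 21) && class < 4 && kind < 2 && pos < 2 && constraint < 32);
        PackedOperand(vreg | class << 21 | kind << 23 | pos << 24 | constraint << 25)
    }
    fn vreg(self) -> u32 { self.0 & ((1 << 21) - 1) }
    fn class(self) -> u32 { (self.0 >> 21) & 3 }
    fn kind(self) -> u32 { (self.0 >> 23) & 1 }
    fn pos(self) -> u32 { (self.0 >> 24) & 1 }
    fn constraint(self) -> u32 { (self.0 >> 25) & 31 }
}
```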
&lt;p&gt;We play similar tricks with program points, spill weights (we &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;ion&#x2F;liveranges.rs#L55-L70&quot;&gt;store
them as
bfloat16&lt;&#x2F;a&gt;
because spill weights need not be too precise, only relatively
comparable, and using only 16 bits lets us pack some flags in the
upper 16 and save a &lt;code&gt;u32&lt;&#x2F;code&gt;), and more.&lt;&#x2F;p&gt;
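&lt;p&gt;The bfloat16 trick itself is tiny: keep only the top 16 bits of
the &lt;code&gt;f32&lt;&#x2F;code&gt;, which preserves ordering among non-negative
weights. A sketch (not the exact regalloc2 code):&lt;&#x2F;p&gt;

```rust
// Sketch of the bfloat16 representation: a bf16 is just the top 16 bits
// of an f32 (sign, full exponent, truncated mantissa). For non-negative
// floats the truncated bit patterns still compare in the same order,
// which is all a spill weight needs.

fn to_bf16(w: f32) -> u16 {
    (w.to_bits() >> 16) as u16
}

fn from_bf16(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}
```

&lt;p&gt;Two such truncated weights can then share a &lt;code&gt;u32&lt;&#x2F;code&gt;
with 16 bits of flags, exactly the kind of saving described
above.&lt;&#x2F;p&gt;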
&lt;p&gt;Finally, trading off indirection and data-inlining is important: e.g.,
a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;ion&#x2F;data_structures.rs#L89-L94&quot;&gt;&lt;code&gt;LiveRangeList&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
keeps the program-point range (32 + 32 bits) inline, then a 32-bit
index to indirect to everything else about the liverange, because
checking for bundle overlap is the most common reason for traversing
this list and reducing cache misses in this inner loop is paramount.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;reducing-work&quot;&gt;Reducing Work&lt;&#x2F;h4&gt;
&lt;p&gt;One final performance technique that at once both sounds completely
obvious and superficial, yet is quite powerful, is: &quot;simply do less
work!&quot;&lt;&#x2F;p&gt;
&lt;p&gt;One can often get lost in profiler results, wondering how to shave off
some hotspots by compacting some data or reworking some inner-loop
logic, only to miss that one is implicitly assuming that the actual
computation to be done is invariant. In other words, one might look
for the fastest way to compute a particular subproblem or framing of
the problem, rather than the ultimate problem at hand (in this case,
the register allocation).&lt;&#x2F;p&gt;
&lt;p&gt;In the case of regalloc2, this primarily means that we can improve
performance by &lt;em&gt;reducing the number of bundles and liveranges&lt;&#x2F;em&gt;. In
turn, this means that we can get outsized wins by improving our
merging and splitting heuristics.&lt;&#x2F;p&gt;
&lt;p&gt;Early in the optimization push, I realized that regalloc2 was often
finding an abnormally large number of conflicts between bundles, and
splitting far too aggressively. It turned out that the liveness
analysis was initially &lt;em&gt;approximate&lt;&#x2F;em&gt;, in an intentional, if premature,
efficiency tradeoff to avoid a fixpoint loop in favor of a single-pass
loop-backedge-based algorithm that overapproximated liveness (which is
fine for correctness). The time that this saved was more than offset
by the large increase in iterations of the bundle processing loop. So
I reworked this into a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;b78ccbce6e5700bc1ed2356bbb1d3221de49a353&#x2F;src&#x2F;ion&#x2F;liveranges.rs#L327-L401&quot;&gt;precise
analysis&lt;&#x2F;a&gt;
that iterates until fixpoint. It is worthwhile to pay that extra
analysis cost upfront to get exact liveness in order to make our lives
(and our runtime) better later.&lt;&#x2F;p&gt;
&lt;p&gt;The way in which we compute that precise liveness itself also raises
an interesting way of reducing work: by carefully choosing
invariants. We perform the liverange-building scan in such a way that
we &lt;em&gt;always observe liveranges in (reverse) program order&lt;&#x2F;em&gt;. This lets
us build the liverange data structures, which are normally sorted,
with simple appends, merging with contiguous sections from adjacent
blocks. This is in contrast to the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;70cf6863bd85af2a3188ec1fe5209a3ec1b2de86&#x2F;js&#x2F;src&#x2F;jit&#x2F;BacktrackingAllocator.cpp#263-340&quot;&gt;original IonMonkey allocator&#x27;s
equivalent
function&lt;&#x2F;a&gt;
to add liveranges during analysis, which essentially does an insertion
sort and merge, leading to O(n²) behavior. Note that the IonMonkey
code has a &lt;code&gt;CoalesceLimit&lt;&#x2F;code&gt; constant that caps the O(n²) behavior at
some fixed limit. In contrast our liverange build in regalloc2 is
always linear-time.&lt;&#x2F;p&gt;
&lt;p&gt;The final way in which one can reduce work, related to data-structure
and invariant choice, is by designing the input (API or data format)
correctly in order to efficiently encode the problem. The register
allocator that preceded regalloc2, regalloc.rs, did not have a notion
of register constraints in instructions&#x27; use of virtual
registers. Instead, it required the user to use move instructions:
reused-input constraints become a move prior to the instruction, and
fixed-register constraints become moves to&#x2F;from physical registers. It
then relied on a separate move-elision analysis to try to eliminate
these moves. regalloc2 has a smaller input because constraints are
carried on every operand. It can still generate these moves when
needed, but they often are not. This results in faster allocation as
well as often better generated code.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;correctness-design-for-test-and-fuzzing-first-development&quot;&gt;Correctness: &quot;Design for Test&quot; and Fuzzing-First Development&lt;&#x2F;h3&gt;
&lt;p&gt;The next set of lessons to come from regalloc2 have to do with &lt;em&gt;how to
attain correctness in complex programs&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;I believe that regalloc2 is maybe the most &lt;em&gt;intrinsically complex&lt;&#x2F;em&gt;
program I have written: its operation relies on many interlocking
invariants across the allocation process, and there are many, many
edge cases to get right. It is &amp;gt;10K lines of very dense Rust
code. There should be approximately zero chance for any human to get
this correct, working on real inputs, in any reasonable timeframe. And
relying on something this complex to uphold security guarantees that
rely on correct compilation should be terrifying.&lt;&#x2F;p&gt;
&lt;p&gt;And yet somehow it seems to work, and we haven&#x27;t found any miscompiles
caused by RA2 itself since we switched Cranelift to use regalloc2 in
April. More broadly, there was
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;54&quot;&gt;one&lt;&#x2F;a&gt; issue
where constraints generated by Cranelift could not be handled in some
cases, resulting in a panic&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#11&quot;&gt;11&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;; and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;56&quot;&gt;another&lt;&#x2F;a&gt; where
spillslots were not reused as they should be, resulting in worse
performance; neither could result in incorrect generated code. In the
integration of RA2 into Cranelift, there were
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;4042&quot;&gt;two&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;4044&quot;&gt;bugs&lt;&#x2F;a&gt; that
could, but both were found within 24 hours by the fuzzers. (That
doesn&#x27;t mean there won&#x27;t be any more of course -- but things have been
surprisingly boring and quiet!)&lt;&#x2F;p&gt;
&lt;p&gt;The main superpower, if one can call it that, that enabled this to
work out is &lt;em&gt;fuzzing&lt;&#x2F;em&gt;. And in particular, a step-by-step approach to
fuzzing in which I built fuzzing oracles, test harnesses, and fuzz
targets as I built the allocator itself, and drove development with
it. Until about four months in, when I wired up the first version of the
Cranelift integration, regalloc2 had &lt;em&gt;only&lt;&#x2F;em&gt; ever performed register
allocation for fuzz-target-generated inputs. It still doesn&#x27;t have a
test harness for manually-written tests; there seems to be no need, as
the fuzzer is remarkably prescient at finding bugs.&lt;&#x2F;p&gt;
&lt;p&gt;I find it helpful to think of this philosophy in terms of the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Design_for_testing&quot;&gt;design-for-test&lt;&#x2F;a&gt;
idea from digital hardware design. In brief, the idea is that one
builds additional features or interfaces into the hardware
specifically so its internal state is visible and it can be tested in
controlled, systematic ways.&lt;&#x2F;p&gt;
&lt;p&gt;The first thing that I built in the regalloc2 tree was a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;fuzzing&#x2F;func.rs#L300-L308&quot;&gt;function
body
generator&lt;&#x2F;a&gt;
that produces arbitrary control flow, either reducible or irreducible,
and arbitrary uses and defs according to what SSA allows. I then built
an &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;src&#x2F;ssa.rs&quot;&gt;SSA
validator&lt;&#x2F;a&gt;,
and finally, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;fuzz&#x2F;fuzz_targets&#x2F;ssagen.rs&quot;&gt;fuzzed one against the
other&lt;&#x2F;a&gt;. This
way I built confidence that I had fuzzing input that included
interesting edge cases. This would become an important tool for
testing the whole allocator, but it was important to &quot;test the tester&quot;
first and cross-check it against SSA&#x27;s requirements. Of course,
checking SSA requires one to compute flowgraph dominance on the CFG,
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;fuzz&#x2F;fuzz_targets&#x2F;domtree.rs&quot;&gt;that can be fuzzed
too&lt;&#x2F;a&gt;,
using a from-first-principles definition of graph dominance. So the
test-tester has itself been tested in this additional way.&lt;&#x2F;p&gt;
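&lt;p&gt;The appeal of such an oracle is that it can be obviously correct
rather than fast. A from-first-principles dominance computation
(illustrative, not the fuzz target&#x27;s exact code) just iterates
&quot;d dominates v iff d equals v, or d dominates every predecessor of
v&quot; to a fixpoint:&lt;&#x2F;p&gt;

```rust
// A naive, obviously-correct dominance oracle over a CFG given as a
// predecessor list, with node 0 as the entry. dom[v][d] == true means
// d dominates v. Illustrative code, not the regalloc2 fuzz target itself.

fn dominators(preds: &[Vec<usize>]) -> Vec<Vec<bool>> {
    let n = preds.len();
    // Over-approximate ("everything dominates everything"), then shrink.
    let mut dom = vec![vec![true; n]; n];
    for d in 0..n {
        dom[0][d] = d == 0; // the entry is dominated only by itself
    }
    let mut changed = true;
    while changed {
        changed = false;
        for v in 1..n {
            for d in 0..n {
                // d dominates v iff d == v, or every path to v passes
                // through a predecessor that d already dominates.
                let new = d == v || preds[v].iter().all(|&p| dom[p][d]);
                if dom[v][d] && !new {
                    dom[v][d] = false;
                    changed = true;
                }
            }
        }
    }
    dom
}
```

&lt;p&gt;Because sets only ever shrink, the loop terminates; comparing this
against a clever algorithm on arbitrary generated CFGs is exactly the
kind of cross-check described above.&lt;&#x2F;p&gt;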
&lt;p&gt;Once I had built enough trust in the lower-level tools, and
sharpened them all against each other, it was time to write the
register allocator itself. Once each major piece was implemented, I
first fuzzed it with the SSA function generator to check for panics
(assertion failures, mostly). Getting a clean run, given the
relatively generous spread of asserts throughout the codebase, gave
some confidence that the allocator was doing &lt;em&gt;something&lt;&#x2F;em&gt;
reasonable. But to truly be confident that the results were
semantically correct answers, we needed to lean more heavily on some
program analysis techniques.&lt;&#x2F;p&gt;
&lt;p&gt;In &lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;03&#x2F;15&#x2F;cranelift-isel-3&#x2F;&quot;&gt;another blog post&lt;&#x2F;a&gt; I detailed
our &quot;register allocator checker&quot;. In brief, this is a &lt;em&gt;symbolic
verification&lt;&#x2F;em&gt; engine that checks that the resulting register
allocations produce the same dataflow connectivity as the original,
pre-regalloc program. To fully verify regalloc2, I ported the checker
over, and drove the whole pipeline -- SSA function generator,
allocator, and checker -- with a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;fuzz&#x2F;fuzz_targets&#x2F;ion_checker.rs&quot;&gt;fuzz
target&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This workflow was remarkably (sometimes maddeningly!) effective. I
started with a supposedly complete allocator, and ran the
fuzzer. Within a few seconds it found a &quot;counterexample&quot; where,
according to the checker, regalloc2 produced an incorrect
allocation. I built
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;src&#x2F;ion&#x2F;dump.rs&quot;&gt;annotation&lt;&#x2F;a&gt;
tooling to produce &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;cfallin&#x2F;38d80aac45da75ce9eb142f5e28c0648&quot;&gt;views of the allocator&#x27;s liveranges and other
metadata&lt;&#x2F;a&gt;
over the original program. I pored over this and debug-log output of
the allocator&#x27;s various stages, eventually worked out the bug (often
some corner-case I had not considered, or sometimes an unexpected
interaction between two different parts of the system) and came up
with a fix. With the particular fuzz-bug fixed, I started up the main
fuzzer again. libFuzzer seems to run over the entire corpus at
startup before generating new inputs, so it would sometimes quickly
reveal that my bugfixes had caused regressions in cases I had already
handled before. After
juggling solutions and finding some way to maintain correctness in all
cases, I would let the fuzzer run again, usually finding my next novel
fuzzbug within a few minutes.&lt;&#x2F;p&gt;
&lt;p&gt;This was my life for a month or so. Fuzzers, especially over complex
programs with strict oracles, are &lt;em&gt;relentless&lt;&#x2F;em&gt;: they leave no stone
unturned, they find every bug you could imagine and some you can&#x27;t,
and they accept no excuses. But one day... you run the fuzzer and you
find that it keeps running. And running. Three hours later, it&#x27;s still
running. There is no better feeling in the software-engineering
universe, and frankly fuzzing with a strong oracle (like symbolic
checking or differential execution fuzzing) is probably the
second-strongest assurance one will get that one&#x27;s code is &lt;em&gt;correct&lt;&#x2F;em&gt;
(with respect to the &quot;spec&quot; implied by the testcase generator and
oracles, mind!) short of actual formal verification. This was the
project that changed my opinion on fuzzing from &quot;nice to have
supplemental correctness technique&quot; to &quot;the only way to develop
complex software&quot;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;compatibility-and-migration-path&quot;&gt;Compatibility and Migration Path&lt;&#x2F;h3&gt;
&lt;p&gt;The last lesson I want to draw from my regalloc2 experience is how one
might think about compatibility and migrations, in the context of
large &quot;replace a whole unit&quot; updates to software projects.&lt;&#x2F;p&gt;
&lt;p&gt;The regalloc2 effort occurred within the context of the Cranelift
project, and was designed primarily for use in Cranelift (though it
can be used, and apparently is being used, as a standalone library
elsewhere as well). As such, a primary design directive for regalloc2
could be &quot;do whatever is needed to fit into Cranelift&#x27;s assumptions
about the register allocator&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, conforming to the imprint left by the last register
allocator is a good way to sacrifice a rare chance to explore
different corners of the design space. The regalloc.rs API, as
designed in 2020, was quite good for the time -- simple, easy
to use, and purpose-built for Cranelift -- but we subsequently learned
several lessons. For example, regalloc.rs required the program to be
already lowered out of SSA, resulting in somewhat inefficient
interactions between blockparam-generated moves and regalloc-generated
moves. Ideally we wanted to do something better here.&lt;&#x2F;p&gt;
&lt;p&gt;A timeline for context: regalloc2 proper was working, with its fuzzer
as its only client, after about 6 weeks of initial implementation
(late March to early May 2021). I cheerfully dove into a refactoring
of Cranelift at that point to adapt to the new abstractions.&lt;&#x2F;p&gt;
&lt;p&gt;Less cheerfully, after a few weeks of effort, I stopped this
direct-port effort at around 547 type errors remaining (having never
gotten past a full typecheck). There was simply too much changing all
at once, and it was clearly not going to be a reasonable single diff
to review or to trust for correctness. I had underestimated how much
would have to change; pulling one string loosened three others.&lt;&#x2F;p&gt;
&lt;p&gt;It was clear that some sort of transition would need to happen in
multiple stages, so I next built a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;pull&#x2F;127&quot;&gt;compatibility
shim&lt;&#x2F;a&gt; as a
new &quot;algorithm&quot; in regalloc.rs that was a thin wrapper around
regalloc2. This involved significant work in regalloc2 to expand its
range of accepted inputs: support for non-SSA code, support for
&quot;modify&quot; operands as well as uses and defs, and explicit handling of
program-level moves with integration into the move generation
logic. This was working by August of 2021. Performance results were
not as good as initially expected with &quot;native&quot; regalloc2 API usage,
but were a promising intermediate step nonetheless.&lt;&#x2F;p&gt;
&lt;p&gt;However, for somewhat complicated reasons, review of that PR stalled,
and I spent time in other parts of Cranelift (the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;isle&#x2F;docs&#x2F;language-reference.md&quot;&gt;ISLE&lt;&#x2F;a&gt;
DSL and instruction-selector backends using it). When I eventually
came back to RA2, in February 2022, several things had changed: some
refactoring (as a result of ISLE) made adaptation to &quot;SSA-like&quot; form
in x86 instructions easier, and the enhancements to regalloc2 as part
of the regalloc.rs compatibility shim also let us use RA2 directly and
migrate away from &quot;modify&quot; operands, moves, etc., in an incremental
way.&lt;&#x2F;p&gt;
&lt;p&gt;So I made a second attempt at porting Cranelift to use regalloc2
directly, this time
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;3989&quot;&gt;succeeding&lt;&#x2F;a&gt;,
to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;3942&quot;&gt;fairly good
results&lt;&#x2F;a&gt;. We&#x27;ve
been using RA2 since that PR merged in mid-April 2022, about a year
after RA2 began.&lt;&#x2F;p&gt;
&lt;p&gt;I learned a few valuable lessons from this saga, but the main one is:
incremental migration paths are everything. The above PR may look
horribly scary, but much of the churn was &quot;semantically boring&quot;: RA2
supported, in the end, most of the same abstractions as regalloc.rs,
with only blockparam handling changing fundamentally. This is a sort
of hybrid of the &quot;compatibility shim&quot; and &quot;direct use of new API&quot;
approaches: new API, but supporting a superset of the semantic demands
of the old API. One can then migrate single API use-sites at a time
away from &quot;legacy semantics&quot; and eventually delete the warts (e.g.,
&quot;modify&quot; operands in addition to pure uses&#x2F;defs) if one desires, but
that is decoupled from the main atomic switchover. I indeed hope to do
such cleanup in Cranelift, in due time.&lt;&#x2F;p&gt;
&lt;p&gt;Along with that, it is useful to think of a finite budget for
semantic&#x2F;design-level cleanup per change. Rewrites are opportune times
to push a project into a better design-space and benefit from lessons
learned, sometimes in ways that would be hard or impossible to do with
a truly incremental approach. However, at the margins where the
rewrite connects to the outside world, this shift causes tension and
so is fundamentally constrained or else has to pull the whole world
along with it. I am happy that regalloc2 pulls responsibility for SSA
lowering into the allocator; it can be handled more efficiently
there. Likewise I am happy that the compatibility-shim effort filled
in support for regalloc.rs features that made the rest of the
transition easier.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;unending-and-unwinnable-nature-of-heuristic-tuning&quot;&gt;Unending and Unwinnable Nature of Heuristic-Tuning&lt;&#x2F;h3&gt;
&lt;p&gt;The final lesson I wish to pull out of this experience is one that has
become apparent in the time since the initial transition to RA2: any
program that solves an NP-complete problem in a complex way, with a
hybridized ball of hundreds of individual heuristics and techniques
that somehow works most of the time, is &lt;em&gt;always&lt;&#x2F;em&gt; going to make someone
unhappy in some case, and at some point unambiguous wins become very
hard to find. That is not at all to say that it&#x27;s not worth continuing
attempts at optimization; sometimes improvements do become
apparent. But they become much rarer after an initial hillclimb to the
top of a &quot;competent implementation of one point in design-space&quot; local
maximum.&lt;&#x2F;p&gt;
&lt;p&gt;While looking for more performance, I experimented with many different
split heuristics. Especially difficult is splits&#x27; relationship to
loops: when one has a hot inner loop, one &lt;em&gt;really&lt;&#x2F;em&gt; wants to place a
split-point that implies an expensive move (load or store) &lt;em&gt;outside&lt;&#x2F;em&gt;
the inner loop. But getting this right in all cases is subtle, because
the winning tradeoff depends on register pressure inside the loop, how
many values are live across the loop and to the following code, how
many uses occur in the loop and how frequently (rare path vs. common
path), and so on. In the end, I actually abandoned a number of more
complex cost heuristics (an example is in this &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;commit&#x2F;428e6a41f7b37697196e3a82e8326f22839307b5&quot;&gt;never-merged
commit&lt;&#x2F;a&gt;)
and went with several simple heuristics: &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;b78ccbce6e5700bc1ed2356bbb1d3221de49a353&#x2F;src&#x2F;ion&#x2F;process.rs#L896-L903&quot;&gt;minimize the cost of the
implied move at a
split&lt;&#x2F;a&gt;,
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;b78ccbce6e5700bc1ed2356bbb1d3221de49a353&#x2F;src&#x2F;ion&#x2F;process.rs#L1036-L1052&quot;&gt;explicitly hoist split-points outside of
loops&lt;&#x2F;a&gt;. This
worked best overall, but did leave a little performance unclaimed in
some microbenchmarks.&lt;&#x2F;p&gt;
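A toy model can illustrate the shape of the "hoist the split-point out of the loop" heuristic. This is not regalloc2's actual code or cost function, and the frequency model below (cost grows as ten to the loop depth) is purely for illustration: each candidate split position carries an estimated execution frequency, and we pick the cheapest position at or before the point where register pressure forces a split.

```rust
// Toy split-point chooser: `loop_depth[i]` is the loop-nesting depth
// of candidate position `i`, and `must_split_by` is the latest
// position at which the split can happen (e.g., where we ran out of
// registers). Positions inside loops are heavily penalized, so the
// scan naturally hoists the split outside the inner loop.
fn best_split_point(loop_depth: &[u32], must_split_by: usize) -> usize {
    (0..=must_split_by)
        // Cost model: 10^depth. `min_by_key` returns the first
        // position among ties, so among equally cheap positions we
        // split as early as possible.
        .min_by_key(|&i| 10u64.pow(loop_depth[i]))
        .unwrap()
}

fn main() {
    // Positions 1..=4 sit inside a loop; pressure forces a split by
    // position 3. The heuristic hoists the split to position 0,
    // outside the loop, so the implied load/store runs once rather
    // than on every iteration.
    let depths = [0, 1, 1, 1, 1, 0];
    assert_eq!(best_split_point(&depths, 3), 0);

    // If every eligible position is inside some loop, we settle for
    // the shallowest one.
    let depths2 = [1, 2, 1, 2];
    assert_eq!(best_split_point(&depths2, 3), 0);
}
```

The real tradeoff is far subtler, as the paragraph above notes: register pressure inside the loop, liveness across and past the loop, and use frequency all feed into whether hoisting actually wins.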
&lt;p&gt;Sometimes clearer improvements are still possible. One example of a
recent investigation: in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;3785&quot;&gt;#3785&lt;&#x2F;a&gt;, we
noticed that switching to RA2 had caused an extra move instruction to
appear in a particular sequence. This seems minor, but it is always
good to understand &lt;em&gt;why&lt;&#x2F;em&gt; it might have occurred and if it points to
some deeper issue. After some
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;49&quot;&gt;investigation&lt;&#x2F;a&gt;
it became apparent that the splitting heuristics were suboptimal in
the particular case of a liverange that spans from a
register-constrained use to a stack-constrained use. The details are
beyond the scope of this post (thank goodness, it&#x27;s long enough
already!); but empirically I found that trimming liveranges around a
split-site in a slightly different way tended to improve results.&lt;&#x2F;p&gt;
&lt;p&gt;So, some changes will be an unmitigated win, but not every tradeoff is
so clear-cut. At the very least, the nature of a register allocator is that one
will likely have an unending stream of &quot;could work better in this
case&quot; sorts of issues. Can&#x27;t win &#x27;em all (but keep trying
nonetheless!).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;&#x2F;h2&gt;
&lt;p&gt;We&#x27;re finally at the conclusions -- thanks to all who have persisted
in reading this far!&lt;&#x2F;p&gt;
&lt;p&gt;regalloc2 has been an immensely rewarding project for me, despite (or
perhaps because of) the ups-and-downs inherent in building an
honest-to-goodness, actually-works,
somewhat-competitive-with-peer-compilers register allocator. It was a
far larger project than I had anticipated: when I began, I told my
manager it would probably be a few weeks to evaluate scope, maybe a
month of work total. Witness &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Hofstadter%27s_law&quot;&gt;Hofstadter&#x27;s
Law&lt;&#x2F;a&gt; in action: that
is, it will always take longer than you think it will, even when
accounting for Hofstadter&#x27;s Law.&lt;&#x2F;p&gt;
&lt;p&gt;I hope some of the above lessons have been illuminating, and perhaps
this post has given some sense of how many interesting problems the
register-allocator space contains. It&#x27;s been a well-studied area for at
least 40 years now, with countless approaches and clever tricks to
learn and to combine in new ways; the work is far from over!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;acknowledgments&quot;&gt;Acknowledgments&lt;&#x2F;h2&gt;
&lt;p&gt;Many, many thanks to: &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;julian-seward1&quot;&gt;Julian
Seward&lt;&#x2F;a&gt; and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bnjbvr&quot;&gt;Benjamin
Bouvier&lt;&#x2F;a&gt; for numerous discussions about
register allocation throughout 2020, and Julian for several followup
discussions after regalloc2 started to exist; Julian Seward and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Amanieu&quot;&gt;Amanieu d&#x27;Antras&lt;&#x2F;a&gt; for initial code-review
of regalloc2 proper; Amanieu for a number of really high-quality PRs
to improve RA2 and add feature support; and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fitzgen&quot;&gt;Nick
Fitzgerald&lt;&#x2F;a&gt; for code-review of the (quite
extensive) Cranelift refactoring to use regalloc2. Enormous thanks to
Nick for reading over this entire post and providing feedback as well.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;which is to say, the original
&lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&#x2F;&quot;&gt;three&lt;&#x2F;a&gt;-&lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;01&#x2F;22&#x2F;cranelift-isel-2&#x2F;&quot;&gt;part&lt;&#x2F;a&gt;
&lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;03&#x2F;15&#x2F;cranelift-isel-3&#x2F;&quot;&gt;series&lt;&#x2F;a&gt; covered a range of
topics summarizing the goals and ideas of Cranelift&#x27;s new
backend design, but we haven&#x27;t stopped working to improve things
since then! The series is now four-thirds complete; by the time
I&#x27;m done it may be five-thirds or more...&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;In fact, it is perhaps the most important problem to solve for a
fast Wasm-focused compiler, because most other common compiler
optimizations will have been done at least to some degree to the
Wasm bytecode; register allocation is the main transform that
bridges the semantic gap from stack bytecode to machine code.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;Other sorts of constraints are possible too; in general, a
liverange is constrained by all of the &quot;register mentions&quot; in
instructions that touch the liverange&#x27;s vreg, and we have to
satisfy all of these constraints at once. A constraint may be
&quot;any register of this kind&quot;, or &quot;this particular physical
register&quot;, or &quot;a slot on the stack&quot;, or &quot;the same register as
given to another liverange&quot;, for example. And beyond
constraints, we may have soft &quot;hints&quot; as well, which if
followed, reduce the need to move values around.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;4&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;regalloc2 supports arbitrary control flow (i.e., does not impose
any restrictions on
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_graph#Reducibility&quot;&gt;reducibility&lt;&#x2F;a&gt;);
its only requirement is that &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_graph#Special_edges&quot;&gt;critical
edges&lt;&#x2F;a&gt;
are split, which Cranelift ensures by construction during
lowering.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;5&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;5&lt;&#x2F;sup&gt;
&lt;p&gt;Full credit for this idea, as well as most of the constraint
design in regalloc2, goes to IonMonkey&#x27;s register allocator.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;6&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;6&lt;&#x2F;sup&gt;
&lt;p&gt;As we&#x27;ll note under &quot;Lessons&quot; below, during development of a
compatibility layer that allowed regalloc2 to emulate
regalloc.rs, an earlier register allocator, we actually added a
&quot;modify&quot; kind of operand that directly corresponds to the
semantics of &lt;code&gt;rax&lt;&#x2F;code&gt; above, namely read-then-written all in one
register. We subsequently used it in several places while
migrating Cranelift. But for simplicity we hope to eventually
remove this (once all uses of it are gone).&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;7&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;7&lt;&#x2F;sup&gt;
&lt;p&gt;It&#x27;s actually a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;33611a68b90e40869bba52934449315a8f4e5477&#x2F;src&#x2F;indexset.rs#L13-L26&quot;&gt;sparse
bitset&lt;&#x2F;a&gt;
that, when large enough, stores a hashmap whose values are
&lt;em&gt;contiguous 64-bit chunks&lt;&#x2F;em&gt; of the whole bitset. This is because,
for large functions with thousands of virtual registers, keeping
thousands of bits per basic block would be impractical. However,
the naive sparse approach, where we keep a &lt;code&gt;HashSet&amp;lt;VReg&amp;gt;&lt;&#x2F;code&gt; or
equivalent, is also costly because it spends 32 bits per set
element (plus load-factor overhead). We observed that the live
registers at a given point are often &quot;clustered&quot;: there are some
long-range live values from early in the function, and then a
bunch of recently-defined registers. (This depends also on
virtual registers being numbered roughly in program order, which
is generally a good heuristic to rely on.) So we have a few
&lt;code&gt;u64&lt;&#x2F;code&gt;s and pay the sparse map cost for those, then have a dense
map within each 64-bit chunk.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
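A hedged sketch of the chunked representation this footnote describes (a hypothetical simplification of regalloc2's `IndexSet`, with invented names): a map keyed by chunk index whose values are dense 64-bit chunks, so clustered register numbers share map entries.

```rust
use std::collections::HashMap;

/// Sparse bitset storing dense 64-bit chunks in a map. Clustered
/// bit indices (e.g., vregs numbered in program order) land in the
/// same chunk, so the map stays small.
struct ChunkedBitSet {
    chunks: HashMap<u32, u64>,
}

impl ChunkedBitSet {
    fn new() -> Self {
        ChunkedBitSet { chunks: HashMap::new() }
    }

    fn insert(&mut self, bit: u32) {
        // High bits select the chunk; low 6 bits select the bit
        // within the chunk.
        let (chunk, offset) = (bit / 64, bit % 64);
        *self.chunks.entry(chunk).or_insert(0) |= 1u64 << offset;
    }

    fn contains(&self, bit: u32) -> bool {
        let (chunk, offset) = (bit / 64, bit % 64);
        self.chunks
            .get(&chunk)
            .map_or(false, |bits| bits & (1u64 << offset) != 0)
    }
}

fn main() {
    let mut live = ChunkedBitSet::new();
    // Two nearby vregs plus one far-away one: only two map entries.
    live.insert(3);
    live.insert(5);
    live.insert(4097);
    assert!(live.contains(3));
    assert!(!live.contains(4));
    assert!(live.contains(4097));
    assert_eq!(live.chunks.len(), 2);
}
```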
&lt;div class=&quot;footnote-definition&quot; id=&quot;8&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;8&lt;&#x2F;sup&gt;
&lt;p&gt;Credit must go to IonMonkey for &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;70cf6863bd85af2a3188ec1fe5209a3ec1b2de86&#x2F;js&#x2F;src&#x2F;jit&#x2F;BacktrackingAllocator.cpp#643-647&quot;&gt;this
trick&lt;&#x2F;a&gt;
as well, though the details of how to edit the liveranges
appropriately to get the right interference semantics were &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;33611a68b90e40869bba52934449315a8f4e5477&#x2F;src&#x2F;ion&#x2F;moves.rs#L762-L805&quot;&gt;far
from
clear&lt;&#x2F;a&gt;
and the path to our current approach was &quot;paved by fuzzbug
failures&quot;, so to speak.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;9&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;9&lt;&#x2F;sup&gt;
&lt;p&gt;Some literature on SSA form calls the connected set of
liveranges via phi-nodes or block parameters &quot;webs&quot;. Our notion
of a bundle encompasses this case but is a bit more general; in
principle we can merge any liveranges into a bundle as long as
they don&#x27;t overlap.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;10&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;10&lt;&#x2F;sup&gt;
&lt;p&gt;Actually, there are &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;ion&#x2F;data_structures.rs#L561-L570&quot;&gt;up to
seven&lt;&#x2F;a&gt;
parallel moves between instructions, at priorities according to
the way that various constraint edge-cases are lowered. For
example, when a single vreg must be placed in multiple physical
registers due to multiple uses with different fixed-register
constraints, the move that makes this happen occurs at
&lt;code&gt;MultiFixedReg&lt;&#x2F;code&gt; priority, which comes after the main
inter-instruction permutation (it is logically part of the input
setup for the following instruction). And &lt;code&gt;ReusedInput&lt;&#x2F;code&gt; moves
happen after that, because any one of the fixed-register inputs
could be reused as an input. The detailed reasoning for the
order here is beyond the scope of this blogpost, but suffice it
to say that the fuzzer helped immensely in getting this ordering
right!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;11&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;11&lt;&#x2F;sup&gt;
&lt;p&gt;Pertinent to the broader point about fuzzing, this combination
of constraints was not generated by RA2&#x27;s fuzz target, which is
why the resulting corner cases were not seen during
development. As soon as the fuzzing testcase generator was
extended to do so, the fuzzer found a counterexample within a
few seconds, and helped to verify the constraint rewrites in
RA2&#x27;s frontend that fixed this issue.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Cranelift, Part 3: Correctness in Register Allocation</title>
        <published>2021-03-15T00:00:00+00:00</published>
        <updated>2021-03-15T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2021/03/15/cranelift-isel-3/"/>
        <id>https://cfallin.org/blog/2021/03/15/cranelift-isel-3/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2021/03/15/cranelift-isel-3/">&lt;p&gt;This post is the last in a three-part series about
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&quot;&gt;Cranelift&lt;&#x2F;a&gt;.
In the &lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&#x2F;&quot;&gt;first post&lt;&#x2F;a&gt;, I covered
overall context and the instruction-selection problem; in the &lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;01&#x2F;22&#x2F;cranelift-isel-2&#x2F;&quot;&gt;second
post&lt;&#x2F;a&gt;, I took a deep dive into
compiler performance via careful algorithmic design.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, I want to dive into how we engineer for and work to
ensure &lt;em&gt;correctness&lt;&#x2F;em&gt;, which is perhaps the most important aspect of
any compiler project. A compiler is usually a complex beast: to obtain
reasonable performance, one must perform quite complex analyses and
carefully transform an arbitrary program in ways that preserve its
meaning. It is likely that one will make mistakes and miss subtle
corner cases, especially in the cracks and crevices between
components. Despite all of that, correct code generation is &lt;em&gt;vital&lt;&#x2F;em&gt;
because the consequences of miscompilation are potentially so severe:
basically any guarantee (security-related or otherwise) that we make
at a higher level of the system stack relies on the (quite
reasonable!) assumption that the computer will execute the source code
we have written faithfully. If the compiler translates our code to
something else, then all bets are off.&lt;&#x2F;p&gt;
&lt;p&gt;There are ways that one can apply good engineering principles to
reduce this risk. An extremely powerful technique derives from the
insight that &lt;em&gt;checking a result&lt;&#x2F;em&gt; is usually easier than &lt;em&gt;computing&lt;&#x2F;em&gt;
it, and if we randomly generate many inputs, run our compiler (or
other program) on these inputs, and check its output, we can get to a
&lt;em&gt;statistical approximation&lt;&#x2F;em&gt; of the claim &quot;for all inputs, the compiler
generates the correct output&quot;. The more random inputs we try, the
stronger this statement becomes. This technique is known as
&lt;em&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fuzzing&quot;&gt;fuzzing&lt;&#x2F;a&gt;&lt;&#x2F;em&gt; with a
&lt;em&gt;program-specific
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Test_oracle&quot;&gt;oracle&lt;&#x2F;a&gt;&lt;&#x2F;em&gt;, and I could
write a lengthy ode to its uncanny power to find bugs (many others
have, already).&lt;&#x2F;p&gt;
&lt;p&gt;In this post, I will cover how we worked to ensure correctness in our
register allocator,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&quot;&gt;regalloc.rs&lt;&#x2F;a&gt;, by
developing a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;tree&#x2F;main&#x2F;lib&#x2F;src&#x2F;checker.rs&quot;&gt;symbolic
checker&lt;&#x2F;a&gt;
that uses abstract interpretation to prove correctness for a specific
register allocation result. By using this checker as a fuzzing oracle,
and driving just the register allocator with a focused fuzzing target,
we have been able to uncover some very interesting and subtle bugs,
and achieve a fairly high confidence in the allocator&#x27;s robustness.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-register-allocation&quot;&gt;What is Register Allocation?&lt;&#x2F;h2&gt;
&lt;p&gt;Before we dive in, we need to cover a few basics. Most importantly:
what is the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Register_allocation&quot;&gt;register allocation
problem&lt;&#x2F;a&gt;, and what
makes it hard?&lt;&#x2F;p&gt;
&lt;p&gt;In a typical programming language, a program can have an arbitrary
number of variables or values in scope. This is a very useful
abstraction: it is easiest to describe an algorithm when one does not
have to worry about where to store the values.&lt;&#x2F;p&gt;
&lt;p&gt;For example, one could write the program:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;void f() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    int x0 = compute(0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    int x1 = compute(1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    int x99 = compute(99);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    consume(x0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    consume(x1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    consume(x99);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At the midpoint of the program (the &lt;code&gt;---&lt;&#x2F;code&gt; mark), there are 100
&lt;code&gt;int&lt;&#x2F;code&gt;-sized values that have been computed and are later used. When
the compiler produces machine code for this function, where are those
values stored?&lt;&#x2F;p&gt;
&lt;p&gt;For small functions with only a few values, it is easy to place every
value in a CPU register. But most CPUs do not have 100 general-purpose
registers for storing integers&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;; and in general, most languages
either do not place limits on the number of local variables or else
impose limits that are much, much higher than the typical number of
CPU registers. So we need some approach that scales beyond, say, about
16 values (x86-64) or about 32 values (aarch64) in use at once.&lt;&#x2F;p&gt;
&lt;p&gt;A very simple answer is to allocate a &lt;em&gt;memory&lt;&#x2F;em&gt; location for each local
variable. In fact this is exactly what the C programming model
provides: all of the &lt;code&gt;xN&lt;&#x2F;code&gt; variables above &lt;em&gt;semantically&lt;&#x2F;em&gt; live in
memory, and we can take the address &lt;code&gt;&amp;amp;xN&lt;&#x2F;code&gt;. If one does this, one will
find that the addresses are part of the &lt;em&gt;stack&lt;&#x2F;em&gt;. When the function is
called, it allocates a new area on the stack called the &lt;em&gt;stack frame&lt;&#x2F;em&gt;
and uses it to store local variables.&lt;&#x2F;p&gt;
&lt;p&gt;This is far from the best we can do, though! Consider what this means
when we actually perform some operation on the locals. If we read two
locals, perform an addition, and store the result in a third, like so:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;x0 = x1 + x2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;then in machine code, because most CPUs do not have instructions that
can read two in-memory values and write back a third in-memory result,
we would need to emit something like the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld r0, [address of x1]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld r1, [address of x2]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add r0, r0, r1  &#x2F;&#x2F; r0 := r0 + r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;st r0, [address of x0]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Compiling code in this way is very &lt;em&gt;fast&lt;&#x2F;em&gt; because we need to make
almost no decisions: a variable reference &lt;em&gt;always&lt;&#x2F;em&gt; becomes a memory
load, for example. This is how a &quot;baseline &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Just-in-time_compilation&quot;&gt;JIT
compiler&lt;&#x2F;a&gt;&quot;
typically works: for example, in the SpiderMonkey JS and
Wasm JIT compiler, the baseline tier -- which is meant to produce
passable code very, very quickly -- keeps a stack of values
in memory that corresponds one-to-one to the JS bytecode or Wasm
bytecode&#x27;s value stack. (You can read the code
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;38ed718a101aca27db25984413c052ccd8c0ceda&#x2F;js&#x2F;src&#x2F;jit&#x2F;CacheIRCompiler.h#301&quot;&gt;here&lt;&#x2F;a&gt;:
it actually keeps a few of the most recent values, at the top of the
operand stack, in fixed registers and the rest in memory.)&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, accessing memory multiple times for every operation is
very slow. What&#x27;s more, it is often the case that values are &lt;em&gt;reused
soon after being produced&lt;&#x2F;em&gt;: for example, we might have&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;x0 = x1 + x2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;x3 = x0 * 2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When we compute &lt;code&gt;x3&lt;&#x2F;code&gt; using &lt;code&gt;x0&lt;&#x2F;code&gt;, do we reload &lt;code&gt;x0&lt;&#x2F;code&gt;&#x27;s value from memory
immediately after storing it? A smarter compiler should be able to
remember that it had just computed the value, and should keep it in a
register, avoiding the round-trip through memory altogether.&lt;&#x2F;p&gt;
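&lt;p&gt;Concretely, the two statements above might compile to something like
this (a sketch; exact instructions vary by architecture):&lt;&#x2F;p&gt;

```
ld r0, [address of x1]
ld r1, [address of x2]
add r0, r0, r1          // r0 now holds x0's value
st r0, [address of x0]  // keep the in-memory copy up to date
mul r0, r0, 2           // reuse r0 directly -- no reload of x0
st r0, [address of x3]
```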
&lt;p&gt;This is &lt;em&gt;register allocation&lt;&#x2F;em&gt;: it is assigning a value in the program
to a register for storage. What makes register allocation interesting
is that (as noted above) there are fewer CPU registers than the number
of allowable program values, so we have to choose some subset of
values to keep in registers. This is often constrained in certain
ways: for example, an &lt;code&gt;add&lt;&#x2F;code&gt; instruction on RISC-like CPUs can only
read from, and write to, registers, so a value&#x27;s storage location must
be a register immediately before it is used by a &lt;code&gt;+&lt;&#x2F;code&gt;
operator. Fortunately, the location assignments can change over time,
so that at different points in the machine code, a register can be
assigned to hold different values. The job of the register allocator
is to decide how to shuffle values between memory and registers, and
between registers, so that at any given moment, each value that needs
to be in a register is in one.&lt;&#x2F;p&gt;
&lt;p&gt;In our design, the register allocator will accept as input a type of
almost-machine-code called &quot;virtual-register code&quot;, or &lt;code&gt;VCode&lt;&#x2F;code&gt;. This
has a sequence of machine instructions, but registers named in the
instructions are &lt;em&gt;virtual&lt;&#x2F;em&gt; registers: the compiler can use as many of
them as it needs. The register allocator will (i) rewrite the register
references in the instructions to be actual machine register names,
and (ii) insert instructions to shuffle data as needed. These
instructions are called &lt;em&gt;spills&lt;&#x2F;em&gt; when they move a value from a
register to memory; &lt;em&gt;reloads&lt;&#x2F;em&gt; when they move a value from memory back
to a register; and &lt;em&gt;moves&lt;&#x2F;em&gt; when they move values between
registers. The memory locations where values are stored when not in
registers are called &lt;em&gt;spill slots&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
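&lt;p&gt;As a rough sketch of this data model (hypothetical types, not
Cranelift&#x27;s actual &lt;code&gt;VCode&lt;&#x2F;code&gt; representation), we can model an
instruction as an opcode plus the registers it writes and reads, and step
(i) of the allocator&#x27;s job as a substitution of real registers for
virtual ones:&lt;&#x2F;p&gt;

```python
# Hypothetical model of virtual-register code: registers are plain
# strings, with "v0", "v1", ... virtual (unbounded supply) and
# "r0", "r1", ... real machine registers.  An instruction is
# (opcode, registers written, registers read).
vcode = [
    ("ld",  ["v0"], []),           # v0 := load
    ("ld",  ["v1"], []),           # v1 := load
    ("add", ["v2"], ["v0", "v1"]), # v2 := v0 + v1
]

def is_virtual(reg):
    return reg.startswith("v")

# Step (i) of the allocator's job: rewrite virtual-register references
# to real ones under a chosen mapping.  (Step (ii), inserting spills,
# reloads, and moves, is where all the difficulty lives.)
def rewrite(instrs, mapping):
    out = []
    for (op, writes, reads) in instrs:
        out.append((op,
                    [mapping[r] if is_virtual(r) else r for r in writes],
                    [mapping[r] if is_virtual(r) else r for r in reads]))
    return out

# With only two real registers, v2 can reuse r0 once v0 is dead.
allocated = rewrite(vcode, {"v0": "r0", "v1": "r1", "v2": "r0"})
```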
&lt;p&gt;An example of the register-allocation problem is shown below on a
program with four instructions:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-12-23-regalloc-web.svg&quot; alt=&quot;Figure: Register allocation&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This allocation is performed onto a machine with two registers (&lt;code&gt;r0&lt;&#x2F;code&gt;
and &lt;code&gt;r1&lt;&#x2F;code&gt;). On the left, the original program is written in an
assembly-like form with &lt;em&gt;virtual registers&lt;&#x2F;em&gt;. On the right, the program
has been modified to use only &lt;em&gt;real registers&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Between each instruction, we have written a mapping from virtual
registers to real registers. The register allocator&#x27;s task is just
(&quot;just&quot;!) to compute these mappings and then edit the instructions,
taking their register references through these mappings.&lt;&#x2F;p&gt;
&lt;p&gt;Note that the program, at one point, has &lt;em&gt;three&lt;&#x2F;em&gt; live values, or
values that still must be preserved because they will be used later:
between the first and second instructions, all of &lt;code&gt;v0&lt;&#x2F;code&gt;, &lt;code&gt;v1&lt;&#x2F;code&gt; and &lt;code&gt;v2&lt;&#x2F;code&gt;
are live.  The machine has only two registers, so it cannot hold all
live values in them; it must spill at least one. This is the reason
for the &lt;em&gt;spill instruction&lt;&#x2F;em&gt;, written as a store to the stack slot
&lt;code&gt;[sp+0]&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
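&lt;p&gt;The liveness notion used here is easy to compute for straight-line
code. Below is a toy backward scan (the example program is made up for
illustration; a real allocator must also handle control flow and
instruction constraints):&lt;&#x2F;p&gt;

```python
# Toy liveness analysis for straight-line code.  Each instruction is
# (writes, reads); we scan backward: a register is live before an
# instruction if it is read there, or if it is live after it and not
# written there.
def liveness(instrs):
    live = set()
    live_before = [None] * len(instrs)
    for i in reversed(range(len(instrs))):
        writes, reads = instrs[i]
        live = (live - set(writes)) | set(reads)
        live_before[i] = set(live)
    return live_before

# A made-up four-instruction program in which three values are live at
# once: v0, v1, and v2 must all survive until the last instruction.
program = [
    (["v0"], []),
    (["v1"], []),
    (["v2"], ["v0", "v1"]),
    (["v3"], ["v0", "v1", "v2"]),
]
# liveness(program)[3] is {"v0", "v1", "v2"}: with only two real
# registers, at least one of these values must be spilled.
```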
&lt;h2 id=&quot;how-hard-is-register-allocation&quot;&gt;How Hard is Register Allocation?&lt;&#x2F;h2&gt;
&lt;p&gt;In general, the register allocator will first analyze the program to
work out which values are live at which program points. This liveness
information and related constraints specify a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Combinatorial_optimization&quot;&gt;combinatorial
optimization&lt;&#x2F;a&gt;
problem: certain values must be stored &lt;em&gt;somewhere&lt;&#x2F;em&gt; at each point,
constraints limit which choices can be made and some choices will
conflict with some others (e.g., two values cannot occupy a register
at the same time), and a set of choices implies some cost (in data
movement). The allocator will solve this optimization problem as well
as it can using heuristics of some sort, depending on the register
allocator.&lt;&#x2F;p&gt;
&lt;p&gt;Is this a hard problem? In fact, it is not only hard in a colloquial sense,
but &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;NP-completeness&quot;&gt;NP-complete&lt;&#x2F;a&gt;: this
means that it is at least as hard as any other problem in NP, a class for
which we know only exponential-time brute-force algorithms in the worst
case.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; The
reason is that the problem does not have &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Optimal_substructure&quot;&gt;optimal
substructure&lt;&#x2F;a&gt;: it
cannot be decomposed into non-interacting parts that can each be solved
separately and then built up into an overall solution; rather, decisions at
one point affect decisions elsewhere, potentially anywhere else in the
function body. Thus, in the worst case, we can&#x27;t do better than a
brute-force search if we want an optimal solution.&lt;&#x2F;p&gt;
&lt;p&gt;There are many good &lt;em&gt;approximations&lt;&#x2F;em&gt; to optimal register allocation. A
common one is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;330249.330250&quot;&gt;linear-scan register
allocation&lt;&#x2F;a&gt;, which can run in
almost-linear time (with respect to the code size). Allocators that can
afford to spend more time are more complex: for example, in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&quot;&gt;regalloc.rs&lt;&#x2F;a&gt;, in addition
to the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;tree&#x2F;main&#x2F;lib&#x2F;src&#x2F;linear_scan&quot;&gt;linear-scan
implementation&lt;&#x2F;a&gt;
(written by my brilliant colleague Benjamin Bouvier), we have a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;blob&#x2F;main&#x2F;lib&#x2F;src&#x2F;bt_main.rs&quot;&gt;&quot;backtracking&quot;
algorithm&lt;&#x2F;a&gt;
(written by my other brilliant colleague Julian Seward) that can edit and
improve its choices as it discovers higher-priority uses for registers.&lt;&#x2F;p&gt;
&lt;p&gt;The details of how these algorithms work do not really matter here,
except to say that they are &lt;em&gt;very&lt;&#x2F;em&gt; complicated and hard to get
right. An algorithm that appears relatively simple at the conceptual
level or in pseudocode quickly runs into interesting and subtle
considerations as real-world constraints creep in. The regalloc.rs
codebase is about 25K lines of deeply-algorithmic Rust code; any
reasonable engineer would expect this to include at least several
bugs! Compounding the urgency here, a register-allocation bug can
result in &lt;em&gt;arbitrary&lt;&#x2F;em&gt; incorrect results, because the register
allocator is in charge of &quot;wiring up&quot; all of the dataflow in the
program. If we exchange one arbitrary value with another arbitrary
value in the program, anything could happen.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-to-verify-correctness&quot;&gt;How to Verify Correctness&lt;&#x2F;h2&gt;
&lt;p&gt;So we want to write a correct register allocator. How do we even start
on a task like this?&lt;&#x2F;p&gt;
&lt;p&gt;It might help to break down what we mean by &quot;correct&quot;. Note that the
register allocation problem has a nice property: the programs both
&lt;em&gt;before&lt;&#x2F;em&gt; and &lt;em&gt;after&lt;&#x2F;em&gt; allocation have a well-defined semantics. In
particular, we can think of register allocation as a transformation
that converts programs running on an &lt;em&gt;infinite-register machine&lt;&#x2F;em&gt;
(where we can use as many virtual registers as we want) to a
&lt;em&gt;finite-register machine&lt;&#x2F;em&gt; (where the CPU has a fixed set of
registers). If the original program on the infinite-register machine
yields the same result as the transformed (register-allocated) program
on the finite-register machine, then we have achieved a correct
register allocation.&lt;&#x2F;p&gt;
&lt;p&gt;How do we test this equivalence?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;single-program-single-input-equivalence&quot;&gt;Single-Program, Single-Input Equivalence&lt;&#x2F;h3&gt;
&lt;p&gt;The simplest way to test whether two programs are equivalent is to run
them and compare the results! Let&#x27;s say we do this: for a single
program, choose some random inputs, and run the virtual-registerized
program alongside its register-allocated version on the appropriate
interpreters. Compare register and memory state at the end.&lt;&#x2F;p&gt;
&lt;p&gt;What does it mean if the final machine states match? It means that
&lt;em&gt;for this one program&lt;&#x2F;em&gt;, our register allocator produces a transformed
program that is correct &lt;em&gt;for this one program input&lt;&#x2F;em&gt;. Note the two
qualifications here. First, we have not necessarily shown that the
register allocation is correct given another program input. Perhaps a
different input causes a branch to go down another program path, and
the register allocator introduced an error on that path. Second, we
have not shown anything for any other program; we have only tested a
single program and its register-allocated output.&lt;&#x2F;p&gt;
&lt;p&gt;We can attempt to address the first limitation -- correctness only
under one input -- by taking more sample points. For example, we could
choose a thousand random program inputs, and even drive this random
choice with some sort of feedback that tries to maximize control-flow
coverage or other &quot;interesting&quot; behavior (as fuzzers do). We could
probably achieve reasonable confidence that this single register
allocation result is correct, given enough test cases.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2021-03-10-single-program-single-input-web.svg&quot; alt=&quot;Figure: Checking a program with concrete inputs&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;However, this is still very &lt;em&gt;expensive&lt;&#x2F;em&gt;: we are asking to run the
whole program N times to get a sample size of N. Even a single
execution may be expensive: the program on which we have performed
register allocation might be a compiler, or a videogame, for example.&lt;&#x2F;p&gt;
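&lt;p&gt;A toy version of this single-input differential check (hypothetical
mini-ISA and harness, not regalloc.rs&#x27;s actual fuzzing setup) runs the
virtual-register program and its register-allocated form on the same
random inputs and compares results:&lt;&#x2F;p&gt;

```python
import random

# Tiny interpreter over a toy ISA.  Instructions:
#   ("ld", dst, addr)    dst := memory[addr]
#   ("add", dst, a, b)   dst := a + b
#   ("ret", src)         return the value in src
def run(program, memory):
    regs = {}
    for instr in program:
        if instr[0] == "ld":
            regs[instr[1]] = memory[instr[2]]
        elif instr[0] == "add":
            regs[instr[1]] = regs[instr[2]] + regs[instr[3]]
        elif instr[0] == "ret":
            return regs[instr[1]]

# The same computation before and after register allocation.
before = [("ld", "v0", "A"), ("ld", "v1", "B"),
          ("add", "v2", "v0", "v1"), ("ret", "v2")]
after = [("ld", "r0", "A"), ("ld", "r1", "B"),
         ("add", "r0", "r0", "r1"), ("ret", "r0")]

# Each iteration samples one input point; a mismatch would flag a bug.
for _ in range(1000):
    mem = {"A": random.randrange(2**32), "B": random.randrange(2**32)}
    assert run(before, mem) == run(after, mem)
```

&lt;p&gt;Each passing input adds only one sample point of confidence, and each
sample costs a full execution of both programs.&lt;&#x2F;p&gt;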
&lt;h3 id=&quot;single-program-for-all-input-equivalence&quot;&gt;Single-Program, For-all-Input Equivalence&lt;&#x2F;h3&gt;
&lt;p&gt;Can we avoid the need to run the program &lt;em&gt;at all&lt;&#x2F;em&gt; to test that its
register-allocated version is correct?&lt;&#x2F;p&gt;
&lt;p&gt;The answer is surprisingly simple: yes, we can, by simply altering the
&lt;em&gt;domain&lt;&#x2F;em&gt; that the program executes on. Ordinarily we think of CPU
registers as containing concrete numbers -- say, 64-bit values. What
if they contained &lt;em&gt;symbols&lt;&#x2F;em&gt; instead?&lt;&#x2F;p&gt;
&lt;p&gt;By generalizing over program values with symbols, we can often
represent the state of the system in terms of inputs without caring
what those inputs are. For example, given the program:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld v0, [A]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld v1, [B]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld v2, [C]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add v3, v0, v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add v4, v2, v3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return v4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;register-allocated to:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld r0, [A]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld r1, [B]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld r2, [C]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add r0, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add r0, r2, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;without symbolic reasoning, we could store arbitrary integers to
memory locations &lt;code&gt;A&lt;&#x2F;code&gt;, &lt;code&gt;B&lt;&#x2F;code&gt; and &lt;code&gt;C&lt;&#x2F;code&gt; and simulate the program&#x27;s execution
before and after register allocation, never seeing a mismatch, but
this would not prove anything unless we iterated through all possible
values. However, if we suppose that after the three loads, &lt;code&gt;r0&lt;&#x2F;code&gt;
contains &lt;code&gt;v0&lt;&#x2F;code&gt; (as a symbolic value, whatever it is), &lt;code&gt;r1&lt;&#x2F;code&gt; contains
&lt;code&gt;v1&lt;&#x2F;code&gt;, and &lt;code&gt;r2&lt;&#x2F;code&gt; contains &lt;code&gt;v2&lt;&#x2F;code&gt;, and that &lt;code&gt;r0&lt;&#x2F;code&gt; contains &lt;code&gt;v3&lt;&#x2F;code&gt; after the
first add and &lt;code&gt;v4&lt;&#x2F;code&gt; after the second add, we can see the correspondence
by matching up the symbols.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2021-03-10-single-program-symbolic-web.svg&quot; alt=&quot;Figure: Checking a program with symbolic values&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This is a very simple example, and perhaps under-sells the insight and
power of this approach; we will come back to it later when we talk
about &lt;em&gt;Abstract Interpretation&lt;&#x2F;em&gt; below.&lt;&#x2F;p&gt;
&lt;p&gt;In any case, what we have shown is that for a single instance of the
register-allocation problem, we can &lt;em&gt;prove&lt;&#x2F;em&gt; that it transformed the
program in a correct way. Concretely, this means that the machine code
that we generate will execute just as if we were interpreting the
virtual-register code; if we can correctly generate virtual-register
code, then our compiler is correct. That&#x27;s excellent! Can we go
further?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;for-all-programs-equivalence&quot;&gt;For-all-Programs Equivalence&lt;&#x2F;h3&gt;
&lt;p&gt;We could prove a-priori that the register allocator will &lt;em&gt;always&lt;&#x2F;em&gt;
transform &lt;em&gt;any&lt;&#x2F;em&gt; program in a way that is correct. In other words, we
could abstract not only over the input values to the program, but over
the &lt;em&gt;program&lt;&#x2F;em&gt; itself.&lt;&#x2F;p&gt;
&lt;p&gt;If we can prove this, then we have no need to run any sort of check at
runtime. Abstracting over program inputs lets us avoid the need to run
the program; we know the register allocation is correct for all
inputs. In an analogous way, abstracting over the program to be
register-allocated would let us avoid the need to run the register
allocator; we know the register &lt;em&gt;allocator&lt;&#x2F;em&gt; is correct for all
&lt;em&gt;programs&lt;&#x2F;em&gt; and for all &lt;em&gt;inputs&lt;&#x2F;em&gt; to those programs.&lt;&#x2F;p&gt;
&lt;p&gt;One can imagine that this is much harder. In fact, it has been done,
but is a significant proof-engineering effort, and is a realm of
active research: this basically requires writing a machine-verifiable
proof that one&#x27;s compiler algorithms are correct. Such proven-correct
compilers exist: e.g., &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;compcert.org&#x2F;&quot;&gt;CompCert&lt;&#x2F;a&gt; has been
proven to compile C correctly to machine code for several
platforms. Unfortunately, such efforts are strongly limited by the
proof-engineering effort that is required, and thus this approach is
unlikely to be feasible for a compiler unless verified correctness is its
primary goal.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;our-approach-allocator-with-checker&quot;&gt;Our Approach: Allocator with Checker&lt;&#x2F;h3&gt;
&lt;p&gt;Given all of the above, we choose what we believe is the most
reasonable tradeoff: we build a &lt;em&gt;symbolic checker&lt;&#x2F;em&gt; for the &lt;em&gt;output&lt;&#x2F;em&gt; of
the register allocator. This does not let us make a static claim that
our register allocator is correct, but it &lt;em&gt;does&lt;&#x2F;em&gt; let us &lt;em&gt;prove&lt;&#x2F;em&gt; that
it is correct for any given compiler run; and if we use this as a
fuzzing oracle, we can build &lt;em&gt;statistical confidence&lt;&#x2F;em&gt; that it is
correct for all compiler runs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;checking-the-register-allocator&quot;&gt;Checking the Register Allocator&lt;&#x2F;h2&gt;
&lt;p&gt;Our overall flow is pictured below:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2021-03-10-checker-web.svg&quot; alt=&quot;Figure: Augmenting register allocation with a checker&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;There are two ways in which we can add a
register-allocator checker into the system. The first, on the left, we call
&quot;runtime checking&quot;: in this mode, every register allocator execution is checked
and the machine code using the allocations is not permitted to execute (i.e.
the compiler does not return a result) until the checker verifies equivalence.
This is the safest mode: it provides the same guarantees as a proven-correct
allocator (&quot;for-all-programs equivalence&quot; above). However, it imposes some
overhead on every compilation, which may not be desirable. For this reason,
while running the register allocator with the checker is a supported option in
Cranelift, it is not the default.&lt;&#x2F;p&gt;
&lt;p&gt;The second mode is one in which we apply the checker to a &lt;em&gt;fuzzing&lt;&#x2F;em&gt; workflow,
and is the approach we have generally preferred (we have a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;blob&#x2F;main&#x2F;fuzz&#x2F;fuzz_targets&#x2F;bt.rs&quot;&gt;fuzz
target&lt;&#x2F;a&gt;
in regalloc.rs that generates arbitrary input programs and runs the
checker on each one; and we are &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google&#x2F;oss-fuzz&#x2F;blob&#x2F;4a6021021043bddfa017df7d0aea26ad76edbba0&#x2F;projects&#x2F;wasmtime&#x2F;build.sh#L59&quot;&gt;running this
continuously&lt;&#x2F;a&gt;
as part of Wasmtime&#x27;s membership in Google&#x27;s
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google&#x2F;oss-fuzz&#x2F;&quot;&gt;oss-fuzz&lt;&#x2F;a&gt; continuous-fuzzing
initiative). In this mode, we use the checker as an
application-specific oracle for a fuzzing engine: as the fuzzing engine
generates random programs (test cases), we run the register allocator over
these programs, run the checker on the result, and tell the engine whether the
register allocator passed or failed.  The fuzzer will flag any failing test
cases for a human developer to debug. If the fuzzer runs for a long time
without finding any issues, we can then have more confidence that the register
allocator is correct, even without running the checker; and the longer the
fuzzer runs, the greater our confidence becomes. The application-specific
oracle significantly improves on more generic fuzzer feedback mechanisms, such
as program crashes or incorrect output: a register-allocator bug may not
immediately manifest in incorrect execution, or when it does, the resulting
crash may have no obvious connection to the actual mis-allocated register. The
checker is able to point to a specific register use at a specific instruction
and say &quot;this register is wrong&quot;. Such a result makes for much smoother
debugging!&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s now walk through how we build the &quot;checker&quot; whose goal is to
verify a particular register allocation is correct. We will come at
the solution in stages, first reasoning about the easiest case --
straight-line code -- and then introducing control flow. At the end,
we&#x27;ll have a simple algorithm that runs in linear time (relative to
code size) and whose simplicity allows us to be reasonably confident
in its guarantees.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;symbolic-equivalence-and-abstract-interpretation&quot;&gt;Symbolic Equivalence and Abstract Interpretation&lt;&#x2F;h3&gt;
&lt;p&gt;Recall that we described above a sort of symbolic interpretation of
execution: one can reason about CPU registers containing &quot;symbolic&quot;
values, where each symbol represents a virtual register in the
original code. For example, we can take the code&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov v0, 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov v1, 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add v2, v0, v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return v2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and a register-allocated form of that code&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov r0, 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov r1, 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add r0, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and &lt;em&gt;somehow&lt;&#x2F;em&gt; find a set of substitutions that makes them equivalent:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov r0, 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      [ r0 = v0 ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov r1, 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      [ r1 = v1 ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add r0, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      [ r0 = v2 ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But how do we solve for these substitutions? Recall that above we
hinted at a form of execution that operates on symbols rather than
values. We can simply take the semantics of the original instruction
set, and reformulate it to operate on symbolic values instead, and
then step through the code to find a representation of &lt;em&gt;all
executions&lt;&#x2F;em&gt; at once. This is called symbolic execution, and
with some enhancements described below, is the basis of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Abstract_interpretation&quot;&gt;abstract
interpretation&lt;&#x2F;a&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#4&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;.
It is a very powerful technique!&lt;&#x2F;p&gt;
&lt;p&gt;What are the semantics of the instruction set that are relevant here?
It turns out, because the register allocator does not modify any of
the program&#x27;s original instructions&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#5&quot;&gt;5&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;, we can understand each
instruction as &lt;em&gt;mostly&lt;&#x2F;em&gt; an arbitrary, opaque operator. The only
important pieces of information are which registers it &lt;em&gt;reads&lt;&#x2F;em&gt; (before
its operation) and which it &lt;em&gt;writes&lt;&#x2F;em&gt; (after its operation).&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#6&quot;&gt;6&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;However, to verify the output of the register allocator when it
&lt;em&gt;spills&lt;&#x2F;em&gt; values, and when it &lt;em&gt;moves&lt;&#x2F;em&gt; values between registers, we
need special knowledge of spills, reloads, and moves. Hence,
we can reduce the input program to a sort of minimal ISA that captures
only what is important for symbolic reasoning (the real definition is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;blob&#x2F;109455ce4cea07a6e8d87e06d200c1318605c0ea&#x2F;lib&#x2F;src&#x2F;checker.rs#L393-L430&quot;&gt;here&lt;&#x2F;a&gt;;
we simplify a bit for this post):&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Spill &amp;lt;spillslot&amp;gt;, &amp;lt;CPU register&amp;gt;&lt;&#x2F;code&gt;: copy data (symbol representing
virtual register) from a register to a spill slot.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Reload &amp;lt;CPU register&amp;gt;, &amp;lt;spillslot&amp;gt;&lt;&#x2F;code&gt;: copy data from a spill slot to
a register.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Move &amp;lt;CPU register&amp;gt;, &amp;lt;CPU register&amp;gt;&lt;&#x2F;code&gt;: move data from one CPU
register to another (N.B.: &lt;em&gt;only&lt;&#x2F;em&gt; regalloc-inserted moves are
recognized as a &lt;code&gt;Move&lt;&#x2F;code&gt;, not moves in the original input program.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Op read:&amp;lt;CPU register list&amp;gt;, read_orig:&amp;lt;virtual register list&amp;gt; write:&amp;lt;CPU register list&amp;gt; write_orig:&amp;lt;virtual register list&amp;gt;&lt;&#x2F;code&gt;: some
arbitrary operation that reads some registers and writes some other
registers.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The last instruction is the most interesting: notice that it carries
the &lt;em&gt;original&lt;&#x2F;em&gt; virtual registers as well as the
&lt;em&gt;post-register-allocation CPU registers&lt;&#x2F;em&gt; for the instruction. The need
for this will become clearer below, but the intuition is that we
need to see &lt;em&gt;both&lt;&#x2F;em&gt; in order to establish the &lt;em&gt;correspondence&lt;&#x2F;em&gt; between
the two.&lt;&#x2F;p&gt;
&lt;p&gt;We can produce the above instructions while the register allocator is
scanning over the code and editing it; that part is a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;blob&#x2F;d1956e08b5a3d6759bccf067c9739cd1a8c23be9&#x2F;lib&#x2F;src&#x2F;checker.rs#L16-L42&quot;&gt;straightforward
translation&lt;&#x2F;a&gt;. Once
we have the &lt;em&gt;abstracted&lt;&#x2F;em&gt; program, we can &quot;execute&quot; it over the domain
of symbols. How do we do this? With the following rules:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We maintain some &lt;em&gt;state&lt;&#x2F;em&gt;, just as a real CPU does: for each CPU
register, and for each location in the stack frame, we track a
&lt;em&gt;symbol&lt;&#x2F;em&gt; (rather than an integer value). This symbol can be a
virtual-register name, if we know that the storage location
currently contains that register&#x27;s value. It can also be &lt;code&gt;Unknown&lt;&#x2F;code&gt;,
if the checker doesn&#x27;t know, or &lt;code&gt;Conflicted&lt;&#x2F;code&gt;, if the value could be
one of several virtual registers. (The difference between the latter
two will become clear when we discuss control-flow below. For now
it&#x27;s enough to see that we abstract the state to: either we know the
slot contains a program value, symbolically, or we know nothing.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;When we see a &lt;code&gt;Spill&lt;&#x2F;code&gt;, &lt;code&gt;Reload&lt;&#x2F;code&gt;, or &lt;code&gt;Move&lt;&#x2F;code&gt;, we copy the symbolic
state from the source location (register or spill slot) to the
destination location. In other words, we know that these
instructions always move the integer value of a register or memory
word, whatever it may be; so if we have knowledge about the source
location, symbolically for all possible executions, then we can
extend that knowledge to the destination as well.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;When we see an &lt;code&gt;Op&lt;&#x2F;code&gt;, we do some checks and then some updates:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For each &lt;em&gt;read&lt;&#x2F;em&gt; (input) register, we examine the symbolic value
stored in the given CPU register (post-allocation location). If
that symbol matches the virtual register that the original
instruction used, then the allocator has properly conveyed the
virtual register&#x27;s value to its use here, and thus the allocation
is &lt;em&gt;correct&lt;&#x2F;em&gt; (preserves program dataflow). If not, we can signal a
checker error, and look for the bug in our register allocator. We
know &lt;em&gt;for sure&lt;&#x2F;em&gt; it must be a bug (i.e., there are no false
positives), because we only track a symbol for a storage location
when we have proven (for all executions!) that that storage location
must contain that virtual register&#x27;s value.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;For each &lt;em&gt;write&lt;&#x2F;em&gt; (output) register, we set the symbolic value
stored in the given CPU register to be the given (pre-allocation)
virtual register. In other words, each write &lt;em&gt;produces&lt;&#x2F;em&gt; a
symbol. This symbol then flows through the program, moving via
spills&#x2F;reloads&#x2F;moves, until it reaches consumers.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
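&lt;p&gt;The per-instruction rules above can be sketched in a few lines. This is
purely an illustrative sketch in Python (the real checker is written in Rust,
and the instruction encoding and names here are hypothetical, not
regalloc.rs&#x27;s actual API):&lt;&#x2F;p&gt;

```python
# Straight-line transfer rules of the checker, sketched in Python
# (hypothetical encoding; the real checker is written in Rust).
UNKNOWN = "unknown"

def step(state, inst):
    """Apply one instruction to `state`, a dict mapping storage
    locations (real registers and spill slots) to symbols."""
    kind = inst["kind"]
    if kind in ("spill", "reload", "move"):
        # Pure data movement: copy the source location's symbol.
        state[inst["dst"]] = state.get(inst["src"], UNKNOWN)
    elif kind == "op":
        # Reads: the allocated register must hold the symbol of the
        # virtual register the original instruction named.
        for vreg, rreg in inst["reads"]:
            if state.get(rreg, UNKNOWN) != vreg:
                raise ValueError("checker error: %s does not hold %s" % (rreg, vreg))
        # Writes: each definition produces a fresh symbol that then
        # flows through spills, reloads, and moves to its uses.
        for vreg, rreg in inst["writes"]:
            state[rreg] = vreg
    return state
```

&lt;p&gt;Running this over a def, a spill, and a reload propagates the defining
virtual register&#x27;s symbol to the reloaded register, so a later use check
passes; a mismatched read raises an error immediately.&lt;&#x2F;p&gt;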
&lt;p&gt;And that&#x27;s it! We can prove in a fairly straightforward way that this
is exactly correct -- produces no false positives or false negatives
-- for straight-line code (code with no jumps). We can do this by
induction: if the symbolic state is correct before an instruction,
then the above rules just encode the data movement that the concrete
program performs, and the symbolic state will be updated in the same
way, so the symbolic state after the instruction is also correct.&lt;&#x2F;p&gt;
&lt;p&gt;Note that this is &lt;em&gt;linear&lt;&#x2F;em&gt; as well -- so it&#x27;s very fast, with a single
scan over straight-line code. This is possible because we have &lt;em&gt;help&lt;&#x2F;em&gt;
from the register allocator: we know about spills, reloads, and
register allocator-inserted moves, and we have pre- and
post-allocation registers for all other instructions. Consider what we
would have to do if we did not know about these, but only saw machine
instructions. In that case, any load, store or move instruction could
have come from the allocator or from the original program. We would
have nothing but a graph of operators with connectivity between them,
and we would have to solve a &lt;em&gt;graph isomorphism&lt;&#x2F;em&gt; problem. That is much
harder, and much slower!&lt;&#x2F;p&gt;
&lt;p&gt;So are we done? Not quite: we have only considered straight-line
code. What happens when we encounter a jump?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;control-flow-joins-lattices-and-iterative-dataflow-analysis&quot;&gt;Control-Flow Joins, Lattices, and Iterative Dataflow Analysis&lt;&#x2F;h3&gt;
&lt;p&gt;Control-flow makes analysis interesting because it allows for
&lt;em&gt;multiple possibilities&lt;&#x2F;em&gt;. Consider a simple program with an
if-then-else pattern (a &quot;control-flow diamond&quot;, as it is sometimes
called, due to its shape):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2021-03-10-diamond-predicate-web.svg&quot; alt=&quot;Figure: A control-flow diamond and symbolic analysis&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s say that a symbolic analysis decides that on the left branch,
&lt;code&gt;r0&lt;&#x2F;code&gt; has symbolic state &lt;code&gt;A&lt;&#x2F;code&gt;, and on the right branch, it has symbolic
state &lt;code&gt;B&lt;&#x2F;code&gt;. What state does it have in the lower block, after the two
paths re-join?&lt;&#x2F;p&gt;
&lt;p&gt;We can give a precise answer if we are allowed to &quot;predicate&quot;, or make
the answer conditional on some other program state. For example, if we
knew that the if-condition were represented by some symbol &lt;code&gt;C&lt;&#x2F;code&gt; that
has a boolean type, we could invent an abstract expression language
and then write &lt;code&gt;if C { A } else { B }&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;However, this quickly becomes untenable. We will find that programs
with loops lead to &lt;em&gt;unbounded&lt;&#x2F;em&gt; symbolic expressions. (To see this,
consider that a symbolic representation can have a size larger than
its inputs. Any cyclic data dependency around a loop could thus
generate an infinitely-large symbolic representation.) Even with only
acyclic control flow, path-sensitive symbolic expressions can grow
exponentially with program size: consider that a program with &lt;code&gt;N&lt;&#x2F;code&gt;
basic blocks and no loops can have &lt;code&gt;O(2^N)&lt;&#x2F;code&gt; paths through those
blocks, and fully precise symbolic expressions would need to capture
the effects of each of those paths.&lt;&#x2F;p&gt;
&lt;p&gt;We thus need some way to &lt;em&gt;approximate&lt;&#x2F;em&gt;. Note that an abstract
interpretation of a program need not precisely capture all of the
program&#x27;s behavior losslessly. For example, we might perform a simple
abstract interpretation analysis that only tracks possible numeric
signs (positive, negative, unknown) for integer variables. So it is
always fine to &quot;summarize&quot; and drop detail to remain tractable. Let us
thus consider how we might &quot;merge&quot; state when multiple possibilities
exist.&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that there is a very nice mathematical object that
captures the notion of &quot;merging&quot; in a way that is very useful: the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lattice_(order)&quot;&gt;lattice&lt;&#x2F;a&gt;.  A lattice
consists of a set of elements and a &lt;em&gt;partial order&lt;&#x2F;em&gt; between them,
together with a least element &quot;bottom&quot; and a greatest element &quot;top&quot;,
an operator called &quot;meet&quot; that finds the &quot;greatest lower bound&quot; for
any two elements (the largest element that is less than or equal to its two
operands) and a &quot;join&quot; that finds the &quot;least upper bound&quot; (the dual of
the above).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;Hasse_diagram_of_powerset_of_3.svg&quot; alt=&quot;Figure: A lattice&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;(Figure credit:
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;commons.wikimedia.org&#x2F;wiki&#x2F;File:Hasse_diagram_of_powerset_of_3.svg&quot;&gt;Wikimedia&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;creativecommons.org&#x2F;licenses&#x2F;by-sa&#x2F;3.0&#x2F;deed.en&quot;&gt;CC BY-SA
3.0&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;An extremely useful property of lattices is that their merging operations
of meet and join are &lt;em&gt;commutative, associative and idempotent&lt;&#x2F;em&gt;. This is a
formal way of saying that the result only depends on the set of elements
&quot;thrown into the mix&quot;, in any order and with any repetition. In other
words, the meet of many elements is a function only of the set of elements,
not of the order in which we process them.&lt;&#x2F;p&gt;
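&lt;p&gt;The powerset lattice in the figure makes this concrete: the elements are
subsets of {1, 2, 3} ordered by inclusion, with intersection as meet and union
as join. A tiny illustrative check of the properties just described:&lt;&#x2F;p&gt;

```python
# The powerset lattice of {1, 2, 3} from the figure: ordering is set
# inclusion, meet is intersection, join is union.
a, b, c = {1, 2}, {2, 3}, {1, 3}
meet = lambda x, y: x.intersection(y)
join = lambda x, y: x.union(y)

# The merge depends only on the set of inputs, not order or repetition:
assert meet(a, b) == meet(b, a)                    # commutative
assert meet(meet(a, b), c) == meet(a, meet(b, c))  # associative
assert meet(a, a) == a                             # idempotent
```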
&lt;p&gt;How is this useful? If we define particular analysis states -- and as
a reminder, in our specific case, these are maps from CPU registers
and spillslots to symbolic virtual registers -- to be lattice
elements, and define a &quot;meet function&quot; that somehow merges the states
-- then we can use this merging behavior to implement a sort of
program analysis over all programs, &lt;em&gt;including&lt;&#x2F;em&gt; those with loops,
without unbounded analysis growth! This is called the
&quot;meet-over-all-paths&quot; solution and is a standard way that compilers
perform &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Data-flow_analysis&quot;&gt;dataflow
analysis&lt;&#x2F;a&gt; today.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#7&quot;&gt;7&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;To understand how a lattice describes &quot;merging&quot; in a program analysis in a
useful way, one can see the lattice ordering relation (the arrows in the
figure above) as denoting that one state is more or less refined (contains
more or less knowledge) than another. One starts at the &quot;greatest&quot; or
&quot;top&quot; element: anything could be true; we know nothing. We then move to
progressively more refined states.  One analysis state is ordered &quot;less
than&quot; another if it captures all the constraints we have learned in the
other state, plus some new ones. The &quot;meet&quot; operator, which computes the
greatest lower bound, will thus give us an analysis state that captures all
of the knowledge in both inputs, and no more.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#8&quot;&gt;8&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The general approach to performing an analysis on an arbitrary CFG
is as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;We define our analysis state as a &lt;em&gt;lattice&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We track the current analysis state at each &lt;em&gt;program point&lt;&#x2F;em&gt;, or
point between instructions. Initially, the state at every program
point is the &quot;top&quot; lattice element; as values meet, they move
&quot;down&quot; the lattice, toward the &quot;bottom&quot; element.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We process the effect of each instruction, computing the state at
its post-program-point from its pre-program-point.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;When analysis state reaches a control-flow edge, we propagate the
state across the edge, and &lt;em&gt;meet&lt;&#x2F;em&gt; it with the incoming state from
all other edges. This may then lead us to recompute states in the
destination block.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We run a &quot;fixpoint&quot; loop, processing updates as analysis states at
block entries change, until no more changes occur.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
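&lt;p&gt;The five steps above amount to a standard worklist fixpoint loop. Here is
a minimal forward-analysis sketch (hypothetical names, not Cranelift&#x27;s or
regalloc.rs&#x27;s actual code), exercised on a control-flow diamond:&lt;&#x2F;p&gt;

```python
# A generic forward dataflow fixpoint over a CFG (an illustrative
# sketch). `transfer` applies a block's instruction semantics;
# `meet` merges states at control-flow joins.
def fixpoint(succs, entry, entry_state, transfer, meet):
    state_in = {entry: entry_state}
    worklist = [entry]
    while worklist:
        b = worklist.pop()
        out = transfer(b, state_in[b])
        for s in succs[b]:
            # Meet with any state already seen at this block entry.
            new = out if s not in state_in else meet(state_in[s], out)
            if state_in.get(s) != new:
                state_in[s] = new
                worklist.append(s)
    return state_in

# A control-flow diamond: A branches to B and C, which rejoin at D.
succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
transfer = lambda b, s: {"B": "v1", "C": "v2"}.get(b, s)
meet = lambda x, y: x if x == y else "conflicted"
result = fixpoint(succs, "A", "unknown", transfer, meet)
# The two distinct definitions meet at D as "conflicted".
```

&lt;p&gt;Termination relies on the state lattice having no infinite descending
chains: each block&#x27;s entry state can only move &quot;down&quot; the lattice a bounded
number of times, so the worklist eventually drains.&lt;&#x2F;p&gt;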
&lt;p&gt;In this way, we find a solution to the dataflow problem that satisfies
all of the instruction semantics for &lt;em&gt;any&lt;&#x2F;em&gt; path through the
program. It may not be fully precise (i.e., it may not answer every
question) -- because it is often impossible to capture a fully precise
answer for executions that include loops, and impractical for programs
with significant control-flow -- but it is &lt;em&gt;sound&lt;&#x2F;em&gt;, in the sense that
any claims we make from the analysis result will be &lt;em&gt;correct&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-register-checker-as-a-dataflow-analysis-problem&quot;&gt;A Register Checker as a Dataflow Analysis Problem&lt;&#x2F;h2&gt;
&lt;p&gt;We now have all of the pieces that we need in order to check the
register-allocator output for any program. We saw above that we could
model the machine state symbolically for any straight-line code, which
allows us to detect register allocator errors exactly (no false
negatives and no false positives) as long as there is no control
flow. We then discussed the usual static analysis approach to control
flow. How can we combine the two?&lt;&#x2F;p&gt;
&lt;p&gt;The answer is that we define a &lt;em&gt;lattice&lt;&#x2F;em&gt; of &lt;em&gt;symbolic register state&lt;&#x2F;em&gt;,
and then walk through the same per-instruction semantics as above in a
fixpoint dataflow analysis. Put simply, for each storage location (CPU
register or spill slot), we have a lattice:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-12-23-lattice.svg&quot; alt=&quot;Figure: Abstract value lattice&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The &quot;unknown&quot; state is the &quot;top&quot; lattice value. This means simply that
we don&#x27;t know what is in the register because the analysis hasn&#x27;t
converged yet (or no write has occurred).&lt;&#x2F;p&gt;
&lt;p&gt;The &quot;conflicted&quot; state is the &quot;bottom&quot; lattice value. This means that
two or more symbolic definitions have merged. Rather than try to
represent a superposition of them with some sort of predication or
loop summary, we simply give up and move to a state that indicates
&quot;bad value&quot;. This is not a checker error &lt;em&gt;as long as it is never
used&lt;&#x2F;em&gt;, and it can be overwritten with a good value at any time; but if
the value is used as an instruction source, then we flag an error.&lt;&#x2F;p&gt;
&lt;p&gt;The meet function, then, is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;blob&#x2F;109455ce4cea07a6e8d87e06d200c1318605c0ea&#x2F;lib&#x2F;src&#x2F;checker.rs#L144-L155&quot;&gt;very
simple&lt;&#x2F;a&gt;:
two registers meet to &quot;conflicted&quot; unless they are the same register;
&quot;unknown&quot; meets with anything to produce that anything; and
&quot;conflicted&quot; is contagious, in the sense that meeting any other state
with &quot;conflicted&quot; remains &quot;conflicted&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;Note that we said above that our analysis state is a &lt;em&gt;map&lt;&#x2F;em&gt; from
registers and spill slots to symbolic states; not just a single
symbolic state. So our lattice is actually a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Product_order&quot;&gt;product&lt;&#x2F;a&gt; of each
individual storage location&#x27;s state, and we meet symbols
piecewise. (The resulting map contains entries only for keys that
appear in all meet-inputs; i.e. we take the intersection of the
domains.)&lt;&#x2F;p&gt;
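&lt;p&gt;Both levels of the meet -- the per-location symbol lattice and the
piecewise product over storage locations -- can be sketched as follows
(illustrative Python; the real implementation is the Rust linked above):&lt;&#x2F;p&gt;

```python
# The checker's two-level meet, sketched with symbols as plain strings:
# "unknown", "conflicted", or a virtual-register name.
def meet_symbol(a, b):
    if a == "unknown":
        return b
    if b == "unknown":
        return a
    # Equal registers survive; any disagreement (or an existing
    # "conflicted" on either side) collapses to "conflicted".
    return a if a == b else "conflicted"

def meet_state(m1, m2):
    # Product lattice: meet piecewise, keeping only storage locations
    # present in both maps (intersection of the domains).
    return {loc: meet_symbol(m1[loc], m2[loc]) for loc in m1 if loc in m2}
```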
&lt;p&gt;With the analysis state and its meet-function defined, we run a
dataflow analysis loop, allow it to converge, and look for errors; and
we&#x27;re done!&lt;&#x2F;p&gt;
&lt;p&gt;And that&#x27;s it!&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#9&quot;&gt;9&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;effectiveness-can-it-find-bugs&quot;&gt;Effectiveness: Can it Find Bugs?&lt;&#x2F;h2&gt;
&lt;p&gt;The short answer is that yes, it can find some &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;pull&#x2F;86&quot;&gt;pretty subtle
bugs&lt;&#x2F;a&gt;!&lt;&#x2F;p&gt;
&lt;p&gt;The benefit of the regalloc.rs checker is twofold:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;It has found real bugs. In the above example, there was a conceptual
error in the reference-types (precise GC rooting) support: in
certain cases where a spillslot was allocated for a pointer-typed
value but never used, it could be added to the stackmap (list of
pointer-typed spillslots) provided to the GC. This bug needs a
specific set of circumstances to happen: we have to have enough
register pressure that we decide to allocate a spillslot for a
virtual register, but then hit the (rare) code-path in which we
don&#x27;t actually need to do the spill because a register became
available. We never hit this in our other, hand-written tests of GC
(Wasm reference types), despite some
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;jit-test&#x2F;tests&#x2F;wasm&#x2F;ref-types&#x2F;stackmaps1.js&quot;&gt;pretty&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;jit-test&#x2F;tests&#x2F;wasm&#x2F;ref-types&#x2F;stackmaps2.js&quot;&gt;extensive&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;jit-test&#x2F;tests&#x2F;wasm&#x2F;ref-types&#x2F;stackmaps3.js&quot;&gt;tests&lt;&#x2F;a&gt;
at least in SpiderMonkey&#x27;s WebAssembly test suite driving the
Cranelift backend. The fuzzer was able to drive toward full
coverage, hit this rare code-path, and then allow the checker to
discover the error.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;It serves as a gold-standard test while developing &lt;em&gt;new&lt;&#x2F;em&gt; register
allocators. Feedback while developing the linear-scan allocator
(whose reference-type &#x2F; precise GC rooting support came a bit later
than the backtracking allocator&#x27;s) indicated that the checker found
many real issues and allowed for faster and more confident progress.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;related-work&quot;&gt;Related Work&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s surprisingly difficult to find prior work on checkers that
validate individual runs of a register allocator.  There are several
fully-verified compilers in existence;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;compcert.org&#x2F;&quot;&gt;CompCert&lt;&#x2F;a&gt; and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cakeml.org&#x2F;&quot;&gt;CakeML&lt;&#x2F;a&gt;
are two that can compile realistic languages (C and ML,
respectively). These compilers have fully verified register allocators
in the sense that the algorithm itself is proven correct; there is no
need to run a checker on an individual compilation result. The
engineering effort to achieve this is much higher than to write a
checker, however (in the latter case, ~700 lines of Rust).&lt;&#x2F;p&gt;
&lt;p&gt;CakeML&#x27;s approach to proving the register allocator correct is
described by Tan et al. in &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;kar.kent.ac.uk&#x2F;71304&#x2F;1&#x2F;paper.pdf&quot;&gt;The Verified CakeML Compiler
Backend&lt;&#x2F;a&gt;&quot; (J. Func Prog 29,
2019). They appear to have nicely factored the problem so that the
compilation is correct as long as a valid graph coloring or
&quot;permutation&quot; (mapping of program values to storage slots) is
provided. This allows reasoning about the core issue (dataflow
equivalence before and after allocation) separately from the details
of the allocator (graph coloring algorithm).&lt;&#x2F;p&gt;
&lt;p&gt;Proof-producing compilers exist as well: for example,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;abs&#x2F;10.1145&#x2F;3192366.3192377&quot;&gt;Crellvm&lt;&#x2F;a&gt; is a
recent extension of several LLVM passes that generates a
(machine-checkable) correctness proof alongside the transformed
program. This approach is conceptually at the same level as our
register-allocator checker: it results in the validation of a single
compiler run, but is much easier to build than a full a-priori
correctness proof. This effort does not yet appear to address register
allocation, however.&lt;&#x2F;p&gt;
&lt;p&gt;Rideau and Leroy in &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;xavierleroy.org&#x2F;publi&#x2F;validation-regalloc.pdf&quot;&gt;Validating Register Allocation and
Spilling&lt;&#x2F;a&gt;&quot; (CC
2010) describe a similar taxonomy to ours, separating &quot;once and for
all&quot; correctness proofs from &quot;translation validation checks&quot; and
providing the latter. Their validator, however, defines a fairly
complex transfer function that builds a set of equality constraints
that must be solved. It appears that the validator does not leverage
hints from the allocator, specifically w.r.t. spills, reloads and
inserted moves as distinguished from stores, loads and moves in the
original program; without these hints, a much more general and complex
dataflow-equivalence scheme is needed.&lt;&#x2F;p&gt;
&lt;p&gt;Nandivada et al. in &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;homepages.dcc.ufmg.br&#x2F;~fernando&#x2F;publications&#x2F;papers&#x2F;SAS07.pdf&quot;&gt;A Framework for End-to-End Verification and Evaluation of
Register
Allocators&lt;&#x2F;a&gt;&quot;
(SAS 2007) describe a system very similar to our checker in which physical
register contents (as virtual-register or &quot;pseudo&quot; symbols) are encoded into a
post-regalloc IR that is then typechecked. Their typechecker can uncover the
same sorts of regalloc errors that our checker can. Thus, their approach is
largely equivalent to ours; the main difference is that we do not encode the
problem as typechecking on a dedicated IR but rather as a standalone static
analysis.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;This post concludes the three-post series
(&lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&#x2F;&quot;&gt;one&lt;&#x2F;a&gt;,
&lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;01&#x2F;22&#x2F;cranelift-isel-2&#x2F;&quot;&gt;two&lt;&#x2F;a&gt;) describing the work we&#x27;ve
done to develop all the pieces of Cranelift&#x27;s new backend over the
past year!  It has been a very interesting and educational ride for me
personally; I discovered an entirely new world of interesting problems
to solve in the compiler backend, as distinct from the &quot;middle end&quot;
(IR-level optimizations) that is more commonly taught and
studied. Additionally, the focus on &lt;em&gt;fast&lt;&#x2F;em&gt; compilation is an
interesting twist, and one that I believe is not studied enough. It is
easy to justify higher analysis precision and better generated code
through ever-more-complex techniques; the benefit to be found in
design tradeoffs for fast compilation is more subjective and more
dependent on workload.&lt;&#x2F;p&gt;
&lt;p&gt;It is my hope that these writeups have illuminated some of the
thinking that went into our design decisions. Our work is by no means
done, however! The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;8&quot;&gt;roadmap for Cranelift work in
2021&lt;&#x2F;a&gt; lists a number
of ideas that we&#x27;ve discussed to achieve higher compiler performance
and better code quality. I am excited to explore these more in the
coming year; they may even result in more blog posts. Until then,
happy compiling!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;For discussions about this post, please feel free to join us on our Zulip
instance in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;#narrow&#x2F;stream&#x2F;217117-cranelift&#x2F;topic&#x2F;blog.20post.20on.20regalloc.20checker&quot;&gt;this
thread&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;u&#x2F;po8&quot;&gt;&#x2F;u&#x2F;po8&lt;&#x2F;a&gt; on Reddit for &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;rust&#x2F;comments&#x2F;m5w3y4&#x2F;cranelift_part_3_correctness_in_register&#x2F;gr2w3i5&#x2F;&quot;&gt;several
suggestions&lt;&#x2F;a&gt;
which I have incorporated. Thanks also to bjorn3 for several suggestions.
Finally, thanks to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;homepages.dcc.ufmg.br&#x2F;~fernando&#x2F;&quot;&gt;Fernando M Q
Pereira&lt;&#x2F;a&gt; for bringing my attention to
his
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;homepages.dcc.ufmg.br&#x2F;~fernando&#x2F;publications&#x2F;papers&#x2F;SAS07.pdf&quot;&gt;paper&lt;&#x2F;a&gt;
in SAS 2007 that proposes a very similar idea, which I&#x27;ve added to the related
work section. Any and all feedback is welcome!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;Why do CPUs have a limited number of registers? The bound is
mostly due to &lt;em&gt;ISA encoding limitations&lt;&#x2F;em&gt;: there are only so many
bits in an instruction to name a particular register source or
destination. When the CPU designer chooses how many registers to
define, providing more will improve performance (up to a point)
because the CPU can hold more state at one time, but will also
impose an increasing cost in code size and CPU
complexity.&lt;&#x2F;p&gt;
&lt;p&gt;Computer architect&#x27;s tangent: due to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Register_renaming&quot;&gt;register
renaming&lt;&#x2F;a&gt;, a
modern high-performance out-of-order CPU will have many more
&lt;em&gt;physical&lt;&#x2F;em&gt; registers, with architectural register names mapped
to physical registers at any given program point by the
register-renaming hardware (in common parlance, the register
allocation table or RAT), but the ISA encoding restrictions
limit the number that have architectural names at any time. The
existence of register renaming sometimes causes confusion in
discussions of register allocation -- why rename onto so few
registers when we have so many? -- well, we could do much better
if we had more bits to refer to them all! Architectural
standardization is another reason for this: we would not want to
recompile code every time the PRF (physical register file)
became larger. Simpler to say &quot;x86-64 has 16 integer registers&quot;
and be done with it.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#10&quot;&gt;10&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;We don&#x27;t know if exponential time is the &lt;em&gt;best&lt;&#x2F;em&gt; we can do in the
worst case, though most computer scientists suspect so. This is
the famous &lt;code&gt;P=NP&lt;&#x2F;code&gt; problem, and if you can solve it, you &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Millennium_Prize_Problems#P_versus_NP&quot;&gt;win a
million
dollars&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;A slight correction from &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;rust&#x2F;comments&#x2F;m5w3y4&#x2F;cranelift_part_3_correctness_in_register&#x2F;gr2w3i5&#x2F;&quot;&gt;&#x2F;u&#x2F;po8&#x27;s
comment&lt;&#x2F;a&gt;:
register allocation on &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Structured_programming&quot;&gt;structured
programs&lt;&#x2F;a&gt; &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;link.springer.com&#x2F;chapter&#x2F;10.1007&#x2F;978-3-642-37051-9_1&quot;&gt;can be
done&lt;&#x2F;a&gt; in
polynomial time, i.e., better than an exponential brute-force search.
However, the problem remains quite complex!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;4&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;Abstract interpretation was introduced by Radhia and Patrick
Cousot in their seminal 1977 POPL paper &quot;Abstract
interpretation: A Unified Lattice Model for Static Analysis of
Programs by Construction or Approximation of Fixpoints&quot;
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.di.ens.fr&#x2F;~cousot&#x2F;publications.www&#x2F;CousotCousot-POPL-77-ACM-p238--252-1977.pdf&quot;&gt;pdf&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;5&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;5&lt;&#x2F;sup&gt;
&lt;p&gt;Except for move elimination, but we can ignore that for now --
it is possible to adapt the abstract interpretation rules to
account for it later.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;6&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;6&lt;&#x2F;sup&gt;
&lt;p&gt;In regalloc.rs we also have a notion of an instruction that
&quot;modifies&quot; a register, which is like a combined read and write
except that the value must be mapped to the &lt;em&gt;same&lt;&#x2F;em&gt; register for
both. This isn&#x27;t fundamental to the point we&#x27;re illustrating so
we&#x27;ll skip over it for now.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;7&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;7&lt;&#x2F;sup&gt;
&lt;p&gt;This dataflow analysis approach was proposed by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Gary_Kildall&quot;&gt;Gary
Kildall&lt;&#x2F;a&gt; in the POPL
1973 paper &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;512927.512945&quot;&gt;A unified approach to global program
optimization&lt;&#x2F;a&gt;&quot;. (He
is perhaps better-known for writing the microcomputer OS
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;CP&#x2F;M&quot;&gt;CP&#x2F;M&lt;&#x2F;a&gt;, a predecessor to
DOS.) Kildall&#x27;s dataflow analysis builds on the control-flow
graph ideas invented several years prior by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Frances_Allen&quot;&gt;Fran
Allen&lt;&#x2F;a&gt;; in her 1970
paper &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.cs.columbia.edu&#x2F;~suman&#x2F;secure_sw_devel&#x2F;p1-allen.pdf&quot;&gt;Control Flow
Analysis&lt;&#x2F;a&gt;,
she proposes interval-based dataflow analysis, which is the
other main approach known and used today.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;8&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;8&lt;&#x2F;sup&gt;
&lt;p&gt;Note that we have been somewhat vague here about directionality.
What does &quot;more constrained&quot; or &quot;more refined&quot; mean? There are actually
two directions an analysis may work, and these have to do with how it
handles imprecision. A &quot;may-analysis&quot;, or &quot;widening analysis&quot;, computes
what the program may do. It generally begins with an &quot;empty set&quot; of sorts
-- a variable has no possible values, a statement has no side-effects, a
register contains nothing -- and then uses a &lt;em&gt;union&lt;&#x2F;em&gt;-like meet operator
to aggregate all &lt;em&gt;possibilities&lt;&#x2F;em&gt;. The real program behavior will be some
subset of these possibilities. In contrast, a &quot;must-analysis&quot;, or
&quot;narrowing analysis&quot;, computes only what we know the program &lt;em&gt;must&lt;&#x2F;em&gt; do.
It generally begins with the &quot;universe set&quot; and then uses
&lt;em&gt;intersection&lt;&#x2F;em&gt;-like meet operators. The real program&#x27;s behavior is a
superset of this analysis&#x27;s description. We can&#x27;t have both, usually,
because an analysis cannot generally be fully precise.&lt;&#x2F;p&gt;
&lt;p&gt;By convention, we always start analysis values at &quot;top&quot; and use
&quot;meet&quot; to move down the lattice as the analysis converges, though
we could just as well start at &quot;bottom&quot; and move up with &quot;join&quot;,
since flipping the lattice&#x27;s order relation and swapping meet and
join produces another lattice.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;9&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;9&lt;&#x2F;sup&gt;
&lt;p&gt;Well, not quite, as you might have guessed. One significant
detail I&#x27;ve omitted is how we handle &lt;em&gt;reference types&lt;&#x2F;em&gt; and
&lt;em&gt;precise garbage collection&lt;&#x2F;em&gt;. Precise GC rooting entails tracking
a specific kind of &lt;em&gt;type information&lt;&#x2F;em&gt; for each register and
spillslot: specifically, whether each storage location contains a
&lt;em&gt;pointer&lt;&#x2F;em&gt; that the GC should observe when it performs a garbage
collection. It is important in many applications for this to be
&quot;precise&quot;, which means that we can only say that a register
contains a pointer if it &lt;em&gt;actually&lt;&#x2F;em&gt; does, and we &lt;em&gt;must&lt;&#x2F;em&gt; include
all registers that contain pointers. Precision is important
because the GC will assume any root pointer it traces points to a
valid object (so false positives are bad); and must know about
every pointer in case it is a moving GC and relocates an object
(so false negatives are bad).&lt;&#x2F;p&gt;
&lt;p&gt;In our particular variant of the problem, we need this information at
&lt;em&gt;safepoints&lt;&#x2F;em&gt;: these are points at which the GC could be invoked. (It
would be too expensive to plan for a GC invocation at every point in
the program.) Furthermore, we needed to support GCs that could only
trace pointers on the stack (hence, spillslots), not in registers. So
we needed to induce &lt;em&gt;additional spills&lt;&#x2F;em&gt; around safepoints to ensure
pointers were only live on the stack, not in registers.&lt;&#x2F;p&gt;
&lt;p&gt;To check this, we extended the abstract value lattice to note whether
each virtual register is a pointer-typed value or not. Then, at every
safepoint, we (i) ensure that every actual pointer-typed value in a
spillslot is listed in the stackmap provided to the GC, and (ii)
&lt;em&gt;clear&lt;&#x2F;em&gt; any other pointer-typed stack location not listed in the
stack map to an &lt;code&gt;Unknown&lt;&#x2F;code&gt; state. Why the latter? Because an actual
pointer-typed value in a stack slot might be &quot;dead&quot; (not used again),
and so is legal to omit from the stackmap; instead of immediately
flagging an error when one is excluded, we simply ensure that a
later use of it is invalid.&lt;&#x2F;p&gt;
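&lt;p&gt;The rule at each safepoint can be sketched roughly as follows (a Python
sketch with invented names and representation; this is not the actual checker
code):&lt;&#x2F;p&gt;

```python
# A rough sketch of the safepoint rule above. `slots` maps each
# spillslot to its abstract value: "ptr", "non_ptr", or "unknown".
def check_safepoint(slots, stackmap):
    for slot, value in slots.items():
        if value != "ptr":
            continue
        if slot in stackmap:
            # (i) A pointer listed in the stackmap: the GC will trace
            # (and possibly relocate) it, so the slot stays valid.
            continue
        # (ii) A pointer *not* listed may simply be dead; rather than
        # flag an error now, poison the slot so any later use of it
        # is flagged as invalid.
        slots[slot] = "unknown"
    return slots
```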
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;10&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;10&lt;&#x2F;sup&gt;
&lt;p&gt;Note that some computer architectures &lt;em&gt;do&lt;&#x2F;em&gt; task the compiler
with some form of register renaming. For example, the Intel
Itanium (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;IA-64&quot;&gt;IA-64&lt;&#x2F;a&gt;) had a
novel sort of &quot;rotating register reference&quot; feature for loops,
and trusted the compiler with managing a full 128 integer and
128 floating-point registers. Modern GPUs also have thousands
of &quot;registers&quot; managed by the compiler.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Cranelift, Part 2: Compiler Efficiency, CFGs, and a Branch Peephole Optimizer</title>
        <published>2021-01-22T00:00:00+00:00</published>
        <updated>2021-01-22T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2021/01/22/cranelift-isel-2/"/>
        <id>https://cfallin.org/blog/2021/01/22/cranelift-isel-2/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2021/01/22/cranelift-isel-2/">&lt;p&gt;This post is the second in a three-part series about
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&quot;&gt;Cranelift&lt;&#x2F;a&gt;.
In the &lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&#x2F;&quot;&gt;first post&lt;&#x2F;a&gt;, I described
the context around Cranelift and our project to replace its backend
code-generation infrastructure, and detailed the instruction-selection
problem and how we solve it. The remaining two posts will be
deep-dives into some interesting engineering problems.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, I want to dive into the &lt;em&gt;compiler performance&lt;&#x2F;em&gt; aspect of
our work more deeply. (In the next post we&#x27;ll explore correctness.)
There are many interesting aspects of compilation speed I could talk
about, but one particularly difficult problem is the handling of
&lt;em&gt;control flow&lt;&#x2F;em&gt;: how do we translate structured control flow at the
Wasm level into control-flow graphs at the IR level, and finally to
branches in a linear stream of instructions at the machine-code level?&lt;&#x2F;p&gt;
&lt;p&gt;Doing this translation efficiently requires careful attention to the
overall pass structure, with the largest wins coming when one can
completely eliminate a category of work. We&#x27;ll see this in how we
combine several passes in a traditional lowering design (critical-edge
splitting, block ordering, redundant-block elimination, branch
relaxation, branch target resolution) into &lt;em&gt;inline transforms&lt;&#x2F;em&gt; that
happen during other passes (lowering of the CLIF, or Cranelift IR,
into machine-specific IR; and later, binary emission).&lt;&#x2F;p&gt;
&lt;p&gt;This post basically describes the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;buffer.rs&quot;&gt;&lt;code&gt;MachBuffer&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
a &quot;smart machine-code buffer&quot; that knows about branches and edits them
on-the-fly as we emit them, and the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;blockorder.rs&quot;&gt;&lt;code&gt;BlockLoweringOrder&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which allows us to lower code in final basic-block order, with split
critical edges inserted implicitly, by traversing a never-materialized
implicit graph. The work was done mostly in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;1718&quot;&gt;Cranelift PR
#1718&lt;&#x2F;a&gt;, which
resulted in a ~10% compile-time improvement and a ~25%
compile+run-time improvement on a CPU-intensive benchmark (&lt;code&gt;bz2&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;control-flow-graphs&quot;&gt;Control-Flow Graphs&lt;&#x2F;h2&gt;
&lt;p&gt;Before we discuss any of that, we need to review control-flow graphs
(CFGs)! The CFG is a fundamental data structure used in almost all
modern compilers. In brief, it represents how execution (i.e., program
control) may flow through instructions, using graph nodes to represent
linear sequences of instructions and graph edges to represent all
possible control-flow transfers at branch instructions.&lt;&#x2F;p&gt;
&lt;p&gt;At the end of the instruction selection process, which we learned
about in the &lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt;, we
have a function body lowered into VCode that consists of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Basic_block&quot;&gt;&lt;em&gt;basic
blocks&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;. A basic block is
a contiguous sequence of instructions that has no outbound branches
except at the end, and has no inbound branches except at the
beginning. In other words, it is &quot;straight-line&quot; code: execution
always starts at the top and proceeds to the end. An example
control-flow graph (CFG) consisting of four basic blocks is shown
below:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-10-08-cfg-web.svg&quot; alt=&quot;Figure: Control-flow graph with four basic blocks in a diamond&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
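&lt;p&gt;The diamond CFG in the figure can be written down as a minimal adjacency
structure (a Python sketch for illustration, not Cranelift&#x27;s actual
representation):&lt;&#x2F;p&gt;

```python
# The diamond CFG from the figure as a successor map: each block name
# maps to the list of blocks its final branch may transfer to.
cfg = {
    "L0": ["L1", "L2"],   # two-way branch at the end of L0
    "L1": ["L3"],
    "L2": ["L3"],
    "L3": [],             # exit block
}

def predecessors(cfg, block):
    # Dataflow analyses often need the reverse edges as well.
    return [b for b, succs in cfg.items() if block in succs]
```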
&lt;p&gt;Control-flow graphs are excellent data structures for compilers to
use. By making the flow of execution explicit as graph edges, rather
than reasoning about instructions in order in memory as the processor
sees them, many analyses can be performed more easily. For example,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Data-flow_analysis&quot;&gt;dataflow analysis&lt;&#x2F;a&gt;
problems can be solved easily because the CFG makes traversal of
possible control-flow transfers easy. Graph-based representations of
the program also allow easier &lt;em&gt;moving and insertion of code&lt;&#x2F;em&gt;: it is
less error-prone to manipulate an explicit graph than to reason about
implicit control-flow (e.g. fallthrough from a not-taken conditional
branch). Finally, the graph representation factors out the question of
&lt;em&gt;block ordering&lt;&#x2F;em&gt;, which can be important for performance; we can
address this problem separately by choosing how we serialize the graph
nodes (blocks). For these reasons, most compiler IRs, including
Cranelift&#x27;s CLIF and &lt;code&gt;VCode&lt;&#x2F;code&gt;, are CFG-based.&lt;&#x2F;p&gt;
&lt;p&gt;(Historical note: control-flow graphs were invented by the late
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Frances_Allen&quot;&gt;Frances Allen&lt;&#x2F;a&gt;, who
largely established the algorithmic foundations that modern compilers
use. Her paper &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.clear.rice.edu&#x2F;comp512&#x2F;Lectures&#x2F;Papers&#x2F;1971-allen-catalog.pdf&quot;&gt;A catalogue of optimizing
transformations&lt;&#x2F;a&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;
covers essentially all of the important optimizations used today and
is well worth a read.)&lt;&#x2F;p&gt;
&lt;h2 id=&quot;cpus-and-branch-instructions&quot;&gt;CPUs and Branch Instructions&lt;&#x2F;h2&gt;
&lt;p&gt;To represent a CFG&#x27;s end-of-block branches at the instruction level,
we can use &lt;em&gt;two-way branches&lt;&#x2F;em&gt;: these are instructions that branch
either to one basic-block target if some condition is true, or another
if the condition is false. (Basic blocks can also end in simple
unconditional single-target branches.) We wrote such a branch as &lt;code&gt;if r0, L1, L2&lt;&#x2F;code&gt; above; this means that the block &lt;code&gt;L0&lt;&#x2F;code&gt; will be followed in
execution either by &lt;code&gt;L1&lt;&#x2F;code&gt; or &lt;code&gt;L2&lt;&#x2F;code&gt;, depending on the value in &lt;code&gt;r0&lt;&#x2F;code&gt;.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;branches-with-fallthrough&quot;&gt;Branches with Fallthrough&lt;&#x2F;h3&gt;
&lt;p&gt;However, CPUs rarely have such two-way branch instructions. Instead,
conditional control-flow in common ISAs is almost always provided with
a &lt;em&gt;conditional branch with fallthrough&lt;&#x2F;em&gt;. This is an instruction that,
if some condition is true, branches to another location; otherwise,
does nothing, and allows execution to continue sequentially. This is a
better fit for a hardware implementation for a number of reasons: it&#x27;s
easier to encode one target than two (the destination of the jump
might be quite far away for some branches, and instructions have
limited bits available), and it&#x27;s usually the case that the compiler
can place one of the successor blocks immediately afterward anyway.&lt;&#x2F;p&gt;
&lt;p&gt;Now, this isn&#x27;t much of a problem if we just want a working compiler;
instead of a two-way branch&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if r0, L1, L2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can write a sequence of branches&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    br_if r0, L1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where &lt;code&gt;br_if&lt;&#x2F;code&gt; branches to &lt;code&gt;L1&lt;&#x2F;code&gt; or falls through to the unconditional
&lt;code&gt;goto&lt;&#x2F;code&gt;. But this is not so efficient in many cases. Consider what
would happen if we laid out basic blocks in the order &lt;code&gt;L0&lt;&#x2F;code&gt;, &lt;code&gt;L2&lt;&#x2F;code&gt;,
&lt;code&gt;L1&lt;&#x2F;code&gt;, &lt;code&gt;L3&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      br_if r0, L1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      goto L2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      goto L3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      goto L3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L3:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      return&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There are two redundant unconditional branches (&lt;code&gt;goto&lt;&#x2F;code&gt; instructions),
each of which uselessly branches to the following instruction. We can
remove both of them with no ill effects, taking advantage instead of
&lt;em&gt;fallthrough&lt;&#x2F;em&gt;, or allowing execution to proceed directly from the end
of one block to the start of the next one:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      br_if r0, L1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &#x2F;&#x2F; ** Otherwise, fall through to L2 **&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      goto L3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &#x2F;&#x2F; ** Always fall through to L3 **&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L3:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      return&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This seems like an easy enough problem to solve: we just need to
recognize when a branch is redundant and remove it, right? Well, yes,
but we can do much better than that in some cases; we&#x27;ll dig into this
problem in significantly more depth below!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;machine-code-encoding-branch-offsets&quot;&gt;Machine-code Encoding: Branch Offsets&lt;&#x2F;h3&gt;
&lt;p&gt;So far, we&#x27;ve written our machine instructions in a way that humans
can read, using &lt;em&gt;labels&lt;&#x2F;em&gt; to refer to locations in the instruction
stream. At the hardware level, however, these labels do not exist;
instead, the machine code branches contain target &lt;em&gt;addresses&lt;&#x2F;em&gt; (usually
encoded as relative &lt;em&gt;offsets&lt;&#x2F;em&gt; from the branch instruction). In other
words, we do not see &lt;code&gt;goto L3&lt;&#x2F;code&gt;, but rather &lt;code&gt;goto +32&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This gives rise to several complications when emitting machine code
from a list of instruction &lt;code&gt;struct&lt;&#x2F;code&gt;s.  At the most basic level, we
have to resolve labels to offsets and then patch the branches
appropriately. This is analogous to (but at a lower level than) the
job of a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Linker_(computing)&quot;&gt;linker&lt;&#x2F;a&gt;:
we resolve symbols to concrete values after deciding placement, and
then edit the code according to &lt;em&gt;relocations&lt;&#x2F;em&gt; to refer to those
symbols. In other words, whenever we emit a branch, we make a note (a
relocation, or &quot;label use&quot; in our &lt;code&gt;MachBackend&lt;&#x2F;code&gt;) to go back later and
patch it with the resolved label offset.&lt;&#x2F;p&gt;
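&lt;p&gt;A toy version of this record-and-patch scheme might look as follows (a
sketch with invented names; the real &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; is considerably
more sophisticated, editing branches on-the-fly rather than only at the
end):&lt;&#x2F;p&gt;

```python
import struct

# A toy "record a label use, patch it later" buffer: every branch is
# emitted with a placeholder offset and a note (a relocation) to come
# back and fill in the PC-relative offset once labels are resolved.
class ToyBuffer:
    def __init__(self):
        self.code = bytearray()
        self.label_offsets = {}   # label -> resolved code offset
        self.fixups = []          # (offset of 4-byte field, label)

    def bind_label(self, label):
        self.label_offsets[label] = len(self.code)

    def emit_branch(self, opcode, label):
        self.code.append(opcode)
        self.fixups.append((len(self.code), label))
        self.code += b"\x00\x00\x00\x00"  # placeholder offset

    def finish(self):
        # Like a linker applying relocations: patch each branch with
        # the offset of its (now resolved) target, relative to the end
        # of the 4-byte offset field.
        for at, label in self.fixups:
            rel = self.label_offsets[label] - (at + 4)
            self.code[at:at + 4] = struct.pack("<i", rel)
        return bytes(self.code)
```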
&lt;p&gt;The second, and more interesting, problem arises because not all
branch instructions can necessarily refer to all possible labels! As a
concrete example, on AArch64, conditional branches have a ±1 MB range,
and unconditional branches have a ±128 MB range. This arises out of
instruction-encoding considerations: particularly in
fixed-instruction-size ISAs (such as ARM, MIPS, and RISC-V), less than
a full machine word of bits are available for the immediate jump
offset that is embedded in the instruction word. (The instruction
itself is always one machine word wide, and we need some bits for the
opcode and condition code too!) On x86, we have limits for a different
reason: the variable-width encoding allows either a one-byte offset
(allowing a ±128 byte range) or four-byte offset (allowing a ±2 GB
range).&lt;&#x2F;p&gt;
&lt;p&gt;To make a branch to a far-off label, then, on some machines we need to
either use a different sort of branch than the default choice for the
instruction selector, or we need to use a form of &lt;em&gt;indirection&lt;&#x2F;em&gt;, by
targeting the original branch at &lt;em&gt;another branch&lt;&#x2F;em&gt; that is itself in a
special form. The former is tricky because we do not know whether a
target will be in-range until all code is lowered and placement is
computed; so we need to either optimistically or pessimistically lower
branches to the shortest or longest form (respectively) and possibly
switch later. To make matters worse, as we edit branches to use a
shorter or longer form, their length may change, moving &lt;em&gt;other&lt;&#x2F;em&gt;
targets into or out of range; in the most general solution, this is a
&quot;fixpoint problem&quot;, where we iterate until no more changes occur.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;challenges-in-lowering-cfgs-to-machine-code&quot;&gt;Challenges in Lowering CFGs to Machine Code&lt;&#x2F;h2&gt;
&lt;p&gt;So far, we have a way to produce &lt;em&gt;correct&lt;&#x2F;em&gt; machine code. To emit the
final code for a two-target branch, we can emit a conditional branch
followed by an unconditional branch. To resolve
branch targets correctly, we can assume that any target could be
anywhere in memory, and always use the long form of a branch; then we
just need to come back in one final pass and fill in the offsets when
we know them.&lt;&#x2F;p&gt;
&lt;p&gt;We can do much better than this, though! Below I&#x27;ll describe four
problems and the ways that they are traditionally solved.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;problem-1-efficient-use-of-fallthroughs&quot;&gt;Problem 1: Efficient use of Fallthroughs&lt;&#x2F;h3&gt;
&lt;p&gt;We described above how &lt;em&gt;branch fallthroughs&lt;&#x2F;em&gt; allow us to omit
some unconditional branches once we know for certain the order in which
basic blocks will appear in the final binary. In particular, the simple
lowering of a two-way branch &lt;code&gt;if r0, label_if_true, label_if_false&lt;&#x2F;code&gt; to
two one-way branches&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    br_if r0, label_if_true&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto label_if_false&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;label_if_false:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;label_if_true:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;has a completely redundant and useless &lt;code&gt;goto&lt;&#x2F;code&gt;! In general, if a branch
target is the very next instruction, we can delete that branch.&lt;&#x2F;p&gt;
&lt;p&gt;However, there are slightly more complex cases where we can also find
some improvements. Consider the inverted version of the above:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    br_if r0, label_if_false&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto label_if_true&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;label_if_false:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;label_if_true:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;No branch here branches to its fallthrough, so one might think that
both branches are necessary. But in practice, on most CPU
architectures, all conditional branches have &lt;em&gt;inverted forms&lt;&#x2F;em&gt;. For
example, the x86 instruction &lt;code&gt;JE&lt;&#x2F;code&gt; (jump if equal) can be inverted to
&lt;code&gt;JNE&lt;&#x2F;code&gt; (jump if not equal). If we are allowed to edit branch conditions
as well, then we can rewrite the above as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    br_if_not r0, label_if_true&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;label_if_false:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;label_if_true:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This turns out to remove many additional branches in practice.&lt;&#x2F;p&gt;
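&lt;p&gt;Both simplifications can be sketched as a small rewrite over a block&#x27;s
final branch group (the tuple representation here is invented for
illustration):&lt;&#x2F;p&gt;

```python
# Branches at the end of a block, as ("goto", target) or
# ("br_if", cond, target); `next_block` is the block laid out next.
def simplify(branches, next_block):
    if not branches:
        return []
    last = branches[-1]
    if last[0] == "goto" and last[1] == next_block:
        # Rule 1: an unconditional branch to the fall-through block is
        # redundant; drop it and re-check what remains.
        return simplify(branches[:-1], next_block)
    if (len(branches) == 2 and branches[0][0] == "br_if"
            and branches[0][2] == next_block and last[0] == "goto"):
        # Rule 2: `br_if c, next; goto T` becomes the inverted
        # `br_if_not c, T`; the not-taken path now falls through.
        return [("br_if_not", branches[0][1], last[1])]
    return branches
```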
&lt;h3 id=&quot;problem-2-empty-blocks&quot;&gt;Problem 2: Empty Blocks&lt;&#x2F;h3&gt;
&lt;p&gt;It is sometimes the case that after optimizations, a basic block is
completely &lt;em&gt;empty&lt;&#x2F;em&gt; aside from a final unconditional branch. This can
occur when all of the code in an if- or else-block is optimized away
or moved elsewhere in the function body. It can also occur when a
block was inserted to &lt;em&gt;split a critical edge&lt;&#x2F;em&gt; (see below).&lt;&#x2F;p&gt;
&lt;p&gt;Thus, a common optimization is &lt;em&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Jump_threading&quot;&gt;jump
threading&lt;&#x2F;a&gt;&lt;&#x2F;em&gt;: when one
branch points directly to another, we can just edit the first branch
to point to the final target. Generalized, we can &quot;chase through&quot; any
number of branches to eliminate intermediate steps. For example:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;L1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;L2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;L3:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;can become:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L3   &#x2F;&#x2F; &amp;lt;--- edited branch&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;L1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;L2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;L3:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that the intermediate branches &lt;em&gt;were not removed&lt;&#x2F;em&gt;: they may still
be the targets of &lt;em&gt;other branches&lt;&#x2F;em&gt;. We skip over them when starting
from the first branch. However, if we know some other way that these
branches are unused, we can then delete them, reducing code size.&lt;&#x2F;p&gt;
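&lt;p&gt;The &quot;chase through&quot; step can be sketched as follows (invented
representation: &lt;code&gt;empty_goto&lt;&#x2F;code&gt; maps each goto-only block to its
target). Note the cycle guard: a chain of empty blocks can form a loop, and
the compiler should not spin on it:&lt;&#x2F;p&gt;

```python
# Follow goto-only blocks until we reach one with real code, guarding
# against a cycle of empty blocks.
def thread(target, empty_goto):
    seen = set()
    while target in empty_goto and target not in seen:
        seen.add(target)
        target = empty_goto[target]
    return target
```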
&lt;h3 id=&quot;problem-3-branch-relaxation&quot;&gt;Problem 3: Branch Relaxation&lt;&#x2F;h3&gt;
&lt;p&gt;As we noted above, the &quot;branch relaxation&quot; problem is that we must
choose one of &lt;em&gt;multiple forms&lt;&#x2F;em&gt; for each branch instruction, each of
which may have a different range (maximal distance from current
program-counter location). This is complex because the needed range
depends on the final locations of the branch and its target, which in
turn depends on the size of instructions in the machine code; but some
of those instructions are themselves branches. We thus have a circular
dependency.&lt;&#x2F;p&gt;
&lt;p&gt;There will always be &lt;em&gt;some&lt;&#x2F;em&gt; way to branch to an arbitrary location in
the processor&#x27;s address space, so there is always the trivial but
inefficient solution of using worst-case branch forms. However, we can
usually do much better, because the majority of branches will be to
relatively small offsets.&lt;&#x2F;p&gt;
&lt;p&gt;The usual approach to solving this problem involves a &quot;fixpoint
computation&quot;: an iterative loop that continues to make improvements
until none are left. This is where the &quot;relaxation&quot; of branch
relaxation comes from: we modify branch instructions to have more
optimal forms as we discover that targets are within range; and as we
do this, we recompute code offsets and see if this enables any other
relaxations. As long as the relationship between branch range and
branch instruction size is monotonic (smaller required range allows
for shorter instruction), this will always converge to a unique
fixpoint; but it is potentially expensive, and involves sticky
data-structure design questions if we want the code editing and&#x2F;or
offset recomputation to be fast.&lt;&#x2F;p&gt;
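&lt;p&gt;The fixpoint loop can be sketched as follows (sizes and ranges are
invented for illustration; a real implementation must also solve the
data-structure problems just mentioned):&lt;&#x2F;p&gt;

```python
# Each branch starts in its worst-case long form and is shrunk once its
# target is provably within short range; offsets are then recomputed,
# possibly enabling further shrinking. Because branches only shrink
# (monotonicity), the loop must terminate.
SHORT, LONG = 2, 6       # branch encodings, in bytes
SHORT_RANGE = 128        # max |offset| reachable by the short form

def relax(insns):
    # insns: list of ("branch", target_insn_index) or ("other", size).
    sizes = [LONG if kind == "branch" else arg for kind, arg in insns]
    changed = True
    while changed:       # iterate to a fixpoint
        changed = False
        offsets = [0]
        for size in sizes:
            offsets.append(offsets[-1] + size)
        for i, (kind, arg) in enumerate(insns):
            if kind == "branch" and sizes[i] == LONG:
                # Offset is relative to the end of the branch itself.
                dist = offsets[arg] - offsets[i + 1]
                if abs(dist) <= SHORT_RANGE:
                    sizes[i] = SHORT
                    changed = True
    return sizes
```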
&lt;h3 id=&quot;problem-4-critical-edges&quot;&gt;Problem 4: Critical Edges&lt;&#x2F;h3&gt;
&lt;p&gt;For a number of reasons, we usually want to &lt;em&gt;split &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_graph#Special_edges&quot;&gt;critical
edges&lt;&#x2F;a&gt;&lt;&#x2F;em&gt;
in the control-flow graph. A critical edge is any control-flow
transfer edge that comes &lt;em&gt;from&lt;&#x2F;em&gt; a block with multiple out-edges, and
goes &lt;em&gt;to&lt;&#x2F;em&gt; a block with multiple in-edges. We sometimes need to insert
some code to run whenever the program follows a critical edge: e.g.,
the register allocator may need to &quot;fix up&quot; the machine state, moving
values around in registers as expected by the target block. Consider
where we might insert such code: we can&#x27;t insert it prior to the jump,
because this would execute no matter what out-edge is
taken. Similarly, we can&#x27;t insert it at the target of the jump,
because this would execute for any entry into the target block, not
just transfers over the particular edge.&lt;&#x2F;p&gt;
&lt;p&gt;The solution is to &quot;split&quot; the critical edge: that is, create a new
basic block, edit the branch to point to this block, and then create
an unconditional branch in the block to the original target. This
basic block is a place where we can insert whatever fixup code we
need, and it will execute &lt;em&gt;only&lt;&#x2F;em&gt; when control flow transfers from the
one specific block to the other. A critical-edge split is illustrated
in the following figure:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-12-19-critical-edges-web.svg&quot; alt=&quot;Figure: Splitting a critical edge&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;There are multiple ways in which we could handle this problem: we
could preemptively split every critical edge; or we could split them
on demand, only when we need to insert code. The latter would require
editing the CFG in place, and for various reasons, we would prefer to
avoid doing this: it invalidates many analysis results, and
complicates data structures. It is also much simpler to reason about
many algorithms if we can assume that edges are already
split. However, splitting every edge will leave many empty blocks,
because we &lt;em&gt;usually&lt;&#x2F;em&gt; do not need to insert any fixup code on an edge.
In addition, splitting an edge raises the question of &lt;em&gt;where&lt;&#x2F;em&gt; to
insert the split-block. If we take the simplest approach and append it
to the end of the function, we forfeit many of the branch-fallthrough
simplifications we could otherwise make; a smarter
heuristic that placed the block near its predecessor or successor
would be better.&lt;&#x2F;p&gt;
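&lt;p&gt;Identifying and splitting critical edges can be sketched over a
successor-map CFG (a hypothetical Python sketch, not the actual Cranelift
implementation):&lt;&#x2F;p&gt;

```python
# An edge b -> s is critical when b has multiple out-edges and s has
# multiple in-edges; route each such edge through a fresh block that
# holds only `goto s` (and, later, any fixup code).
def split_critical_edges(cfg):
    preds = {b: 0 for b in cfg}
    for succs in cfg.values():
        for s in succs:
            preds[s] += 1
    out = {b: list(succs) for b, succs in cfg.items()}
    counter = 0
    for b, succs in cfg.items():
        if len(succs) < 2:
            continue                  # source has a single out-edge
        for i, s in enumerate(succs):
            if preds[s] < 2:
                continue              # target has a single in-edge
            split = f"split{counter}"
            counter += 1
            out[split] = [s]
            out[b][i] = split
    return out
```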
&lt;h2 id=&quot;traditional-approach-in-place-edits&quot;&gt;Traditional Approach: In-Place Edits&lt;&#x2F;h2&gt;
&lt;p&gt;The traditional approach to all of these problems is to decompose the
task into a number of &lt;em&gt;passes&lt;&#x2F;em&gt; and perform &lt;em&gt;in-place edits&lt;&#x2F;em&gt; with those
passes. For example, in LLVM, IR is lowered into a machine-specific
form (&lt;code&gt;MachineFunction&lt;&#x2F;code&gt; of &lt;code&gt;MachineBasicBlock&lt;&#x2F;code&gt;s) with an explicit
notion of layout and with machine-level branch instructions; then
edits are made, taking care to update branches when the layout
changes.&lt;&#x2F;p&gt;
&lt;p&gt;For example, the following sequence of passes should handle most of
the above issues:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Split all critical edges, placing the split-block after the
predecessor. (In LLVM, the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-project&#x2F;blob&#x2F;3fa2d37eb3f8acddcfde749ca822f2cc7d900cbb&#x2F;llvm&#x2F;lib&#x2F;Transforms&#x2F;Utils&#x2F;BreakCriticalEdges.cpp&quot;&gt;&lt;code&gt;SplitCriticalEdges&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
pass.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Perform other optimizations, and register allocation; these may use
the split-blocks.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Perform jump-threading transform; this will remove control-flow
transfers through empty blocks. (In LLVM, the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-project&#x2F;blob&#x2F;3fa2d37eb3f8acddcfde749ca822f2cc7d900cbb&#x2F;llvm&#x2F;lib&#x2F;CodeGen&#x2F;BranchFolding.cpp&quot;&gt;&lt;code&gt;BranchFolding&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
pass.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Compute reachability, and delete &quot;dead blocks&quot; (blocks that are no
longer reachable). (Also done by &lt;code&gt;BranchFolding&lt;&#x2F;code&gt; in LLVM.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Compute a block order that tries to minimize jump distances and
places at least one successor directly after every block when
possible. (In LLVM, the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-project&#x2F;blob&#x2F;3fa2d37eb3f8acddcfde749ca822f2cc7d900cbb&#x2F;llvm&#x2F;lib&#x2F;CodeGen&#x2F;MachineBlockPlacement.cpp&quot;&gt;&lt;code&gt;MachineBlockPlacement&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
pass.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Linearize the code from the CFG nodes into a single stream of
machine instructions using this block order. (In LLVM, blocks are
initially lowered into the &lt;code&gt;MachineFunction&lt;&#x2F;code&gt; and then reordered by
&lt;code&gt;MachineBlockPlacement&lt;&#x2F;code&gt;.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Remove branches to fallthrough blocks, and invert conditionals that
create additional fallthroughs.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Compute block offsets based on machine-code size of current
instruction sequence, assuming worst-case size for every branch.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Scan over branches, checking whether block locations allow for
shorter forms due to nearer targets. Update branches and recompute
block offsets if so. Continue until fixpoint. (In LLVM, the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-project&#x2F;blob&#x2F;3fa2d37eb3f8acddcfde749ca822f2cc7d900cbb&#x2F;llvm&#x2F;lib&#x2F;CodeGen&#x2F;BranchRelaxation.cpp&quot;&gt;&lt;code&gt;BranchRelaxation&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
does this.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Fill in branch targets using final offsets. Branches are now in a
form ready for machine-code emission.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
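&lt;p&gt;As a rough sketch of the final two steps above (worst-case offset
computation, then shrinking branches to a fixpoint), with made-up
instruction sizes and field names rather than LLVM&#x27;s actual data
structures:&lt;&#x2F;p&gt;

```python
# Sketch of the offset-computation and branch-relaxation fixpoint; all
# sizes and field names here are assumptions, not LLVM data structures.
SHORT_BRANCH, LONG_BRANCH, RANGE = 2, 5, 128  # assumed byte sizes / reach

def relax_branches(blocks):
    """blocks: list of {'size': non-branch bytes, 'branch': optional
    {'target': block index, 'short': bool}} in layout order."""
    for b in blocks:                      # start with worst-case forms
        if b.get('branch'):
            b['branch']['short'] = False
    changed = True
    while changed:
        changed = False
        offsets, pos = [], 0              # block offsets at current sizes
        for b in blocks:
            offsets.append(pos)
            pos += b['size']
            if b.get('branch'):
                pos += SHORT_BRANCH if b['branch']['short'] else LONG_BRANCH
        for i, b in enumerate(blocks):    # shrink any newly-near branch
            br = b.get('branch')
            if br and not br['short']:
                if RANGE > abs(offsets[br['target']] - offsets[i]):
                    br['short'] = True
                    changed = True        # sizes changed; recompute offsets
    return offsets
```

&lt;p&gt;Termination is guaranteed because branches only ever shrink, so the
loop reaches a fixpoint after at most one size change per branch.&lt;&#x2F;p&gt;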
&lt;p&gt;Clearly this will work, and with some care (especially in the
block-placement heuristics), it will produce very good code. But the
above steps require &lt;em&gt;many&lt;&#x2F;em&gt; in-place edits. This is both slow (we are
re-doing some work every time we edit the code) and forces us to use
data structures that allow for such edits (e.g., linked lists), which
impose a tax on every other operation on the IR. Is there a better
way?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;cranelift-s-new-approach-streaming-edits&quot;&gt;Cranelift&#x27;s New Approach: Streaming Edits&lt;&#x2F;h2&gt;
&lt;p&gt;It would be ideal if we could avoid some of the code-transform passes
described above; can we? It turns out that one can actually do the
functional equivalent of &lt;em&gt;all&lt;&#x2F;em&gt; of the above as part of other,
pre-existing work:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;We can decide the final block order ahead of time, and do our
CLIF-to-VCode lowering in this order, so VCode never needs to be
reordered; it is already linearized. We can also insert
critical-edge splits as part of this lowering.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We can do &lt;em&gt;all&lt;&#x2F;em&gt; of the other work -- inverting conditionals,
threading jumps, removing dead blocks, and handling various branch
sizes -- in a streaming approach during machine-code emission! The
key insight is that we can do a sort of &quot;peephole optimization&quot;: we
can immediately delete and re-emit branches at the &lt;em&gt;tail&lt;&#x2F;em&gt; of the
emission buffer. By tracking some auxiliary state during the single
emission scan, such as reachability, labels at current emission
point, a list of unresolved label-refs earlier in code, and a
&quot;deadline&quot; for short-range branches, we can do everything we need
to do without ever backing up more than a few contiguous branches
at the end of the buffer.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Let&#x27;s go into each of those in more detail!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;step-1-decide-final-order-and-split-edges-while-lowering&quot;&gt;Step 1: Decide Final Order and Split Edges while Lowering&lt;&#x2F;h3&gt;
&lt;p&gt;As part of the instruction-selection pipeline described in the
&lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&quot;&gt;previous post&lt;&#x2F;a&gt;, we need
to iterate over the basic blocks in the CLIF and, for each block,
lower its code to VCode instructions. We would like to do this
iteration in the same order as our final machine code layout so that
we don&#x27;t need to reorder the VCode later.&lt;&#x2F;p&gt;
&lt;p&gt;The only constraint that the lowering algorithm imposes is that we
examine &lt;em&gt;value uses&lt;&#x2F;em&gt; before &lt;em&gt;value defs&lt;&#x2F;em&gt;, which we can ensure by
visiting a block before any of its dominators. That leaves a lot of
freedom in how we do the lowering.&lt;&#x2F;p&gt;
&lt;p&gt;If that were the whole problem, we could just do a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Depth-first_search#Vertex_orderings&quot;&gt;postorder&lt;&#x2F;a&gt;
traversal and be done with it. In fact, the problem is complicated by
one other factor: critical-edge splitting!&lt;&#x2F;p&gt;
&lt;p&gt;Recall from above that we must either preemptively split all
critical edges or else find a way to edit in place later. To avoid the
complexities of in-place editing, we choose to split them all. Note
that this is cheap as far as our CFG lowering is concerned, because
our later branch optimizations will remove empty blocks almost for
free. (The register allocator&#x27;s analyses may become more expensive
with a higher block count, but in practice we have not found this to
be much of a problem.)&lt;&#x2F;p&gt;
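&lt;p&gt;For concreteness, here is a small sketch of the standard
critical-edge test on a CFG given as a successor map; the data shapes
are illustrative, not Cranelift&#x27;s actual representation:&lt;&#x2F;p&gt;

```python
# A minimal sketch of critical-edge detection; illustrative only.

def critical_edges(succs):
    """succs: dict mapping block to its list of successor blocks.
    An edge a to b is critical iff a has more than one successor and
    b has more than one predecessor: fixup code for that edge can go
    in neither a nor b, so the edge must get its own block."""
    preds = {}
    for a, ss in succs.items():
        for b in ss:
            preds.setdefault(b, []).append(a)
    out = []
    for a, ss in succs.items():
        for b in ss:
            if len(ss) > 1 and len(preds[b]) > 1:
                out.append((a, b))
    return out
```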
&lt;p&gt;The challenge is in generating these blocks in the correct place &lt;em&gt;on
the fly&lt;&#x2F;em&gt;. To generate the lowering order, we define a &lt;em&gt;virtual graph&lt;&#x2F;em&gt;
that is never actually materialized, whose nodes are implied by the
CLIF blocks and edges (every CLIF edge becomes a split-edge block) and
whose edges are defined only by a successor function. To generate the
lowering order, we perform a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Depth-first_search&quot;&gt;depth-first
search&lt;&#x2F;a&gt; over the
virtual graph, recording the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Depth-first_search#Vertex_orderings&quot;&gt;postorder&lt;&#x2F;a&gt;. This
postorder is guaranteed to see uses before defs, as required. The DFS
itself is a pretty good heuristic for block placement: it will tend to
group structured-control-flow code together into its hierarchical
units.&lt;&#x2F;p&gt;
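&lt;p&gt;A minimal sketch of this postorder computation over the virtual
graph, with the successor function as the graph&#x27;s only definition; the
node shapes and names here are illustrative, not the real
implementation:&lt;&#x2F;p&gt;

```python
# Sketch of the lowering-order computation: a DFS postorder over a
# virtual graph whose nodes are CLIF blocks and (pred, succ) edge
# pairs, defined only by a successor function and never materialized.

def lowering_order(entry, cfg_succs):
    """cfg_succs: dict mapping a CLIF block to its successor list."""
    def virtual_succs(node):
        if node[0] == 'block':
            # Every CFG edge out of a block becomes a virtual edge node.
            return [('edge', node[1], s) for s in cfg_succs.get(node[1], [])]
        # An edge node has exactly one successor: its target block.
        return [('block', node[2])]

    postorder, seen = [], set()
    def dfs(node):
        if node in seen:
            return
        seen.add(node)
        for s in virtual_succs(node):
            dfs(s)
        postorder.append(node)   # record on the way out: postorder

    dfs(('block', entry))
    # A block's dominators are DFS-tree ancestors, which finish after
    # it, so every block appears before its dominators in this order.
    return postorder
```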
&lt;p&gt;There are additional details in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;blockorder.rs&quot;&gt;the
implementation&lt;&#x2F;a&gt;
that ensure we split only critical edges rather than all edges and
that record block-successor information directly as we produce lowered
blocks (so that subsequent backend stages do not need to recompute
it), along with some other small optimizations.&lt;&#x2F;p&gt;
&lt;p&gt;This is illustrated in the following figure, which shows, at a
conceptual level, a CLIF-level CFG transformed with split edges and
merged edge-blocks and then linearized, alongside the successor
function that is actually defined to drive the DFS. Note that the
naïve lowering of the split-edge CFG
would result in 14 branches (due to 14 CFG edges); the final lowered
machine code contains only 4 branches, while providing a slot for any
needed fixup instructions on any CFG edge.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-12-19-cfg-lowering-web.svg&quot; alt=&quot;Figure: CFG lowering with edge-splitting and merging using implicit DFS&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;step-2-edit-branches-while-emitting&quot;&gt;Step 2: Edit Branches while Emitting&lt;&#x2F;h3&gt;
&lt;p&gt;Once we have lowered &lt;code&gt;VCode&lt;&#x2F;code&gt;, we need to emit machine code! In a
conventional design, this would require linearization, a bunch of
branch optimizations, and branch-target resolution before we ever
produced a byte of machine code. But we can do much better.&lt;&#x2F;p&gt;
&lt;p&gt;In Cranelift&#x27;s design, a machine backend just &lt;em&gt;emits every conditional
branch naïvely as a two-way combination&lt;&#x2F;em&gt; into a machine-code buffer we
call the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt;. Critically, however, this &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; is not
merely a &lt;code&gt;Vec&amp;lt;u8&amp;gt;&lt;&#x2F;code&gt;: it knows (many) things about its content,
including where its branches are, what the branches&#x27; targets are, and
how to invert the branches if necessary.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; will perform &lt;em&gt;streaming edits&lt;&#x2F;em&gt; on the code as it is
emitted, editing only a &lt;em&gt;suffix&lt;&#x2F;em&gt; of the buffer (contiguous bytes up to
the end, or current emission point), in order to convert two-way
branch combos when possible into simpler forms.&lt;&#x2F;p&gt;
&lt;p&gt;The abstraction that the machine backend sees is:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; allows us to emit machine-code bytes.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We can tell the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; that a certain range of machine-code
bytes that we just emitted are a &lt;em&gt;branch&lt;&#x2F;em&gt;, either conditional or
unconditional, how to invert it if conditional, and a &lt;em&gt;label&lt;&#x2F;em&gt; as the
branch target.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We can &lt;em&gt;bind&lt;&#x2F;em&gt; a label to the current emission point.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We parameterize the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; on a &lt;code&gt;LabelUse&lt;&#x2F;code&gt; trait
implementation which defines all the different kinds of
branch-target references, how to patch the machine code with a
resolved offset, and how to emit a &lt;em&gt;veneer&lt;&#x2F;em&gt;, i.e., a longer-form
branch that the original branch can indirect through in order to
reach further.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
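&lt;p&gt;A toy model of this abstraction might look as follows; the method
names are hypothetical, and the real Rust &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; API
differs in its details:&lt;&#x2F;p&gt;

```python
# A toy sketch of the emission-buffer interface described above;
# method names are illustrative, not the real MachBuffer API.

class EmitBuffer:
    def __init__(self):
        self.data = bytearray()
        self.label_offsets = {}  # label -> resolved offset, once bound
        self.fixups = []         # (offset, label, kind) awaiting patching

    def cur_offset(self):
        return len(self.data)

    def put_bytes(self, bs):
        self.data.extend(bs)

    def bind_label(self, label):
        # Bind a label to the current emission point. Once resolved,
        # an offset never changes, so fixups can be applied once.
        self.label_offsets[label] = self.cur_offset()

    def use_label(self, offset, label, kind):
        # Record that the bytes at `offset` reference `label` with a
        # given reference kind (cf. the LabelUse trait); patched when
        # the label is bound, or immediately if it already is.
        self.fixups.append((offset, label, kind))
```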
&lt;p&gt;And that&#x27;s it! The &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; does all of the work behind the
scenes: when we emit a branch, it sometimes chomps some bytes; and
when we bind a label, it sometimes scans through a list of deferred
fixups to patch earlier machine code.&lt;&#x2F;p&gt;
&lt;p&gt;A (simplified) illustrated example is shown below. The machine backend
emits two-way branches naïvely by always emitting a conditional and
unconditional branch (e.g. at the end of basic block &lt;code&gt;L0&lt;&#x2F;code&gt;). It also
provides metadata to the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; that describes where the labels
are, where the branches are, where the branches are targeted (as
labels), and how to invert conditional branches. The &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; is
able to perform the listed streaming edits &lt;em&gt;as code is emitted&lt;&#x2F;em&gt;,
producing the final machine code at the right with no intermediate
buffering or additional passes. We&#x27;ll describe how this editing occurs
in more detail below.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-12-19-machbuffer-web.svg&quot; alt=&quot;Figure: MachBuffer emission and edits&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;branch-peephole-optimizations&quot;&gt;Branch Peephole Optimizations&lt;&#x2F;h4&gt;
&lt;p&gt;The key insight of the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; design is that we can edit
branches &lt;em&gt;at the tail of the buffer&lt;&#x2F;em&gt; as code is emitted by tracking
the &quot;most recent&quot; branches: specifically, the branches that are
&lt;em&gt;contiguous to the tail of the buffer&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The first optimization that we do is &lt;em&gt;branch inversion&lt;&#x2F;em&gt;, which can
sometimes eliminate unconditional branches. In the example above, when
the backend has emitted the machine-code bytes for all of the &lt;code&gt;L0&lt;&#x2F;code&gt;
basic block, the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; will know that the last two branches,
contiguous to the tail, are a conditional branch to &lt;code&gt;L1&lt;&#x2F;code&gt; and an
unconditional branch to &lt;code&gt;L2&lt;&#x2F;code&gt;. When we then see the label &lt;code&gt;L1&lt;&#x2F;code&gt; (which
is the first branch&#x27;s target) bound to this offset, we can apply a
simple rule: a conditional that jumps over an immediately following
unconditional can be inverted, and the unconditional branch
removed. Note that, critically, because these branches are &lt;em&gt;contiguous
to the tail&lt;&#x2F;em&gt;, the edit will not affect any offsets that have already
been resolved; we are free to contract the code-size here, and offsets
of subsequently-emitted code will be correct without further fixups.&lt;&#x2F;p&gt;
&lt;p&gt;Said another way, our approach &lt;em&gt;never moves code&lt;&#x2F;em&gt;. Rather, it only
sometimes &lt;em&gt;chomps or adjusts a just-emitted branch&lt;&#x2F;em&gt;, right away,
before code-emission carries on past the branch.&lt;&#x2F;p&gt;
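&lt;p&gt;The inversion rule can be sketched on a symbolic list of tail
branches as follows; the data shapes are purely illustrative, not the
real byte-level implementation:&lt;&#x2F;p&gt;

```python
# Sketch of the branch-inversion rule: a conditional that jumps over
# an immediately following unconditional, to a label bound right at
# the tail, becomes the inverted conditional alone. Illustrative only.

def invert_at_tail(tail_branches, bound_label):
    """tail_branches: branches contiguous to the buffer tail, oldest
    first, e.g. [('cond', 'eq', 'L1'), ('uncond', 'L2')].
    bound_label: the label just bound at the current tail offset."""
    if len(tail_branches) >= 2:
        a, b = tail_branches[-2], tail_branches[-1]
        if a[0] == 'cond' and b[0] == 'uncond' and a[2] == bound_label:
            inverse = {'eq': 'ne', 'ne': 'eq', 'lt': 'ge', 'ge': 'lt'}
            # Chomp the unconditional; the inverted conditional now
            # targets the unconditional's old target.
            return tail_branches[:-2] + [('cond', inverse[a[1]], b[1])]
    return tail_branches
```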
&lt;p&gt;The next optimizations we do are &lt;em&gt;jump threading&lt;&#x2F;em&gt; and &lt;em&gt;dead-block
removal&lt;&#x2F;em&gt;. Recall from above that jump threading means that
intermediate steps in a chain of jumps can be removed: a jump to a
jump to X becomes just a jump to X. We resolve this by keeping an
&lt;em&gt;up-to-date alias table&lt;&#x2F;em&gt; that tracks label-to-label aliases. The table
is updated whenever the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; is informed that an unconditional
jump was emitted and a label was bound to its address. We then
indirect through the alias table when resolving labels to final
offsets. The second task, dead-block removal, occurs as a side-effect
of tracking &lt;em&gt;reachability&lt;&#x2F;em&gt; of the current buffer tail. Any offset that
(i) immediately follows an unconditional jump, &lt;em&gt;and&lt;&#x2F;em&gt; (ii) has no
labels bound to it, is unreachable; an unconditional jump at an
unreachable offset can be elided. (Actually, any code at an
unreachable offset can be removed, but for simplicity and to make it
easier to reason about correctness, we restrict the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt;&#x27;s
edits to code explicitly marked as branch instructions only.)&lt;&#x2F;p&gt;
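&lt;p&gt;A sketch of label resolution through the alias table: the real
implementation keeps the table acyclic by construction, and the chase
limit here mirrors its defense-in-depth bound. Names are illustrative
only:&lt;&#x2F;p&gt;

```python
# Sketch of jump threading via the label-to-label alias table;
# a jump to a jump to X resolves to X's offset. Illustrative only.

CHASE_LIMIT = 1_000_000  # defense-in-depth; the table should be acyclic

def resolve(label, aliases, offsets):
    """aliases: label -> label it was redirected to (jump threading).
    offsets: label -> final byte offset, for fully resolved labels."""
    for _ in range(CHASE_LIMIT):
        if label not in aliases:
            return offsets[label]
        label = aliases[label]
    raise RuntimeError('alias chain too long; table should be acyclic')
```

&lt;p&gt;For example, a use of &lt;code&gt;L1&lt;&#x2F;code&gt;, where &lt;code&gt;L1&lt;&#x2F;code&gt; was
bound at an unconditional jump to &lt;code&gt;L2&lt;&#x2F;code&gt;, resolves directly
to &lt;code&gt;L2&lt;&#x2F;code&gt;&#x27;s offset.&lt;&#x2F;p&gt;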
&lt;p&gt;In order for this to work correctly, we need to track all labels that
have been bound to the current buffer tail and adjust them if we chomp
(truncate) the buffer or redirect a label. For this reason, the
label-binding, label-use resolution, and branch-chomping are all
tightly integrated into a set of interacting data structures:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-12-19-machbuffer-structures-rewrites-web.svg&quot; alt=&quot;Figure: MachBuffer data structures and rewrites&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;To summarize, we track:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Emitted bytes;&lt;&#x2F;li&gt;
&lt;li&gt;All labels bound to the current offset;&lt;&#x2F;li&gt;
&lt;li&gt;A table mapping each label to a bound offset, or &quot;unbound&quot;;&lt;&#x2F;li&gt;
&lt;li&gt;A table mapping each label to another label it aliases, or &quot;unaliased&quot;;&lt;&#x2F;li&gt;
&lt;li&gt;A list of the &lt;em&gt;last contiguous branches&lt;&#x2F;em&gt;, each recording
whether it is conditional (and, if so, its inverted form), its
label-use, and any labels that are bound &lt;em&gt;to&lt;&#x2F;em&gt; this branch
instruction;&lt;&#x2F;li&gt;
&lt;li&gt;A list of other label-uses for fixup.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;As we emit code, we append to the emitted-bytes buffer. We lazily
invalidate the &quot;labels at current offset&quot; set by tracking the offset
for which that set is valid; appending new code implicitly clears it.&lt;&#x2F;p&gt;
&lt;p&gt;As the machine backend tells the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; about branches, we
append to the list of the last contiguous branches. This, too, is
invalidated when code is emitted that is not a branch.&lt;&#x2F;p&gt;
&lt;p&gt;When a label-use is noted and the label is already resolved, we fix up
the buffer right away. Note that once a label resolves to an offset,
that offset cannot change; so this fixup can be done once and the
metadata discarded.&lt;&#x2F;p&gt;
&lt;p&gt;All branch simplification happens when a label is &lt;em&gt;bound&lt;&#x2F;em&gt;: this is
when new actions become possible. We perform the following algorithm
(see
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;ce6e967eebc7e293950394ba212689190e2cf0ed&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;buffer.rs#L753&quot;&gt;&lt;code&gt;MachBuffer::optimize_branches()&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
for details):&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Loop as long as there are some branches in the latest-branches list:
&lt;ul&gt;
&lt;li&gt;If the current buffer tail is beyond the end of the latest branch,
we are done (clear the list).&lt;&#x2F;li&gt;
&lt;li&gt;If the latest branch (which ends at current tail) has a target
that resolves to current tail, chomp it and restart loop.&lt;&#x2F;li&gt;
&lt;li&gt;If the latest branch is unconditional &lt;em&gt;and does not branch to
itself&lt;&#x2F;em&gt;:
&lt;ul&gt;
&lt;li&gt;Update any labels pointing &lt;em&gt;at&lt;&#x2F;em&gt; this branch to point &lt;em&gt;at its
target&lt;&#x2F;em&gt; instead.&lt;&#x2F;li&gt;
&lt;li&gt;Restart loop if any labels were moved.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;If latest branch is unconditional, follows another unconditional
branch, and no labels are bound at this branch, then chomp it
(unreachable) and restart loop.&lt;&#x2F;li&gt;
&lt;li&gt;If latest branch is unconditional, follows a conditional branch,
and conditional branch target is current tail, then invert
conditional and chomp the unconditional, and restart loop.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;When loop is done, clear latest-branches list; no more can be
simplified.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
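&lt;p&gt;The loop above can be sketched symbolically as follows. Branches
here carry only offsets and targets rather than real machine bytes,
and all field names are illustrative; the real byte-level logic lives
in &lt;code&gt;MachBuffer::optimize_branches()&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;

```python
# Symbolic sketch of the simplification loop run at each label-bind.
# st: {'branches': latest contiguous branches (oldest first), each
# {'start','end','kind','target'} plus 'cc' if conditional; 'tail':
# buffer length; 'offsets': label->offset; 'aliases': label->label;
# 'bound': offset->list of labels bound there}. Illustrative only.

INVERT = {'eq': 'ne', 'ne': 'eq'}  # assumed condition-code inverses

def optimize_branches(st):
    def resolve(lbl):                  # real impl proves this acyclic
        while lbl in st['aliases']:
            lbl = st['aliases'][lbl]
        return st['offsets'].get(lbl)

    def chomp():                       # truncate the latest branch
        b = st['branches'].pop()
        st['tail'] = b['start']
        for lbl in st['bound'].pop(b['end'], []):
            st['offsets'][lbl] = b['start']   # labels move to new tail
            st['bound'].setdefault(b['start'], []).append(lbl)

    while st['branches']:
        b = st['branches'][-1]
        if st['tail'] > b['end']:
            st['branches'] = []        # tail moved past: list is stale
            break
        if resolve(b['target']) == st['tail']:
            chomp()                    # branch to fallthrough: chomp
            continue
        if b['kind'] == 'uncond':
            here = st['bound'].get(b['start'], [])
            if here and b['target'] not in here:  # not a self-branch
                for lbl in here:       # thread labels past this jump
                    st['aliases'][lbl] = b['target']
                    st['offsets'].pop(lbl, None)
                st['bound'][b['start']] = []
                continue
            prev = st['branches'][-2] if len(st['branches']) > 1 else None
            if prev and prev['kind'] == 'uncond' and not here:
                chomp()                # unreachable unconditional
                continue
            if prev and prev['kind'] == 'cond' and \
                    resolve(prev['target']) == st['tail']:
                uncond_target = b['target']
                chomp()                # cond-over-uncond: invert cond
                prev['cc'] = INVERT[prev['cc']]
                prev['target'] = uncond_target
                continue
        break
    return st
```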
&lt;p&gt;This may look as though it has undesirable algorithmic complexity,
but in fact it is tightly bounded: we can make a forward-progress
argument, since labels only move down alias chains and a fixed amount
of work is done per branch (each is examined only once and then acted
upon or purged). Overall, the algorithm runs in linear time.&lt;&#x2F;p&gt;
&lt;p&gt;This linear-time algorithm, which edits locally, avoids any code
movement, and streams into a buffer in final form, is far better than
the multi-pass edit-in-place design of a traditional backend, both
asymptotically and in practice (CPUs love streaming algorithms and
minimized data movement). It seems to produce code nearly as good as a
much more complex branch simplifier, at a much lower cost.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;correctness&quot;&gt;Correctness&lt;&#x2F;h3&gt;
&lt;p&gt;The algorithm for simplifying branches is one of the most critical to
correctness in the (post-optimizer) compiler backend, probably second
only to the register allocator. It is very subtle, and bugs can be
disastrous: incorrect control flow could cause &lt;em&gt;anything&lt;&#x2F;em&gt; to happen,
from impossible-to-debug incorrect program results (ask me &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;1729&quot;&gt;how I
know&lt;&#x2F;a&gt;! And
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;2083&quot;&gt;here too&lt;&#x2F;a&gt;!)
to serious security vulnerabilities.&lt;&#x2F;p&gt;
&lt;p&gt;Because of this, we have taken &lt;em&gt;extensive&lt;&#x2F;em&gt; care to ensure
correctness. In fact, more than a third of the lines in the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;ce6e967eebc7e293950394ba212689190e2cf0ed&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;buffer.rs&quot;&gt;&lt;code&gt;MachBuffer&lt;&#x2F;code&gt;
implementation&lt;&#x2F;a&gt;
are a proof of correctness, based on several core invariants
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;ce6e967eebc7e293950394ba212689190e2cf0ed&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;buffer.rs#L109-L141&quot;&gt;described
here&lt;&#x2F;a&gt;. At
each data-structure mutation, we show that (i) the invariants still
hold, and (ii) the code mutation did not alter execution semantics.&lt;&#x2F;p&gt;
&lt;p&gt;Because there is significant wisdom in the Knuth quote &quot;I have only
proved it correct, not tried it&quot; (there are always gaps between
specification and reality, and unless one generates an implementation
from a machine-checked proof, then the English prose or its
translation to code may have bugs too&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;), all invariants are also
fully checked on each label-bind event in debug builds of
Cranelift. The various fuzzing harnesses that hammer on the new
backend will thus be driving these checks continuously.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;other-concerns&quot;&gt;Other Concerns&lt;&#x2F;h3&gt;
&lt;p&gt;There are many subtleties to the branch-simplification and code layout
problems that were not discussed here! Most prominently, we have not
covered &lt;em&gt;branch veneers&lt;&#x2F;em&gt; or the topic of branch ranges at all, though
we saw the &lt;em&gt;problem&lt;&#x2F;em&gt; of &quot;branch relaxation&quot; above. The &lt;code&gt;MachBuffer&lt;&#x2F;code&gt;
handles out-of-range branches by tracking a &quot;deadline&quot; (the last point
at which any currently outstanding label may be bound without causing
a branch to go out of range); if we hit the deadline, we emit an
&lt;em&gt;island&lt;&#x2F;em&gt; of &lt;em&gt;branch veneers&lt;&#x2F;em&gt;, which are commonly just long-range
unconditional branches, for each unresolved label and resolve the
labels to those branches. This extends the deadlines. In practice,
island emission will almost never occur, so it is acceptable to
pessimize this case (add an extra indirection) to avoid the need to go
back and edit the original branch into a longer-range form.&lt;&#x2F;p&gt;
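&lt;p&gt;A sketch of this deadline check and island emission, with an
assumed branch range and veneer size rather than any real
ISA&#x27;s:&lt;&#x2F;p&gt;

```python
# Sketch of deadline-driven island emission for short-range branches;
# the range and veneer size are assumptions, not a real ISA's.

SHORT_RANGE = 32768  # assumed maximum forward reach of a short branch
VENEER_SIZE = 4      # assumed size of a long-range unconditional branch

def maybe_emit_island(tail, pending):
    """pending: list of {'from': offset of a short-range branch,
    'label': its unresolved target}. When the earliest deadline is
    about to pass, emit an island of long-range veneers and retarget
    the pending branches at them, extending every deadline."""
    deadline = min((p['from'] + SHORT_RANGE for p in pending), default=None)
    if deadline is None or deadline > tail + VENEER_SIZE * len(pending):
        return tail, pending, []   # still in range: no island needed
    veneers = []
    for p in pending:
        # The short branch at p['from'] is patched to hit this veneer,
        # which in turn jumps (with long range) to the real label.
        veneers.append({'at': tail, 'label': p['label']})
        tail += VENEER_SIZE
    return tail, [], veneers
```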
&lt;p&gt;We also haven&#x27;t covered constant pools; these are handled with the
same &quot;island&quot; mechanism, allowing emitted machine code to refer to
nearby constant data.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion-and-next-time&quot;&gt;Conclusion, and Next Time&lt;&#x2F;h2&gt;
&lt;p&gt;This has been a deep dive into the world of branch simplification,
with an emphasis on how we engineered Cranelift&#x27;s new backend to
provide very good compilation speed, taking control-flow handling and
branch lowering&#x2F;simplification as an example. We believe that there
may be other significant opportunities to rethink, and carefully
engineer, core algorithms in the compiler backend with specific
attention to &lt;em&gt;maximizing streaming behavior&lt;&#x2F;em&gt;, &lt;em&gt;minimizing
indirection&lt;&#x2F;em&gt;, and &lt;em&gt;minimizing passes over data&lt;&#x2F;em&gt;. This is an
interesting and exciting engineering pursuit largely because it goes
beyond the world of &quot;theoretical standard compiler-book algorithms&quot;
and calls on problem solving to find clever new design tricks.&lt;&#x2F;p&gt;
&lt;p&gt;As we described near the end of this post, &lt;em&gt;correctness&lt;&#x2F;em&gt; is also an
important focus -- perhaps &lt;em&gt;the&lt;&#x2F;em&gt; most important focus -- of any
compiler engineering effort. Given that, I plan to write the next (and
final) post in this series about how we engineered for correctness by
taking a deep dive into the &lt;em&gt;register allocator checker&lt;&#x2F;em&gt;, which is a
novel symbolic checker (which can be seen as an application of
abstract interpretation) that allows us to &lt;em&gt;prove&lt;&#x2F;em&gt; that any particular
register-allocator execution gave a correct allocation result.  I&#x27;ll
talk about how this checker, driven by a fuzzing frontend, found some
&lt;em&gt;really subtle and interesting bugs&lt;&#x2F;em&gt; that we likely never would have
found in production otherwise. With that, until next time, happy
compiling!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;For discussions about this post, please feel free to join us on our Zulip
instance in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;#narrow&#x2F;stream&#x2F;217117-cranelift&#x2F;topic&#x2F;blog.20post.20on.20branch.20optimizations.20in.20new.20backend&quot;&gt;this
thread&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to Benjamin Bouvier for reviewing this post and providing very helpful
feedback! Thanks also to bjorn3 for correcting a typo in a figure.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;Frances E. Allen and John Cocke. &lt;em&gt;A catalogue of optimizing
transformations.&lt;&#x2F;em&gt; In &lt;em&gt;Design and Optimization of Compilers&lt;&#x2F;em&gt;
(Prentice-Hall, 1972), pp. 1--30.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;This is a bit of a simplification of branches in the IR: in CLIF
(and in most other CFG-based compiler IRs), there are several
branch types beyond the two-target conditional form. One is the
&quot;switch&quot; or &quot;branch table&quot; branch
that chooses between N possible targets with an integer
index. There are also simple single-target unconditional
branches; and a return instruction is also a &quot;branch&quot; of sorts
in that it ends a basic block, though it has no successors. The
important takeaway is that IR-level branches are an abstraction
level above machine-code control flow, allowing for a direct
choice between several or many targets as one operation.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;See &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;2083&quot;&gt;PR
#2083&lt;&#x2F;a&gt;
above, which is a bug that arose &lt;em&gt;after&lt;&#x2F;em&gt; I wrote the correctness
proof, because the proof assumed target-aliasing supported
arbitrarily-long branch chains but it actually followed only one
level. This was a deliberate earlier implementation choice to
avoid infinite loops on branch cycles. It turns out that it&#x27;s
possible to just avoid cycles in the alias table by
construction; we carefully prove that this is so and then allow
redirect-chasing through chains of branches. For extra paranoia,
because a non-terminating compiler is bad and we are merely
human, we &lt;em&gt;still&lt;&#x2F;em&gt; limit redirect-chasing to 1 million branches
and panic beyond that (because one can never be too careful);
this is a limit that will never be hit when using the Wasm
frontend (due to limits on function size) and is extremely
unlikely to be hit elsewhere.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>A New Backend for Cranelift, Part 1: Instruction Selection</title>
        <published>2020-09-18T00:00:00+00:00</published>
        <updated>2020-09-18T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2020/09/18/cranelift-isel-1/"/>
        <id>https://cfallin.org/blog/2020/09/18/cranelift-isel-1/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2020/09/18/cranelift-isel-1/">&lt;p&gt;This post is the first in a three-part series about my recent work on
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&quot;&gt;Cranelift&lt;&#x2F;a&gt;
as part of my day job at Mozilla. In this first post, I will set some context
and describe the instruction selection problem. In particular, I&#x27;ll talk about
a revamp to the instruction selector and backend framework in general that
we&#x27;ve been working on for the last nine months or so. This work has been
co-developed with my brilliant colleagues Julian Seward and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;benj.me&quot;&gt;Benjamin
Bouvier&lt;&#x2F;a&gt;, with significant early input from &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sunfishcode&quot;&gt;Dan
Gohman&lt;&#x2F;a&gt; as well, and help from all of the
wonderful Cranelift hackers.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;background-cranelift&quot;&gt;Background: Cranelift&lt;&#x2F;h2&gt;
&lt;p&gt;So what is Cranelift? The project is a compiler framework written in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.rust-lang.org&#x2F;&quot;&gt;Rust&lt;&#x2F;a&gt; that is designed especially (but not
exclusively) for &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Just-in-time_compilation&quot;&gt;just-in-time
compilation&lt;&#x2F;a&gt;. It&#x27;s a
general-purpose compiler: its most popular use-case is to compile
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.webassembly.org&#x2F;&quot;&gt;WebAssembly&lt;&#x2F;a&gt;, though several other frontends
exist, for example,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bjorn3&#x2F;rustc_codegen_cranelift&quot;&gt;cg_clif&lt;&#x2F;a&gt;, which adapts the
Rust compiler itself to use Cranelift. Folks at Mozilla and several other
places have been developing the compiler for a few years now.  It is the
default compiler backend for
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&quot;&gt;wasmtime&lt;&#x2F;a&gt;, a runtime for
WebAssembly outside the browser, and is used in production in several other
places as well. We recently flipped the switch to turn on Cranelift-based
WebAssembly support in nightly Firefox on &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;AArch64&quot;&gt;ARM64
(AArch64)&lt;&#x2F;a&gt; machines, including most
smartphones, and if all goes well, it will eventually go out in a stable
Firefox release. Cranelift is developed under the umbrella of the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;&quot;&gt;Bytecode
Alliance&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;In the past nine months, we have built a new framework in Cranelift for the
&quot;machine backends&quot;, or the parts of the compiler that support particular CPU
instruction sets. We also added a new backend for AArch64, mentioned above, and
filled out features as needed until Cranelift was ready for production use in
Firefox. This blog post sets some context and describes the design process that
went into the backend-framework revamp.&lt;&#x2F;p&gt;
&lt;p&gt;It can be a bit confusing to keep all of the moving parts straight. Here&#x27;s a
visual overview of Cranelift&#x27;s place among various other components, focusing
on two of the major Rust crates (the Wasm frontend and the codegen backend) and
several of the other programs that make use of Cranelift:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-09-10-cranelift-components.svg&quot; alt=&quot;Figure: Cranelift and other components&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;old-backend-design-instruction-legalizations&quot;&gt;Old Backend Design: Instruction Legalizations&lt;&#x2F;h2&gt;
&lt;p&gt;To understand the work that we&#x27;ve done recently on Cranelift, we&#x27;ll need to
zoom into the &lt;code&gt;cranelift_codegen&lt;&#x2F;code&gt; crate above and talk about how it &lt;em&gt;used to&lt;&#x2F;em&gt;
work. What is this &quot;CLIF&quot; input, and how does the compiler translate it to
machine code that the CPU can execute?&lt;&#x2F;p&gt;
&lt;p&gt;Cranelift makes use of
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;docs&#x2F;ir.md&quot;&gt;CLIF&lt;&#x2F;a&gt;,
or the Cranelift IR (Intermediate Representation) Format, to represent the code
that it is compiling. Every compiler that performs program optimizations uses
some form of an &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Intermediate_representation&quot;&gt;Intermediate Representation
(IR)&lt;&#x2F;a&gt;: you can think
of this like a virtual instruction set that can represent all the operations a
program is allowed to do. The IR is typically simpler than real instruction
sets, designed to use a small set of well-defined instructions so that the
compiler can easily reason about what a program means. The IR is also
independent of the CPU architecture that the compiler eventually targets; this
lets much of the compiler (such as the part that generates IR from the input
programming language, and the parts that optimize the IR) be reused whenever
the compiler is adapted to target a new CPU architecture.  CLIF is in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Static_single_assignment_form&quot;&gt;Static
Single Assignment
(SSA)&lt;&#x2F;a&gt; form, and
uses a conventional &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_graph&quot;&gt;control-flow
graph&lt;&#x2F;a&gt; with basic blocks
(it previously allowed extended basic blocks, but these have been phased
out). Unlike many SSA IRs, it represents φ-nodes with block parameters
rather than explicit φ-instructions.&lt;&#x2F;p&gt;
&lt;p&gt;Within &lt;code&gt;cranelift_codegen&lt;&#x2F;code&gt;, before we revamped the backend design, the program
remained in CLIF throughout compilation and up until the compiler emitted the
final machine code. This might seem to contradict what we just said: how can
the IR be machine-independent, but also be the final form from which we emit
machine code?&lt;&#x2F;p&gt;
&lt;p&gt;The answer is that the old backends were built around the concept of
&quot;legalization&quot; and &quot;encodings&quot;. At a high level, the idea is that every
&lt;em&gt;Cranelift&lt;&#x2F;em&gt; instruction either corresponds to one &lt;em&gt;machine&lt;&#x2F;em&gt; instruction, or can
be replaced by a sequence of other &lt;em&gt;Cranelift&lt;&#x2F;em&gt; instructions. Given such a
mapping, we can refine the CLIF in steps, starting from arbitrary
machine-independent instructions from earlier compiler stages, performing edits
until the CLIF corresponds 1-to-1 with machine code. Let&#x27;s visualize this
process:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-09-10-cranelift-legalization.svg&quot; alt=&quot;Figure: legalization by repeated instruction expansion&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A very simple example of a CLIF instruction that has a direct &quot;encoding&quot; to a
machine instruction is &lt;code&gt;iadd&lt;&#x2F;code&gt;, which just adds two integers. On essentially any
modern architecture, this should map to a simple ALU instruction that adds two
registers.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, many CLIF instructions do not map cleanly. Some arithmetic
instructions fall into this category: for example, there is a CLIF instruction
to count the number of set bits in an integer&#x27;s binary representation
(&lt;code&gt;popcount&lt;&#x2F;code&gt;); not every CPU has a single instruction for this, so it might be
expanded into a longer series of bit manipulations. There are operations that
are defined at a higher semantic level, as well, that will necessarily be
lowered with expansions: for example, accesses to Wasm memories are lowered
into operations that fetch the linear memory base and its size, bounds-check
the Wasm address against the limit, compute the real address for the Wasm
address, and perform the access.&lt;&#x2F;p&gt;
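&lt;p&gt;As a concrete illustration, here is the kind of branch-free bit-manipulation
sequence that a &lt;code&gt;popcount&lt;&#x2F;code&gt; expansion might produce on a target with no native
population-count instruction. This is a sketch in Rust of the classic SWAR
technique, not Cranelift&#x27;s actual legalization rule:&lt;&#x2F;p&gt;

```rust
/// Branch-free popcount for a 64-bit value: the kind of bit-manipulation
/// sequence a `popcount` legalization might expand into on a CPU without
/// a dedicated instruction. (Illustrative only, not Cranelift's actual
/// expansion rule.)
pub fn popcount_expansion(x: u64) -> u64 {
    // Count bits in each 2-bit pair.
    let x = x - ((x >> 1) & 0x5555_5555_5555_5555);
    // Sum pairs into 4-bit nibbles.
    let x = (x & 0x3333_3333_3333_3333) + ((x >> 2) & 0x3333_3333_3333_3333);
    // Sum nibbles into bytes.
    let x = (x + (x >> 4)) & 0x0f0f_0f0f_0f0f_0f0f;
    // Multiply sums all bytes into the top byte; shift it down.
    x.wrapping_mul(0x0101_0101_0101_0101) >> 56
}
```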
&lt;p&gt;To compile a function, then, we iterate over the CLIF and find instructions
with no direct machine encodings; for each, we simply expand it into its
legalized sequence, and then recursively consider the instructions in that sequence. We
loop until all instructions have machine encodings. At that point, we can emit
the bytes corresponding to each instruction&#x27;s encoding&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;.&lt;&#x2F;p&gt;
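&lt;p&gt;In outline, this expansion loop looked something like the following toy
sketch; the instruction names and expansion rule here are hypothetical, not the
real implementation:&lt;&#x2F;p&gt;

```rust
/// Toy sketch of fixpoint-style legalization: keep expanding instructions
/// that have no machine encoding until none remain. The instructions and
/// the expansion rule are hypothetical.
#[derive(Clone, Debug, PartialEq)]
pub enum ToyInst {
    Popcount,      // no direct encoding on our toy target
    Shr, And, Add, // directly encodable
}

pub fn has_encoding(inst: &ToyInst) -> bool {
    !matches!(inst, ToyInst::Popcount)
}

/// Expand one unencodable instruction into encodable ones.
pub fn expand(inst: &ToyInst) -> Vec<ToyInst> {
    match inst {
        ToyInst::Popcount => vec![ToyInst::Shr, ToyInst::And, ToyInst::Add],
        other => vec![other.clone()],
    }
}

/// Loop until a pass makes no change (the fixpoint).
pub fn legalize(mut insts: Vec<ToyInst>) -> Vec<ToyInst> {
    while !insts.iter().all(has_encoding) {
        insts = insts.iter().flat_map(expand).collect();
    }
    insts
}
```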
&lt;h2 id=&quot;growing-pains-and-a-new-backend-framework&quot;&gt;Growing Pains, and a New Backend Framework?&lt;&#x2F;h2&gt;
&lt;p&gt;There are a number of advantages to the legacy Cranelift backend design, which
performs expansion-based legalization with a single IR throughout. As one might
expect, though, there are also a number of drawbacks. Let&#x27;s discuss a few of
each.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;single-ir-and-legalization-pros&quot;&gt;Single IR and Legalization: Pros&lt;&#x2F;h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;By operating on a single IR all the way to machine-code emission, the same
optimizations can be applied at multiple stages. For example, consider a
legalization expansion that turns a high-level &quot;access Wasm memory&quot;
instruction into a sequence of loads, adds and bounds-checks. If many such
sequences occur in one function, we might be able to factor out common
portions (e.g.: computing the base of the Wasm memory).  Thus the
legalization scheme exposes as much code as possible, at as many stages as
possible, to opportunities for optimization. The legacy Cranelift pipeline
in fact works in this way: it runs &quot;pre-opt&quot; and &quot;post-opt&quot; optimization
passes, before and after legalization respectively.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;If &lt;em&gt;most&lt;&#x2F;em&gt; Cranelift instructions map to a single machine instruction, and
few legalizations are necessary, then this scheme can be very fast: compilation
becomes little more than a single traversal to fill in &quot;encodings&quot;, which were
represented by small indices into a table.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;single-ir-and-legalization-cons&quot;&gt;Single IR and Legalization: Cons&lt;&#x2F;h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Expansion-based legalization may not always result in
optimal code. So far we&#x27;ve seen that legalization can convert from CLIF to
machine instructions with one-to-one or one-to-many mappings. However, there
are sometimes also &lt;em&gt;single&lt;&#x2F;em&gt; machine instructions that implement the behavior of
&lt;em&gt;multiple&lt;&#x2F;em&gt; CLIF instructions, i.e. a many-to-one mapping. In order to generate
efficient code, we want to be able to make use of these instructions.&lt;&#x2F;p&gt;
&lt;p&gt;For example, on x86, an instruction that references memory can compute an
address like &lt;code&gt;base + scale * index&lt;&#x2F;code&gt;, where &lt;code&gt;base&lt;&#x2F;code&gt; and &lt;code&gt;index&lt;&#x2F;code&gt; are registers
and &lt;code&gt;scale&lt;&#x2F;code&gt; is 1, 2, 4, or 8. There is no notion of such an address mode in
CLIF, so we would want to pattern-match the raw &lt;code&gt;iadd&lt;&#x2F;code&gt; (add) and &lt;code&gt;ishl&lt;&#x2F;code&gt;
(shift) or &lt;code&gt;imul&lt;&#x2F;code&gt; (multiply) operations when they occur in the address
computation. Then, we would want to somehow select the encoding on the
&lt;code&gt;load&lt;&#x2F;code&gt; instruction based on the fact that its input is some specific
combination of adds and shifts&#x2F;multiplies.  This seems to break the
abstraction that the encoding represents only that instruction&#x27;s operation.&lt;&#x2F;p&gt;
&lt;p&gt;In principle, we could implement more general pattern matching for legalization
rules to allow many-to-one mappings. However, this would be a significant
refactor; and as long as we were reconsidering the design in whole, there were
other reasons to avoid patching the problem in this way.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;There is a conceptual difficulty with the single-IR approach: there is
no static representation of which instructions expand into which others, so it
is difficult to reason about the correctness and termination properties
of legalization as a whole.&lt;&#x2F;p&gt;
&lt;p&gt;Specifically, the expansion-based legalization rules must obey a partial
order among instructions: if A expands into a sequence including B, then B
cannot later expand into A. In practice, mappings were mostly one-to-one,
and for those that weren&#x27;t, there was a clear domain separation between the
&quot;input&quot; high-level instructions and the &quot;machine-level&quot; instructions.
However, for more complex machines, or more complex matching schemes that
attempt to make better use of the target instruction set, this could become
a real difficulty for the machine-backend author to keep straight.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;There are efficiency concerns with expansion-based legalization. At
an algorithmic level, we prefer to avoid fixpoint loops (in this case,
&quot;continue expanding until no more expansions exist&quot;) whenever possible. The
runtime is bounded, but the bound is somewhat difficult to reason about,
because it depends on the maximum depth of chained expansions.&lt;&#x2F;p&gt;
&lt;p&gt;The data structures that enable in-place editing are also much slower than
we would like. Typically, compilers store IR instructions in linked lists to
allow for in-place editing. While this is asymptotically as fast as an
array-based solution (we never need to perform random access), it is much
less cache-friendly or
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Instruction-level_parallelism&quot;&gt;ILP&lt;&#x2F;a&gt;-friendly
on modern CPUs. We&#x27;d prefer instead to store arrays of instructions and
perform single passes over them whenever possible.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Our particular implementation of the legalization scheme grew to be
somewhat unwieldy over time. Witness this GitHub issue, in which my eloquent
colleague &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;benj.me&#x2F;&quot;&gt;Benjamin Bouvier&lt;&#x2F;a&gt; describes all the reasons
we&#x27;d like to fix the design: &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;1141&quot;&gt;#1141: Kill Recipes With
Fire&lt;&#x2F;a&gt;. This is no
slight to the engineers who built it; the complexity was managed as well as
could be, with a very nice DSL-based code generation step to produce the
legalizer from high-level rule specifications.  However, reasoning through
legalizations and encodings became more cumbersome than we would prefer, and
the compiler backends were not very accessible to contributors. Adding a new
instruction required learning about &quot;recipes&quot;, &quot;encodings&quot;, and
&quot;legalizations&quot; as well as mere instructions and opcodes, and finding one&#x27;s
way through the DSL to put the pieces together properly. A more conventional
code-lowering approach would avoid much of this complexity.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;A single-level IR has a fundamental tension: for analyses and optimizations
to work well, an IR should have only one way to represent any particular
operation, i.e. should consist of a small set of canonical instructions. On
the other hand, a machine-level representation should represent all of the
relevant details of the target ISA. For example, an address computation
might occur in many different ways (with different addressing modes) on the
machine, but we would prefer not to have to analyze a special
address-computation opcode in all of our analyses. An implicit rule at
emission time (&quot;a load with an add instruction as input always becomes this
addressing mode&quot;) is not ideal, either.&lt;&#x2F;p&gt;
&lt;p&gt;A single IR simply cannot serve both ends of this spectrum properly, and
difficulties arose as CLIF strayed toward either end. To resolve this
conflict, it is best to have a two-level representation, connected by an
explicit instruction selector: this allows CLIF itself to be as simple and as
normalized as possible, while still capturing all the details we need in the
machine-specific instructions.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
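&lt;p&gt;To make the many-to-1 addressing-mode case above concrete, here is a minimal
sketch in Rust of matching &lt;code&gt;base + index * scale&lt;&#x2F;code&gt; over a toy expression tree.
The types and names are hypothetical, not Cranelift&#x27;s:&lt;&#x2F;p&gt;

```rust
/// Toy expression tree for illustrating a many-to-one pattern match.
#[derive(Debug, PartialEq)]
pub enum Expr {
    Reg(u8),
    Const(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// An x86-style `base + index * scale` addressing mode.
#[derive(Debug, PartialEq)]
pub struct AddrMode {
    pub base: u8,
    pub index: u8,
    pub scale: i64,
}

/// Recognize `(Reg base) + ((Reg index) * (Const scale))` with a legal
/// scale, folding several tree nodes into one addressing mode.
pub fn match_addr_mode(e: &Expr) -> Option<AddrMode> {
    if let Expr::Add(lhs, rhs) = e {
        if let (Expr::Reg(base), Expr::Mul(a, b)) = (&**lhs, &**rhs) {
            if let (Expr::Reg(index), Expr::Const(scale)) = (&**a, &**b) {
                if matches!(*scale, 1 | 2 | 4 | 8) {
                    return Some(AddrMode {
                        base: *base,
                        index: *index,
                        scale: *scale,
                    });
                }
            }
        }
    }
    None
}
```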
&lt;p&gt;For all of these reasons, as part of our revamp of Cranelift and a prerequisite
to our new AArch64 backend, we built a new framework for machine backends and
instruction selection. The framework allows machine backends to define their
own instructions, separately from CLIF; rather than legalizing with expansions
and running until a fixpoint, we define a single lowering pass; and everything
is built around more efficient data-structures, carefully optimizing passes
over data and avoiding linked lists entirely. We now describe this new design!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-new-ir-vcode&quot;&gt;A New IR: VCode&lt;&#x2F;h2&gt;
&lt;p&gt;The main idea of the new Cranelift backend is to &lt;em&gt;add a machine-specific IR&lt;&#x2F;em&gt;,
with several properties that are chosen specifically to represent machine-code
well (i.e., the IR is very close to machine code). We call this &lt;code&gt;VCode&lt;&#x2F;code&gt;, which
comes from &quot;virtual-register code&quot;, and the VCode contains &lt;code&gt;MachInst&lt;&#x2F;code&gt;s, or
machine instructions. The key design choices we made are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;VCode is a linear sequence of instructions. There is control-flow
information that allows traversal over basic blocks, but the data structures
are not designed to easily allow inserting or removing instructions or
reordering code. Instead, we lower into VCode with a single pass,
generating instructions in their final (or near-final) order. I&#x27;ll write more
about how we make this efficient in a follow-up post.&lt;&#x2F;p&gt;
&lt;p&gt;This design aspect avoids the inefficiencies of linked-list data structures,
allowing fast passes over arrays of instructions instead. We&#x27;ve kept the
&lt;code&gt;MachInst&lt;&#x2F;code&gt; size relatively small (16 bytes per instruction for AArch64)
which aids code generation and iteration speed as well.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;VCode is &lt;em&gt;not&lt;&#x2F;em&gt; SSA-based; instead, its instructions operate on registers.
While lowering, we allocate virtual registers. After the VCode is generated,
the register allocator computes appropriate register assignments and edits
the instructions in-place, replacing virtual registers with real registers.
(Both are packed into a 32-bit representation space, using the high bit to
distinguish virtual from real.)&lt;&#x2F;p&gt;
&lt;p&gt;Eschewing SSA at this level allows us to avoid the overhead of maintaining
its invariants, and maps more closely to the real machine. Lowerings for
instructions are allowed to, e.g., use a destination register as a temporary
before performing a final write into it. If we required SSA form, we would
have to allocate a temporary in this case and rely on the register allocator
to coalesce it back to the same register, which adds compile-time overhead.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;VCode is a container for &lt;code&gt;MachInst&lt;&#x2F;code&gt;s, but there is a separate &lt;code&gt;MachInst&lt;&#x2F;code&gt;
type for each machine backend. The machine-independent part is parameterized
on &lt;code&gt;MachInst&lt;&#x2F;code&gt; (which is a trait in Rust) and is statically monomorphized to
the particular target for which the compiler is built.&lt;&#x2F;p&gt;
&lt;p&gt;Modeling a machine instruction with Rust&#x27;s excellent facilities for
strongly-typed data structures, such as &lt;code&gt;enum&lt;&#x2F;code&gt;s, avoids the issue of muddled
instruction domain (is a CLIF instruction machine-independent,
machine-dependent, or both?) and allows each backend to store the appropriate
information for its encoding.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
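&lt;p&gt;The packed virtual&#x2F;real register representation mentioned above can be
sketched as follows. This is a minimal sketch assuming one particular bit
layout; Cranelift&#x27;s actual register types live in its register-allocator
code:&lt;&#x2F;p&gt;

```rust
/// Sketch of packing virtual and real registers into one 32-bit space,
/// using the high bit to distinguish them (a hypothetical layout).
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Reg(u32);

const VIRTUAL_BIT: u32 = 1 << 31;

impl Reg {
    /// A virtual register, allocated freely during lowering.
    pub fn virtual_reg(index: u32) -> Reg {
        assert!(index < VIRTUAL_BIT);
        Reg(index | VIRTUAL_BIT)
    }
    /// A real machine register, pinned by an instruction or the ABI.
    pub fn real_reg(hw_enc: u32) -> Reg {
        assert!(hw_enc < VIRTUAL_BIT);
        Reg(hw_enc)
    }
    pub fn is_virtual(self) -> bool {
        self.0 & VIRTUAL_BIT != 0
    }
    /// The register number with the discriminator bit stripped.
    pub fn index(self) -> u32 {
        self.0 & !VIRTUAL_BIT
    }
}
```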
&lt;p&gt;One can visualize a VCode function body as consisting of the following
information (simplified; a real example is further below):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-09-10-vcode.svg&quot; alt=&quot;Figure: VCode is an array of instructions with block information&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Note that the instructions are simply stored in an array, and the basic blocks
are recorded separately as ranges of array (instruction) indices. As we
described above, we designed this data structure for fast iteration, but not
for editing. We always ensure that the first block (&lt;code&gt;b0&lt;&#x2F;code&gt;) is the entry block,
and that consecutive block indices have contiguous instruction-index ranges
(i.e., are placed next to each other).&lt;&#x2F;p&gt;
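&lt;p&gt;A minimal sketch of this layout, with hypothetical field names, shows why
iteration is cheap: each block is just a contiguous slice of the instruction
array, with no pointer chasing:&lt;&#x2F;p&gt;

```rust
/// Sketch of the VCode layout described above: a flat instruction array
/// plus basic blocks stored as half-open ranges of instruction indices.
/// (Simplified; field names are hypothetical.)
pub struct VCode<I> {
    pub insts: Vec<I>,
    /// For each block, the `[start, end)` range of indices into `insts`.
    pub block_ranges: Vec<(u32, u32)>,
}

impl<I> VCode<I> {
    /// The instructions of one block, as a contiguous slice of the array.
    pub fn block_insts(&self, block: usize) -> &[I] {
        let (start, end) = self.block_ranges[block];
        &self.insts[start as usize..end as usize]
    }
}
```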
&lt;p&gt;Each instruction is mostly opaque from the point of view of the VCode
container, with a few exceptions: every instruction exposes its (i) register
references, and (ii) basic-block targets, if a branch. Register references are
categorized into the usual &quot;uses&quot; and &quot;defs&quot; (reads and writes).&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Note as well that the instructions can refer to &lt;em&gt;either&lt;&#x2F;em&gt; virtual registers
(here denoted &lt;code&gt;v0&lt;&#x2F;code&gt;..&lt;code&gt;vN&lt;&#x2F;code&gt;) &lt;em&gt;or&lt;&#x2F;em&gt; real machine registers (here denoted
&lt;code&gt;r0&lt;&#x2F;code&gt;..&lt;code&gt;rN&lt;&#x2F;code&gt;). This design choice allows the machine backend to make use of
specific registers where required by particular instructions, or by the ABI
(parameter-passing conventions). The semantics of VCode are such that the
register allocator recognizes &lt;em&gt;live ranges&lt;&#x2F;em&gt; of the real registers, from defs to
uses, and avoids allocating virtual registers to those particular real
registers for their live ranges. After allocation, all machine instructions are
edited in place to refer only to real registers.&lt;&#x2F;p&gt;
&lt;p&gt;Aside from registers and branch targets, an instruction contained in the VCode
may contain whatever other information is necessary to emit machine code. Each
machine backend defines its own type to store this information. For example, on
AArch64, here are several of the instruction formats, simplified:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword z-storage&quot;&gt;pub enum&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Inst&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; An ALU operation with two register sources and a register destination.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;    AluRRR&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        alu_op&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; ALUOp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        rd&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Writable&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        rn&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        rm&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;    },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; An ALU operation with a register source and an immediate-12 source, and a register&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; destination.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;    AluRRImm12&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        alu_op&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; ALUOp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        rd&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Writable&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        rn&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        imm12&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Imm12&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;    },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; A MOVZ with a 16-bit immediate.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;    MovZ&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        rd&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Writable&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        imm&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; MoveWideConst&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        size&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; OperandSize&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;    },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; A two-way conditional branch. Contains two targets; at emission time, a conditional&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; branch instruction followed by an unconditional branch instruction is emitted, but&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; the emission buffer will usually collapse this to just one branch. See a follow-up&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; blog post for more!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;    CondBr&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        taken&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; BranchTarget&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        not_taken&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; BranchTarget&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        kind&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; CondBrKind&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;    },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These enum arms could be considered similar to &quot;encodings&quot; in the old backend,
except that they are defined in a much more straightforward way. Whereas old
Cranelift backends had to define instruction encodings using a DSL, and these
encodings were assigned a numeric index and a special bit-packed encoding for
additional instruction parameters, here the instructions are simply stored in
type-safe and easy-to-use Rust data structures.&lt;&#x2F;p&gt;
&lt;p&gt;We will not discuss the VCode data-structure design or instruction interface
much further, except to note that the relevant instruction-emission
functionality for a new machine backend can be implemented by providing
a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;cb306fd514f34e7dd818bb17658b93fba98e2567&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;mod.rs#L140&quot;&gt;&lt;code&gt;MachInst&lt;&#x2F;code&gt; trait
implementation&lt;&#x2F;a&gt;
for one&#x27;s instruction type (and then lowering into it; see below). We believe,
and early experience seems to indicate, that this is a much easier task
than what was required to develop a backend in Cranelift&#x27;s old DSL-based
framework.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;lowering-from-clif-to-vcode&quot;&gt;Lowering from CLIF to VCode&lt;&#x2F;h2&gt;
&lt;p&gt;We&#x27;ve now come to the most interesting design question: how do we lower from
CLIF instructions, which are machine-independent, into VCode with the
appropriate type of CPU instructions? In other words, what have we replaced the
expansion-based legalization and encoding scheme with?&lt;&#x2F;p&gt;
&lt;p&gt;In short, the scheme is a &lt;em&gt;single pass&lt;&#x2F;em&gt; over the CLIF instructions, and at each
instruction, we invoke a function provided by the machine backend to lower the
CLIF instruction into VCode instruction(s). The backend is given a
&quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;cb306fd514f34e7dd818bb17658b93fba98e2567&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;lower.rs#L58&quot;&gt;lowering
context&lt;&#x2F;a&gt;&quot;
by which it can examine the instruction and the values that flow into it,
performing &quot;tree matching&quot; as desired (see below). This naturally allows
1-to-1, 1-to-many, or many-to-1 translations. We incorporate a
reference-counting scheme into this pass to ensure that instructions are only
generated if their values are actually used; this is necessary to eliminate
dead code when many-to-1 matches occur.&lt;&#x2F;p&gt;
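&lt;p&gt;The reference-counting idea can be sketched with a toy backward pass: an
instruction is emitted only if its result is demanded by something already
emitted (or it has a side effect), so values swallowed entirely by a
many-to-1 match produce no code. This is an illustrative sketch over a
hypothetical toy IR, not Cranelift&#x27;s actual lowering code:&lt;&#x2F;p&gt;

```rust
/// A toy SSA-like instruction: one result value, some argument values,
/// and a flag for side effects (e.g. stores). Hypothetical, for
/// illustrating use-count-driven lowering.
#[derive(Clone)]
pub struct ToyInst {
    pub result: u32,
    pub args: Vec<u32>,
    pub has_side_effect: bool,
}

/// Walk the instruction list backward; lower an instruction only if its
/// result is referenced by an already-lowered instruction or it has a
/// side effect. Returns the results of the instructions actually lowered,
/// in program order.
pub fn lower_live(insts: &[ToyInst]) -> Vec<u32> {
    let mut use_count: std::collections::HashMap<u32, u32> =
        std::collections::HashMap::new();
    let mut lowered = Vec::new();
    for inst in insts.iter().rev() {
        let used = inst.has_side_effect
            || use_count.get(&inst.result).copied().unwrap_or(0) > 0;
        if used {
            // Lowering this instruction makes its arguments live.
            for &arg in &inst.args {
                *use_count.entry(arg).or_insert(0) += 1;
            }
            lowered.push(inst.result);
        }
    }
    lowered.reverse();
    lowered
}
```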
&lt;h3 id=&quot;tree-matching&quot;&gt;Tree Matching&lt;&#x2F;h3&gt;
&lt;p&gt;Recall that the old design allowed for 1-to-1 and 1-to-many mappings from CLIF
instructions to machine instructions, but not many-to-1. This is particularly
problematic when it comes to pattern-matching for things like addressing modes,
where we want to recognize particular combinations of operations and choose a
specific instruction that covers all of those operations at once.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s start by defining a &quot;tree&quot; that is rooted at a particular CLIF
instruction. For each argument to the instruction, we can look &quot;up&quot; the program
to find its producer (def). Because CLIF is in SSA form, either the instruction
argument is an ordinary value, which must have exactly one definition, or it is
a block parameter (φ-node in conventional SSA formulations) that represents
multiple possible definitions. We will say that if we reach a block parameter
(φ-node), we simply end at a tree leaf -- it is perfectly alright to
pattern-match on a tree that is a &lt;em&gt;subset&lt;&#x2F;em&gt; of the true dataflow (we might get
suboptimal code, but it will still be correct). For example, given the CLIF
code:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;block0(v0: i64, v1: i64, v7: b1):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  brnz v7, block1(v0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  jump block1(v1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;block1(v2: i64):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v3 = iconst.i64 64&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v4 = iadd.i64 v2, v3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v5 = iadd.i64 v4, v0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v6 = load.i64 v5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return v6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;let&#x27;s consider the load instruction: &lt;code&gt;v6 = load.i64 v5&lt;&#x2F;code&gt;. A simple code
generator could map this 1-to-1 to the CPU&#x27;s ordinary load instruction, using
the register holding &lt;code&gt;v5&lt;&#x2F;code&gt; as an address. This would certainly be correct.
However, we might be able to do better: for example, on AArch64, the available
addressing modes include a two-register sum &lt;code&gt;ldr x0, [x1, x2]&lt;&#x2F;code&gt; or a register
with a constant offset &lt;code&gt;ldr x0, [x1, #64]&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The &quot;operand tree&quot; might be drawn like this:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-09-10-load-operands.svg&quot; alt=&quot;Figure: operand tree for load instruction&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We stop at &lt;code&gt;v2&lt;&#x2F;code&gt; and &lt;code&gt;v0&lt;&#x2F;code&gt; because they are block parameters; we don&#x27;t know with
certainty which instruction will produce these values. We can replace &lt;code&gt;v3&lt;&#x2F;code&gt; with
the constant &lt;code&gt;64&lt;&#x2F;code&gt;. Given this view, the lowering process for the load
instruction can fairly easily choose an addressing mode. (On AArch64, the code
to make this choice is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;cb306fd514f34e7dd818bb17658b93fba98e2567&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;aarch64&#x2F;lower.rs#L653&quot;&gt;here&lt;&#x2F;a&gt;;
in this case it would choose the register + constant immediate form, generating
a separate add instruction for &lt;code&gt;v0 + v2&lt;&#x2F;code&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;Note that we do not actually explicitly construct an operand tree during
lowering. Instead, the machine backend can query each instruction input, and
the lowering framework will provide &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;cb306fd514f34e7dd818bb17658b93fba98e2567&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;lower.rs#L176&quot;&gt;a
struct&lt;&#x2F;a&gt;
giving the producing instruction if known, the constant value if known, and the
register that will hold the value if needed. The backend may traverse up the
tree (via the &quot;producing instruction&quot;) as many times as needed. If it cannot
combine the operation of an instruction further up the tree into the root
instruction, it can simply use the value in the register at that point instead;
it is always safe (though possibly suboptimal) to generate machine instructions
for only the root instruction.&lt;&#x2F;p&gt;
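&lt;p&gt;To make this concrete, here is a deliberately tiny sketch, in Rust with invented names (this is &lt;em&gt;not&lt;&#x2F;em&gt; Cranelift&#x27;s actual lowering API), of how a backend might choose an addressing mode after querying one level of the operand tree:&lt;&#x2F;p&gt;

```rust
// Hypothetical sketch: choosing an AArch64-style addressing mode for a
// load after querying one level of the operand tree. Names and shapes
// are illustrative only; the real lowering context is much richer.
#[derive(Debug, PartialEq)]
enum AddrMode {
    RegOnly,     // base register only: the always-correct fallback
    RegReg,      // base plus index register
    RegImm(i64), // base plus small constant offset
}

// producer_is_add: whether the address value's producing instruction is
// visible and is an integer add (false for block parameters).
// rhs_is_const / rhs_val: what we learned about the add's second input.
fn choose_addr_mode(producer_is_add: bool, rhs_is_const: bool, rhs_val: i64) -> AddrMode {
    if !producer_is_add {
        // Unknown producer (e.g. a block parameter): use the register
        // form, which is always safe even if suboptimal.
        return AddrMode::RegOnly;
    }
    if rhs_is_const {
        // Unsigned 12-bit immediate field, as on AArch64 loads.
        if rhs_val >= 0 {
            if 4096 > rhs_val {
                return AddrMode::RegImm(rhs_val);
            }
        }
    }
    // An add of two non-constant values: use the two-register form.
    AddrMode::RegReg
}
```

&lt;p&gt;Note how the plain-register form serves as the universal fallback, mirroring the &quot;subset of the true dataflow&quot; property above.&lt;&#x2F;p&gt;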
&lt;h3 id=&quot;lowering-an-instruction&quot;&gt;Lowering an Instruction&lt;&#x2F;h3&gt;
&lt;p&gt;Given this matching strategy, then, how do we actually do the translation?
Basically, the backend provides a function that is called once per CLIF
instruction, at the &quot;root&quot; of the operand tree, and can produce as many machine
instructions as it likes. This function is essentially just a large &lt;code&gt;match&lt;&#x2F;code&gt;
statement over the opcode of the root CLIF instruction, with the match-arms
looking deeper as needed.&lt;&#x2F;p&gt;
&lt;p&gt;Here is a simplified version of the match-arm for an integer add operation
lowered to AArch64 (the full version is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;cb306fd514f34e7dd818bb17658b93fba98e2567&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;aarch64&#x2F;lower_inst.rs#L75&quot;&gt;here&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;match&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; op&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;    Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;::&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Iadd&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;        let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; rd&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; get_output_reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; outputs&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;        let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; rn&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; put_input_in_reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; inputs&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;        let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; rm&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; put_input_in_rse_imm12&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; inputs&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;1&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;        let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; alu_op&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; choose_32_64&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;ty&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span&gt; ALUOp&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;::&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Add32&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span&gt; ALUOp&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;::&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Add64&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        ctx&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;emit&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;alu_inst_imm12&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;alu_op&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; rd&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; rn&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; rm&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There is some magic that happens in several helper functions here.
&lt;code&gt;put_input_in_reg()&lt;&#x2F;code&gt; invokes the proper methods on the &lt;code&gt;ctx&lt;&#x2F;code&gt; to look up the
register that holds an input value. &lt;code&gt;put_input_in_rse_imm12()&lt;&#x2F;code&gt; is more
interesting: it returns a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;cb306fd514f34e7dd818bb17658b93fba98e2567&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;aarch64&#x2F;lower.rs#L63&quot;&gt;&lt;code&gt;ResultRSEImm12&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which is a &quot;register, shifted register, extended register, or 12-bit
immediate&quot;. This set of choices captures all of the options we have for the
second argument of an add instruction on AArch64. The helper looks at the node
in the operand tree and attempts to match either a shift or zero&#x2F;sign-extend
operator, which can be incorporated directly into the add. It also checks
whether the operand is a constant and, if so, whether it fits into a 12-bit
immediate field. If none of these forms applies, it falls back to simply using
the register input.
&lt;code&gt;alu_inst_imm12()&lt;&#x2F;code&gt; then breaks down this enum and chooses the appropriate
&lt;code&gt;Inst&lt;&#x2F;code&gt; arm (&lt;code&gt;AluRRR&lt;&#x2F;code&gt;, &lt;code&gt;AluRRRShift&lt;&#x2F;code&gt;, &lt;code&gt;AluRRRExtend&lt;&#x2F;code&gt;, or &lt;code&gt;AluRRImm12&lt;&#x2F;code&gt;
respectively).&lt;&#x2F;p&gt;
&lt;p&gt;And that&#x27;s it! No need for legalization and repeated code editing to match
several operations and produce a machine instruction. We have found this way of
writing lowering logic to be quite straightforward and easy to understand.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;backward-pass-with-use-counts&quot;&gt;Backward Pass with Use-Counts&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we can lower a single instruction, how do we lower a function body
with many instructions? This is not quite as straightforward as looping over
the instructions and invoking the match-over-opcode function described above
(though that would actually work). In particular, we want to handle the
many-to-1 case more efficiently. Consider what happens when the
add-instruction logic above is able to incorporate, say, a left-shift
operator into the add instruction. The &lt;code&gt;add&lt;&#x2F;code&gt; machine instruction would then use
the &lt;em&gt;shift&lt;&#x2F;em&gt;&#x27;s input register, and completely ignore the shift&#x27;s output. If the
shift operator has no other uses, we should avoid doing the computation
entirely; otherwise, there was no point in merging the operation into the add.&lt;&#x2F;p&gt;
&lt;p&gt;We implement a sort of reference counting to solve this problem. In particular,
we track whether any given SSA value is actually used, and we only generate
code for a CLIF instruction if any of its results are used (or if it has a
side-effect that must occur). This is a form of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Dead_code_elimination&quot;&gt;dead-code
elimination&lt;&#x2F;a&gt; but
integrated into the single lowering pass.&lt;&#x2F;p&gt;
&lt;p&gt;To know whether a value is used, we simply track a counter per value,
initialized to zero. Whenever the machine backend uses a register input (as
opposed to using a constant value directly, or incorporating the producing
instruction&#x27;s operation), it notifies the lowering driver that this register
has been used.&lt;&#x2F;p&gt;
&lt;p&gt;We must see uses before defs for this to work. Thus, we iterate over
the function body &quot;backward&quot;. Specifically, we iterate in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Depth-first_search#Vertex_orderings&quot;&gt;postorder&lt;&#x2F;a&gt;;
this way, all instructions are seen before instructions that
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Dominator_(graph_theory)&quot;&gt;dominate&lt;&#x2F;a&gt;
them, so given SSA form, we see uses before defs.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we have to consider side-effects carefully. This matters in two ways.
First, if an instruction has a side-effect, then we must lower it into VCode
even if its result(s) have no uses. Second, we cannot allow an operation to be
merged into another if this would move a side-effecting operation over another
or alter whether it might execute. We ensure side-effect correctness with a
&quot;coloring&quot; scheme: in a forward pass, we assign a color to every instruction,
updating the color at every side effect and at every new basic block. A
producing instruction is then considered for merging into its consumer only if
it has no side effects (hence it can always be moved) or if it has the same
color as the consumer (hence it would not be moved over another side
effect).&lt;&#x2F;p&gt;
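&lt;p&gt;The coloring pass itself is tiny. Here is an illustrative sketch (invented names and a fixed-size toy block, not the real Cranelift code) of the forward pass and the resulting merge check:&lt;&#x2F;p&gt;

```rust
// Sketch of the side-effect coloring described above. Instruction i
// defines value i; effects[i] says whether instruction i has a side
// effect. Names and the fixed size N are illustrative only.
const N: usize = 6;

// Forward pass: every instruction gets the current color; each side
// effect starts a new color for the instructions after it.
fn compute_colors(effects: [bool; N]) -> [u32; N] {
    let mut colors = [0u32; N];
    let mut color = 0u32;
    for i in 0..N {
        colors[i] = color;
        if effects[i] {
            color += 1;
        }
    }
    colors
}

// A producer may merge into its consumer if it is pure (can always be
// sunk to the use) or if no side effect separates the two (same color).
fn can_merge(effects: [bool; N], colors: [u32; N], producer: usize, consumer: usize) -> bool {
    !effects[producer] || colors[producer] == colors[consumer]
}
```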
&lt;p&gt;The lowering procedure is as follows (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;lower.rs#L693&quot;&gt;full version
here&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Compute instruction colors based on side-effects.&lt;&#x2F;li&gt;
&lt;li&gt;Allocate virtual registers to all SSA values. It&#x27;s OK if we don&#x27;t use some;
an unused virtual register will not be allocated any real register.&lt;&#x2F;li&gt;
&lt;li&gt;Iterate in postorder over instructions. If the instruction has a
side-effect, or if any of its results are used, call into the
machine backend to lower it.&lt;&#x2F;li&gt;
&lt;li&gt;Reverse the VCode instructions so that they appear in forward order. &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Easy!&lt;&#x2F;p&gt;
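&lt;p&gt;Step 3 is where the use-counting does its work. A minimal sketch of the backward walk, with invented names and a toy four-instruction block rather than Cranelift&#x27;s real data structures:&lt;&#x2F;p&gt;

```rust
// Toy backward pass: instruction i defines value i; inputs[i] lists the
// values it reads; instruction 3 has a side effect (say, a return).
// Instruction 1 is dead: nothing uses its value, so it is never lowered.
fn backward_pass() -> [bool; 4] {
    let inputs = vec![vec![], vec![0], vec![0], vec![2]];
    let side_effect = [false, false, false, true];
    let mut use_count = [0u32; 4];
    let mut lowered = [false; 4];

    // Visit backward: every use is counted before its def is visited.
    for i in (0..4).rev() {
        if side_effect[i] || use_count[i] > 0 {
            lowered[i] = true;
            // Lowering this instruction consumes its register inputs.
            // (A backend that merged an input's producer instead would
            // simply not bump that input's count.)
            for v in inputs[i].clone() {
                use_count[v] += 1;
            }
        }
    }
    lowered
}
```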
&lt;h3 id=&quot;examples&quot;&gt;Examples&lt;&#x2F;h3&gt;
&lt;p&gt;Let&#x27;s see how this works in real life! Consider the following CLIF code:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function %f25(i32, i32) -&amp;gt; i32 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;block0(v0: i32, v1: i32):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v2 = iconst.i32 21&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v3 = ishl.i32 v0, v2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v4 = isub.i32 v1, v3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return v4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We expect that the left-shift (&lt;code&gt;ishl&lt;&#x2F;code&gt;) operation should be merged into the
subtract operation on AArch64, using the reg-reg-shift form of ALU instruction,
and indeed this happens (here I am showing the debug-dump format one can see
with &lt;code&gt;RUST_LOG=debug&lt;&#x2F;code&gt; when running &lt;code&gt;clif-util compile -d --target aarch64&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;VCode {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Entry block: 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Block 0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (original IR block: block0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (instruction range: 0 .. 6)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Inst 0:   mov %v0J, x0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Inst 1:   mov %v1J, x1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Inst 2:   sub %v4Jw, %v1Jw, %v0Jw, LSL 21&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Inst 3:   mov %v5J, %v4J&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Inst 4:   mov x0, %v5J&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Inst 5:   ret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This then passes through the register allocator, has a prologue and epilogue
attached (we cannot generate these until we know which registers are clobbered),
has redundant moves elided, and becomes:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;stp fp, lr, [sp, #-16]!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov fp, sp&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;sub w0, w1, w0, LSL 21&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov sp, fp&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ldp fp, lr, [sp], #16&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;which is a perfectly valid function, correct and callable from C, on
AArch64! (We could do better if we knew that this were a leaf function and
avoided the stack-frame setup and teardown! Alas, many optimization
opportunities remain.)&lt;&#x2F;p&gt;
&lt;p&gt;There are many other examples of interesting instruction-selection cases in our
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&#x2F;filetests&#x2F;filetests&#x2F;vcode&#x2F;aarch64&quot;&gt;filetests&lt;&#x2F;a&gt;.
One of our favorite pastimes lately is to stare at disassemblies and find inefficient
translations, improving the pattern-matching as required, so these are slowly
getting better (my brilliant colleague Julian Seward has built a custom tool
that dumps the hottest basic blocks from a given JIT execution and has found
quite a number of improvements in our AArch64 and x86-64 backends).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;next-efficient-code-generation-passes-and-checking-the-register-allocator&quot;&gt;Next: Efficient Code-Generation Passes, and Checking the Register Allocator&lt;&#x2F;h2&gt;
&lt;p&gt;I&#x27;ve covered a lot of ground in this post, but there&#x27;s still a lot more to say
about the new Cranelift backend framework!&lt;&#x2F;p&gt;
&lt;p&gt;In the second post, I&#x27;d like to talk about how we designed the passes &lt;em&gt;after&lt;&#x2F;em&gt;
VCode lowering to be as efficient as possible. In particular this will involve
the way in which we simplify branches, which avoids the more usual step-by-step
process of removing empty basic blocks and flipping branch conditions and
taking advantage of fallthrough paths, instead doing last-minute edits as the
binary code is being emitted (see the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;buffer.rs&quot;&gt;&lt;code&gt;MachBuffer&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
implementation for all the details).&lt;&#x2F;p&gt;
&lt;p&gt;Then, in the third post, I&#x27;ll talk about how I&#x27;ve used abstract interpretation
to build a symbolic checker for our register allocator, which has been
effective at finding several interesting bugs while fuzzing.&lt;&#x2F;p&gt;
&lt;p&gt;Stay tuned!&lt;&#x2F;p&gt;
&lt;p&gt;In the meantime, for any and all discussions about Cranelift, please feel free
to join us on our &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;&quot;&gt;Bytecode Alliance Zulip
chat&lt;&#x2F;a&gt; (here&#x27;s a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;#narrow&#x2F;stream&#x2F;217117-cranelift&#x2F;topic&#x2F;blog.20post.20on.20new.20backend&quot;&gt;topic&lt;&#x2F;a&gt;
for this post)!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to Julian Seward and Benjamin Bouvier for reviewing this post and
suggesting several additions and corrections.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;Note that this description skips several quite important steps that come
after instructions have encodings. Most importantly, we still must perform
&lt;em&gt;register allocation&lt;&#x2F;em&gt;, which chooses machine registers to hold each value in
the IR. This may involve inserting instructions as well, when values need to
be spilled to or reloaded from the stack or simply moved between registers.
Then, after several other housekeeping tasks (such as resolving branches and
optimizing their forms for the actual machine-code offsets), we can actually
use the encodings to emit machine code.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;We also support a &quot;mod&quot; (modify) type of register reference that is both
a use and def, while ensuring that the same register is allocated for the
use- and the def-points. This replaces an earlier mechanism known as &quot;tied
operands&quot; that introduced an ad-hoc constraint to the register allocator.
Mods instead are handled by simply extending the live-range through the
instruction.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;The reversal scheme is actually a bit more subtle than this. We want to
emit instructions in forward order within the lowering for a single CLIF
instruction, but we visit CLIF instructions backward. To make this work, we
keep a buffer of lowered VCode instructions per CLIF instruction in forward
order; at the end of a single CLIF instruction, these are copied in reverse
order to a buffer of lowered VCode instructions for the basic block. Because
we visit instructions within the block backward, this buffer contains the
VCode sequence for the basic block in reverse order. Then, at the end of the
block, we reverse it again onto the tail of the VCode buffer. The end result
is that we see VCode instructions in forward order for each CLIF instruction
in forward order, contained within basic blocks in forward order (phew!).&lt;&#x2F;p&gt;
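&lt;p&gt;In miniature, with strings standing in for VCode instructions (an illustrative sketch, not the real buffering code):&lt;&#x2F;p&gt;

```rust
// The double reversal: per-CLIF-instruction VCode is copied in reverse
// into a per-block buffer; the block buffer is reversed once more at
// the end, restoring forward order overall.
fn lower_block() -> String {
    // VCode for each CLIF instruction, in forward order; the CLIF
    // instructions themselves are visited backward.
    let per_clif = vec![
        vec!["const", "add"], // CLIF inst A lowers to two VCode insts
        vec!["mul"],          // CLIF inst B lowers to one
    ];
    let mut block_buf = vec![];
    for clif in per_clif.iter().rev() {
        for vcode in clif.iter().rev() {
            block_buf.push(*vcode);
        }
    }
    // block_buf is now fully reversed; reverse once more to restore
    // forward order for the whole block.
    block_buf.reverse();
    block_buf.join(" ")
}
```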
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>blog.cfallin is live!</title>
        <published>2020-09-17T00:00:00+00:00</published>
        <updated>2020-09-17T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/first-post/"/>
        <id>https://cfallin.org/blog/first-post/</id>
        <content type="html" xml:base="https://cfallin.org/blog/first-post/">&lt;p&gt;Hello, and welcome to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;blog&#x2F;&quot;&gt;blog.cfallin&lt;&#x2F;a&gt;! I&#x27;ve thought for
a while that it might be nice to share, occasionally, some thoughts on
whatever technical tidbits interest me. This blog will likely be home to
assorted ramblings on compilers, runtimes, and the like; you can find a bit
more about my background at &lt;a href=&quot;&#x2F;about&#x2F;&quot;&gt;&#x27;About&#x27;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;My first post, coming soon, will be about the new compiler backend framework
I&#x27;ve developed (along with extremely capable co-conspirators) for
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&quot;&gt;Cranelift&lt;&#x2F;a&gt;,
a compiler in Rust that will soon be used in production in Firefox, among other
places. Stay tuned.&lt;&#x2F;p&gt;
</content>
    </entry>
</feed>
