<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <title>cfallin.org</title>
    <link rel="self" type="application/atom+xml" href="https://cfallin.org/feed.xml"/>
    <link rel="alternate" type="text/html" href="https:&#x2F;&#x2F;cfallin.org"/>
    <generator uri="https://www.getzola.org/">Zola</generator>
    <updated>2026-04-10T08:54:50.104619253-07:00</updated>
    <id>https://cfallin.org/feed.xml</id>
    <entry xml:lang="en">
        <title>The acyclic e-graph: Cranelift&#x27;s mid-end optimizer</title>
        <published>2026-04-09T00:00:00+00:00</published>
        <updated>2026-04-09T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2026/04/09/aegraph/"/>
        <id>https://cfallin.org/blog/2026/04/09/aegraph/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2026/04/09/aegraph/">&lt;p&gt;Today, I&#x27;ll be writing about the &lt;em&gt;aegraph&lt;&#x2F;em&gt;, or &lt;em&gt;acyclic egraph&lt;&#x2F;em&gt;, the
data structure at the heart of Cranelift&#x27;s mid-end optimizer. I
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;cranelift-egraph.md&quot;&gt;introduced this approach in
2022&lt;&#x2F;a&gt;
and, after a somewhat circuitous path involving one full rewrite, a
number of interesting realizations and &quot;patches&quot; to the initial idea,
various discussions with the wider e-graph community (including a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;vimeo.com&#x2F;843540328&quot;&gt;talk&lt;&#x2F;a&gt;
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;pubs&#x2F;egraphs2023_aegraphs_slides.pdf&quot;&gt;slides&lt;&#x2F;a&gt;)
at the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;pldi23.sigplan.org&#x2F;home&#x2F;egraphs-2023&quot;&gt;EGRAPHS workshop at PLDI
2023&lt;&#x2F;a&gt; and a recent talk
and discussions at the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.dagstuhl.de&#x2F;en&#x2F;seminars&#x2F;seminar-calendar&#x2F;seminar-details&#x2F;26022&quot;&gt;e-graphs Dagstuhl
seminar&lt;&#x2F;a&gt;),
and a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;commits&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;opts&quot;&gt;whole bunch of contributed rewrite rules over the past three
years&lt;&#x2F;a&gt;,
it is time that I describe the &lt;em&gt;why&lt;&#x2F;em&gt; (why an e-graph? what benefits
does it bring?), the &lt;em&gt;how&lt;&#x2F;em&gt; (how did we escape the pitfalls of full
equality saturation? how did we make this efficient enough to
productionize in Cranelift?), and the &lt;em&gt;how much&lt;&#x2F;em&gt; (does it help? how
can we evaluate it against alternatives?).&lt;&#x2F;p&gt;
&lt;p&gt;For those who are already familiar with Cranelift&#x27;s mid-end and its
aegraph, note that I&#x27;m taking a slightly different approach in this
post. I&#x27;ve come to the viewpoint that the &quot;sea-of-nodes&quot; aspect of our
aegraph, and the translation passes we&#x27;ve designed to translate into
and out of it (with optimizations fused in along the way), are
actually more fundamental than the &quot;multi-representation&quot; part of the
aegraph, or in other words, the &quot;equivalence class&quot; part itself. I&#x27;m
choosing to introduce the ideas sea-of-nodes-first in this post,
so we will see a &quot;trivial eclass of one enode&quot; version of the aegraph
first (no union nodes), then motivate unions later. In actuality, when
I was experimenting with, and then building, this functionality in Cranelift in
2022, the desire to integrate e-graphs came first, and aegraphs were
created to make them practical; the pedagogy and design taxonomy have
only become clear to me over time. With that, let&#x27;s jump in!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;initial-context-fixpoint-loops-and-the-pass-ordering-problem&quot;&gt;Initial context: Fixpoint Loops and the Pass-Ordering Problem&lt;&#x2F;h2&gt;
&lt;p&gt;Around May of 2022, I had introduced a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;4163&quot;&gt;simple alias analysis and
related
optimizations&lt;&#x2F;a&gt;
(removing redundant loads, and doing store-to-load forwarding). It
worked fine on all of the expected test cases, and we saw real speedup
on a few benchmarks (e.g. 5% on &lt;code&gt;meshoptimizer&lt;&#x2F;code&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;4163#issuecomment-1130829170&quot;&gt;here&lt;&#x2F;a&gt;)
but led to a new question as well: how should we integrate this pass
with our other optimization passes, which at the time included GVN
(global value numbering), LICM (loop-invariant code motion), constant
propagation and some algebraic rewrites?&lt;&#x2F;p&gt;
&lt;p&gt;To see why this is an interesting question, consider how GVN, which
canonicalizes values, and redundant load elimination interact, on the
following IR snippet:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v2 = load.i64 v0+8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v3 = iadd v2, v1   ;; e.g., array indexing&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v4 = load.i8 v3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; ... (no stores or other side effects here) ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v10 = load.i64 v0+8&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v11 = iadd v10, v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v12 = load.i8 v11&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Redundant load elimination (RLE) will be able to see that the load
defining &lt;code&gt;v10&lt;&#x2F;code&gt; can be removed, and &lt;code&gt;v10&lt;&#x2F;code&gt; can be made an alias of
&lt;code&gt;v2&lt;&#x2F;code&gt;, in a single pass. In a perfect world, we should then be able to
see that &lt;code&gt;v11&lt;&#x2F;code&gt; becomes the same as &lt;code&gt;v3&lt;&#x2F;code&gt; by means of GVN&#x27;s
canonicalization, and subsequently, &lt;code&gt;v12&lt;&#x2F;code&gt; becomes an alias of
&lt;code&gt;v4&lt;&#x2F;code&gt;. But those last two steps imply a tight cooperation between two
different optimization passes: we need to run one full pass of RLE
(result: &lt;code&gt;v10&lt;&#x2F;code&gt; rewritten), then one full pass of GVN (result: &lt;code&gt;v11&lt;&#x2F;code&gt;
rewritten), then one additional full pass of RLE (result: &lt;code&gt;v12&lt;&#x2F;code&gt;
rewritten). One can see that an arbitrarily long chain of such
reasoning steps, bouncing through different passes, might require an
arbitrarily long sequence of pass invocations to fully simplify. Not
good!&lt;&#x2F;p&gt;
&lt;p&gt;This is known as the &lt;em&gt;pass-ordering problem&lt;&#x2F;em&gt; in the study of compilers
and is a classical heuristic question with no easy answers as long as
the passes remain separate, coarse-grained algorithms (i.e., not
interwoven). To permit some interesting cases to work in the initial
Cranelift integration of alias analysis-based rewrites, I made a
somewhat &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;4163&#x2F;changes#diff-c7a66b91ac03843c5aafe984938022ccba235c80c3fad786772964dc7b9da152R166-R170&quot;&gt;ad-hoc
choice&lt;&#x2F;a&gt;
to invoke GVN once after the alias-analysis rewrite pass.&lt;&#x2F;p&gt;
&lt;p&gt;But this is clearly arbitrary and wastes compilation effort in the
common case; we should be able to do better. In general, the solution
should reason about all passes&#x27; possible rewrites in a unified
framework, and interleave them in a fine-grained way: so, for example,
if we can apply RLE then GVN five times in a row just for one
localized expression, we should be able to do that, without running
each of these passes on the whole function body. In other words, we
want a &quot;single fixpoint loop&quot; that iterates until optimization is done
at a fine granularity.&lt;&#x2F;p&gt;
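&lt;p&gt;To preview what &quot;fine granularity&quot; means here, consider a toy fixpoint loop (illustrative Python, not Cranelift code; the two rule functions are hypothetical stand-ins for cprop and an algebraic rewrite) that keeps applying local rules to one expression until none fire:&lt;&#x2F;p&gt;

```python
# A toy "single fixpoint loop" (illustrative; not Cranelift code).
# Expressions are nested tuples: ("const", 3), ("var", "x"),
# ("add", lhs, rhs). The point is that rules interleave at the level
# of a single expression rather than as separate whole-function passes.

def cprop(node):
    # constant-fold: a + b where both operands are constants
    if node[0] == "add" and node[1][0] == "const" and node[2][0] == "const":
        return ("const", node[1][1] + node[2][1])
    return None

def add_zero(node):
    # algebraic identity: x + 0 -> x
    if node[0] == "add" and node[2] == ("const", 0):
        return node[1]
    return None

RULES = [cprop, add_zero]

def simplify(node):
    # Simplify operands first, then apply rules here until none fire:
    # a chain like "cprop enables add_zero" resolves in one traversal.
    if node[0] == "add":
        node = ("add", simplify(node[1]), simplify(node[2]))
    changed = True
    while changed:
        changed = False
        for rule in RULES:
            rewritten = rule(node)
            if rewritten is not None:
                node = rewritten
                changed = True
    return node
```

&lt;p&gt;For example, &lt;code&gt;simplify&lt;&#x2F;code&gt; reduces &lt;code&gt;x + (1 + -1)&lt;&#x2F;code&gt; to &lt;code&gt;x&lt;&#x2F;code&gt;: constant folding of the inner add enables the algebraic rule, with no pass-level iteration in between.&lt;&#x2F;p&gt;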
&lt;h2 id=&quot;three-building-blocks-rewrites-code-motion-and-canonicalization&quot;&gt;Three Building Blocks: Rewrites, Code Motion, and Canonicalization&lt;&#x2F;h2&gt;
&lt;p&gt;Let&#x27;s review the optimizations we had at this point:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;GVN (global value numbering), which is a &lt;em&gt;canonicalization&lt;&#x2F;em&gt;
operation: within a given scope where a value is defined (for SSA
IRs, the subtree of the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Dominator_(graph_theory)&quot;&gt;dominance
tree&lt;&#x2F;a&gt; below
a given definition), any identical computations of that value should
be canonicalized to the original one.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;LICM (loop-invariant code motion), which is a &lt;em&gt;code-motion&lt;&#x2F;em&gt;
operation: computations that occur within a loop, but whose value is
guaranteed to be the same on each iteration, should be moved
out. Loop invariance can be defined recursively: a value is
loop-invariant if it is defined outside the loop, or if it is a pure
operator inside the loop whose arguments are all loop-invariant. The
transform doesn&#x27;t change any operators; it only moves where they
occur.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Constant propagation (cprop) and algebraic rewrites: these are
transforms like rewriting &lt;code&gt;1 + 2&lt;&#x2F;code&gt; to &lt;code&gt;3&lt;&#x2F;code&gt; (cprop) or &lt;code&gt;x + 0&lt;&#x2F;code&gt; to &lt;code&gt;x&lt;&#x2F;code&gt;
(algebraic). They can all be expressed as substitutions for
expressions that match a given pattern.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Redundant load elimination and store-to-load forwarding: these both
replace &lt;code&gt;load&lt;&#x2F;code&gt; operators with the SSA value that operator is known
to load.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;And one that we wanted to implement: &lt;em&gt;rematerialization&lt;&#x2F;em&gt;, which
reduces register pressure for values that are easier to recompute on
demand (e.g., integer constants) by re-defining them with a new
computation. This can be seen as a kind of code motion as well.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;As a start to thinking about frameworks, we can categorize the above
into &lt;em&gt;code motion&lt;&#x2F;em&gt;, &lt;em&gt;canonicalization&lt;&#x2F;em&gt;, and &lt;em&gt;rewrites&lt;&#x2F;em&gt;. Code motion is
what it sounds like: it involves moving where a computation occurs,
but not changing it otherwise. Canonicalization is the unifying of
more than one instance of a computation into one (&quot;canonical&quot;)
instance. And rewrites are any optimization that replaces one
expression with another that should compute the same value. Said more
intuitively (and colloquially), these three categories attempt to
cover the whole space of possibilities for &quot;simple&quot; optimizations: one
can move code, merge identical code, or replace code with equivalent
code. (The notable missing possibility here is the ability to change
control flow and&#x2F;or make use of control-flow-related reasoning; more
on that in a later section.) Thus, if we can build a framework that
handles these kinds of transforms, we should have a good
infrastructure for the next steps in Cranelift&#x27;s evolution.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;ir-design-sea-of-nodes-and-intermediate-points&quot;&gt;IR Design, Sea-of-Nodes, and Intermediate Points&lt;&#x2F;h2&gt;
&lt;p&gt;From first principles, one might ask: how &lt;em&gt;should&lt;&#x2F;em&gt; a unifying
framework for these concerns look? Code motion and canonicalization
together imply that perhaps computations (operator nodes) should &lt;em&gt;not&lt;&#x2F;em&gt;
have a &quot;location&quot; in the program, whenever that can be avoided. In
other words, perhaps we should find a way to represent &lt;code&gt;add v1, v2&lt;&#x2F;code&gt; in
our IR without putting it somewhere concrete in the control flow. Then
all instances of that same computation would be merged (because
duplicates would differ only by their location, which we removed), and
code motion is... inapplicable, because code does not have a location?&lt;&#x2F;p&gt;
&lt;p&gt;Well, not quite: the idea is that one &lt;em&gt;starts&lt;&#x2F;em&gt; with a conventional IR
(with control flow), and &lt;em&gt;ends&lt;&#x2F;em&gt; with it too, but &lt;em&gt;in the middle&lt;&#x2F;em&gt; one
can eliminate locations where possible. So in the transition &lt;em&gt;to&lt;&#x2F;em&gt; this
representation, we erase locations, and canonicalize; and in the
transition &lt;em&gt;from&lt;&#x2F;em&gt; this representation, we re-assign locations, and
code-motion can be a side-effect of how we do that.&lt;&#x2F;p&gt;
&lt;p&gt;What we just described above is called a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sea_of_nodes&quot;&gt;sea-of-nodes&lt;&#x2F;a&gt; IR.  A
sea-of-nodes IR is one that dispenses with a classical &quot;sequential
order&quot; for all instructions or operators in the program, and instead
builds a graph (the &quot;sea&quot;) of operators (the &quot;nodes&quot;) with edges to
denote the actual dependencies, either for dataflow or control flow.&lt;&#x2F;p&gt;
&lt;p&gt;In the purest form of this design, one can represent &lt;em&gt;every IR
transform&lt;&#x2F;em&gt; as a graph rewrite, because a graph is all there is. For
example, LICM, a kind of code motion that hoists a computation out of
a loop, is a purely algebraic rewrite on the subgraph representing the
loop body. This is because the loop itself is a kind of node in the
sea of nodes, with control-flow edges like any other edge; code motion
is not a &quot;special&quot; action outside the scope of the expression
language (nodes and their operands).&lt;&#x2F;p&gt;
&lt;p&gt;While that kind of flexibility is tempting, it comes with a
significant complexity tax as well: it means that reasoning through
and implementing classical compiler analyses and transforms is more
difficult, at least for existing compiler engineers with their
experience, because the IR is so different from the classical data
structure (CFG of basic blocks). The V8 team &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;v8.dev&#x2F;blog&#x2F;leaving-the-sea-of-nodes&quot;&gt;wrote about this
difficulty&lt;&#x2F;a&gt; recently as
support for their decision to migrate away from a pure Sea-of-Nodes
representation.&lt;&#x2F;p&gt;
&lt;p&gt;However, we might achieve some progress toward our goal -- providing a
general framework for rewrites, code motion and canonicalization -- if
we take inspiration from sea-of-nodes&#x27; handling of &lt;em&gt;pure&lt;&#x2F;em&gt;
(side-effect-free) operators, and the way that they can &quot;float&quot; in the
sea, unmoored by any anchor other than actual inputs and outputs
(dataflow edges). Stated succinctly: what if we kept the CFG for the
side-effectful instructions (call it the &quot;side-effect skeleton&quot;) and
used a sea-of-nodes for the rest?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2026-03-29-cfg-skeleton-web.svg&quot; alt=&quot;Figure: sea-of-nodes-with-CFG&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This would allow us to unify code motion, canonicalization and
rewrites, as described above: canonicalization works on pure
operators, because we remove distinctions based on location;
code-motion can occur when we put pure operators back in the CFG; and
rewrites can occur on pure operators. In fact rewrites are now both
(i) simpler to reason about, because we don&#x27;t have to place expression
nodes at locations in an IR, only create them &quot;floating in the air&quot;,
and (ii) more efficient, because they occur once on a canonicalized
instance of an expression, rather than all instances separately.&lt;&#x2F;p&gt;
&lt;p&gt;We&#x27;ll call this representation a &quot;sea-of-nodes with CFG&quot;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;implementing-sea-of-nodes-with-cfg&quot;&gt;Implementing Sea-of-Nodes-with-CFG&lt;&#x2F;h3&gt;
&lt;p&gt;Now, to practical implementation: architecting the entire compiler
around sea-of-nodes for pure operators might make sense from first
principles, but as a modification of the existing Cranelift compiler
pipeline, we would not want to (or be able to) make such a radical
change in one step. Rather, I wanted to build this as a replacement
for the mid-end, taking CLIF (our conventional CFG-based SSA IR) as
input and producing CLIF as output. So we need a three-stage
optimizer:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Lift all pure operators out of the CFG, leaving behind the
skeleton. Put these operators into the &quot;sea&quot; of pure computation
nodes, deduplicating
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Hash_consing&quot;&gt;hash-consing&lt;&#x2F;a&gt;) as we
go.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Perform rewrites on these operators, replacing some values with
others according to whatever rules we have that preserve value
equivalence.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Convert this sea-of-pure-nodes back to sequential IR by scheduling
nodes into the CFG. We&#x27;ll call this process &quot;elaboration&quot; of the
computations.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;This is in fact how the heart of Cranelift&#x27;s mid-end now works; we&#x27;ll
go through each part above in turn.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;into-sea-of-nodes-with-cfg-canonicalization&quot;&gt;Into Sea-of-Nodes-with-CFG: Canonicalization&lt;&#x2F;h3&gt;
&lt;p&gt;Let&#x27;s talk about how we get &lt;em&gt;into&lt;&#x2F;em&gt; the sea-of-nodes representation
first. The most straightforward answer, of course, would be to simply
&quot;remove the nodes from the CFG&quot; and let them free-float, referenced by
their uses that remain in the skeleton -- and that&#x27;s it. But that
gives up on the obvious opportunity offered by the fact that these
operators are &lt;em&gt;pure&lt;&#x2F;em&gt; (have no side-effects, or implicit dependencies
on the rest of the world): an operator &lt;code&gt;op v1, v2&lt;&#x2F;code&gt; &lt;em&gt;always&lt;&#x2F;em&gt; produces
the same value given the same inputs, and two separate instances of
this node have no distinguishing features or other properties that
should lead to different results. Hence, we should canonicalize, or
&lt;em&gt;hash-cons&lt;&#x2F;em&gt;, nodes.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Hash_consing&quot;&gt;Hash-consing&lt;&#x2F;a&gt; is a
standard technique in systems that have value- or operator-nodes: the
idea is to keep a lookup table indexed by the contents of each value
or operator, perform lookups in this table when creating a new node,
and reuse existing nodes when a match occurs.&lt;&#x2F;p&gt;
&lt;p&gt;What is the &lt;em&gt;equivalence class&lt;&#x2F;em&gt; by which we deduplicate? (In other
words, more concretely, how do we define &lt;code&gt;Eq&lt;&#x2F;code&gt; and &lt;code&gt;Hash&lt;&#x2F;code&gt; on
sea-of-nodes values?) We adopt a very simple answer (and deal with
subtleties later, as is often the case!): the (shallow) content of a
given node is its identity. In other words, if we have &lt;code&gt;iadd v1, v2&lt;&#x2F;code&gt;,
then that is &quot;equal to&quot; (deduplicates with) any other such operator.&lt;&#x2F;p&gt;
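&lt;p&gt;Concretely, a minimal hash-consing table fits in a few lines of Python (illustrative only; Cranelift&#x27;s actual node representation differs). A tuple of opcode plus operand value numbers gives us shallow structural equality and hashing for free, so a plain dict serves as the table:&lt;&#x2F;p&gt;

```python
# Minimal hash-consing sketch (illustrative; not Cranelift's real data
# structure). A node's identity is its shallow content: the opcode plus
# the value numbers of its operands.

hashcons = {}    # (opcode, *operand value numbers) -> value number
next_value = 0

def intern(opcode, *args):
    # Return the canonical value number for this shallow node,
    # allocating a fresh one only if no equal node exists yet.
    global next_value
    key = (opcode, *args)
    if key not in hashcons:
        hashcons[key] = next_value
        next_value += 1
    return hashcons[key]
```

&lt;p&gt;Interning &lt;code&gt;iadd v0, v1&lt;&#x2F;code&gt; a second time returns the same value number as the first, while &lt;code&gt;iadd v1, v0&lt;&#x2F;code&gt; (different shallow content) gets its own.&lt;&#x2F;p&gt;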
&lt;p&gt;Now, this shallow notion of equality may not seem like enough to
canonicalize all instances of the same expression tree. Consider if we
had&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v0 = ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v1 = ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v2 = iadd v0, v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v3 = iconst 42&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v4 = imul v2, v3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v5 = iadd v0, v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v6 = iconst 42&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v7 = imul v5, v6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Clearly any reasonable canonicalization algorithm should consider &lt;code&gt;v4&lt;&#x2F;code&gt;
and &lt;code&gt;v7&lt;&#x2F;code&gt; to be the same, and condense uses of them into uses of one
canonical node. But the nodes are not &lt;em&gt;shallowly&lt;&#x2F;em&gt; equal. How do we
get from here to there?&lt;&#x2F;p&gt;
&lt;p&gt;One possible answer is induction: we could canonicalize a node only
after all of its operands have been canonicalized (and rewritten), so
we know that if subtrees are identical, we will have identical value
numbers. Thus, inductively, all values would be canonicalized deeply.&lt;&#x2F;p&gt;
&lt;p&gt;This requires processing each node&#x27;s definition before its uses,
however. Fortunately, the SSA CFG from which we are constructing the
sea-of-nodes-with-CFG provides us this property already if we traverse
it in a particular order: we need to visit blocks in the control-flow
graph in some &lt;em&gt;preorder&lt;&#x2F;em&gt; of the dominance tree (domtree), which we
usually have available already.&lt;&#x2F;p&gt;
&lt;p&gt;So we have an algorithm something like the following pseudo-code to
canonicalize the SSA CFG into a sea-of-nodes-with-CFG:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def canonicalize(basic_block):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for inst in basic_block:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if is_pure(inst):                 # only dedup and move to sea-of-nodes for &amp;quot;pure&amp;quot; insts;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                      # leave the &amp;quot;skeleton&amp;quot; in place&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      basic_block.remove(inst)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      inst.rename_values(rename_map)  # rewrite uses according to a value-&amp;gt;value map&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      if inst in hashcons_map:        # equality defined by shallow content&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        rename_map[inst.value] = hashcons_map[inst]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        nodes.push(inst)              # add to the sea-of-nodes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        hashcons_map[inst] = inst.value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      # we still need to rename the CFG skeleton&amp;#39;s uses to refer to sea-of-nodes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      inst.rename_values(rename_map)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  # recursive domtree-preorder traversal.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for child in domtree.children(basic_block):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    canonicalize(child)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This will handle not only the above example, where we have &quot;deep
equality&quot; (because we will canonicalize and rename e.g. &lt;code&gt;v5&lt;&#x2F;code&gt; into &lt;code&gt;v2&lt;&#x2F;code&gt;
before visiting &lt;code&gt;v5&lt;&#x2F;code&gt;&#x27;s use), but also more complex examples with the
redundancies spread across basic blocks.&lt;&#x2F;p&gt;
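&lt;p&gt;A runnable (if heavily simplified) rendering of the pass above, restricted to a single straight-line block of pure, single-result instructions, might look like the following; the tuple encoding of instructions is invented for illustration, and immediates are folded into the opcode string for brevity.&lt;&#x2F;p&gt;

```python
# Simplified runnable version of the canonicalization pass for one
# straight-line block of pure, single-result instructions (illustrative;
# real CLIF has a CFG, impure instructions, and multi-result ops).
# Each instruction is (defined_value, opcode, [argument_values]).

def canonicalize_block(insts):
    rename_map = {}    # old value -> canonical value
    hashcons_map = {}  # (opcode, *canonical args) -> canonical value
    sea = []           # deduplicated nodes lifted into the "sea"
    for value, opcode, args in insts:
        # Rewrite uses first: operands are already canonical, so the
        # shallow equality below is deep equality by induction.
        args = [rename_map.get(a, a) for a in args]
        key = (opcode, *args)
        if key in hashcons_map:
            rename_map[value] = hashcons_map[key]
        else:
            sea.append((value, opcode, args))
            hashcons_map[key] = value
    return sea, rename_map
```

&lt;p&gt;On the &lt;code&gt;v2&lt;&#x2F;code&gt;..&lt;code&gt;v7&lt;&#x2F;code&gt; example above, this renames &lt;code&gt;v5&lt;&#x2F;code&gt; to &lt;code&gt;v2&lt;&#x2F;code&gt; and &lt;code&gt;v6&lt;&#x2F;code&gt; to &lt;code&gt;v3&lt;&#x2F;code&gt;, so &lt;code&gt;v7&lt;&#x2F;code&gt; becomes shallowly equal to &lt;code&gt;v4&lt;&#x2F;code&gt; and deduplicates with it.&lt;&#x2F;p&gt;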
&lt;p&gt;Finally: how does the &quot;-with-CFG&quot; aspect of all of this work? So far, we
have very much glossed over any values that are defined in the CFG
skeleton, other than to imply above that they are never renamed (because
we never take the &lt;code&gt;is_pure&lt;&#x2F;code&gt; branch). But is this OK?&lt;&#x2F;p&gt;
&lt;p&gt;Yes, in a sense, by construction: we have defined all impure values to
have their own &quot;identity&quot;, distinct from any other such value, even if
shallowly equal at a syntactic level. This aligns with the notion that
impure computations have implicit inputs: for example, &lt;code&gt;load v0&lt;&#x2F;code&gt;
appearing twice in the program may produce different values at those
two different times, so we cannot deduplicate it. This can be relaxed
if we have a dedicated analysis that can reason about such implicit
dependencies, and in fact for loads we do have one (alias analysis,
feeding into redundant-load elimination and store-to-load
forwarding). But in general, we cannot do anything with these
&quot;roots&quot;. Rather, they stay in the skeleton, feed values into the sea
of nodes, and consume values back out of that sea of nodes.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;out-of-sea-of-nodes-with-cfg-scoped-elaboration&quot;&gt;Out of Sea-of-Nodes-with-CFG: Scoped Elaboration&lt;&#x2F;h3&gt;
&lt;p&gt;Given a sea-of-nodes + skeleton representation of a program, how do we
go back to a conventional CFG, with fully linearized operators (i.e.,
each of which has a concrete program-point where it is computed), to
feed to the compiler backend and lower to machine code?&lt;&#x2F;p&gt;
&lt;p&gt;The basic task is to decide a location at which to put each
operator. Since nodes in the sea-of-nodes are &quot;rooted&quot; (referenced and
ultimately computed&#x2F;used) by side-effectful operators in the CFG
skeleton, the first idea one might have is to copy pure nodes back
into the CFG where they are referenced. One could do this recursively:
if e.g. we have a side-effecting instruction &lt;code&gt;store v1, v2&lt;&#x2F;code&gt;, we can
place the (pure operator) definitions of &lt;code&gt;v1&lt;&#x2F;code&gt; and &lt;code&gt;v2&lt;&#x2F;code&gt; just before
this instruction; if those definitions require other values, likewise
compute them first. We could call this &quot;elaboration&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s consider the single-basic-block case first and then define
something like the following pseudocode:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def demand_based_elaboration(bb):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for inst in bb:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    elaborate_inst(inst, bb, before=inst)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate_inst(inst, bb, before):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for value in inst.args:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    inst.rewrite_arg(value, elaborate_value(value, bb, before=before))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if is_pure(inst):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    bb.insert_before(before, inst)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return inst.def&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate_value(value, bb, before):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if defined_by_inst(value):   # some values are blockparam roots, not inst defs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return elaborate_inst(value.inst, bb, before)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This would certainly work, but is far too simple: it &lt;em&gt;duplicates&lt;&#x2F;em&gt;
computation every time a value is used, and no value (other than
blockparam roots) is ever used more than once. This will almost
certainly result in extreme blowup in program size!&lt;&#x2F;p&gt;
&lt;p&gt;So if we use a value multiple times, it seems that we should compute
it &lt;em&gt;once&lt;&#x2F;em&gt;, some place in the program before any of the uses. For
example, perhaps we could augment the above algorithm with a map that
records the resulting value number the first time we elaborate a node,
and reuses it (i.e., memoizes the elaboration):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate_value(value, bb, before):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if value in elaborated:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return elaborated[value]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  elif defined_by_inst(value):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    result = elaborate_inst(value.inst, bb, before)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    elaborated[value] = result&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return result&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This modified algorithm will handle the case of a single block with
reuse efficiently, computing a value the first time it is used (&quot;on
demand&quot;) as expected.&lt;&#x2F;p&gt;
&lt;p&gt;Now let&#x27;s consider &lt;em&gt;multiple&lt;&#x2F;em&gt; basic blocks. One might be tempted to
wrap the above with a traversal, as we did for the translation into
sea-of-nodes:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate_domtree(bb):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  demand_based_elaboration(bb)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for child in domtree.children(bb):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    elaborate_domtree(child)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate(func):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  elaborate_domtree(func.entry)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But this, too, has an issue. Consider a program that began as a CFG
with many paths, two of which compute the same value:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2026-03-29-cfg-web.svg&quot; alt=&quot;Figure: CFG with some redundancy between code paths&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;If we define some traversal over all basic blocks to perform an
elaboration as above, with a single map &lt;code&gt;elaborated&lt;&#x2F;code&gt;, we will&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Elaborate a computation of &lt;code&gt;v2&lt;&#x2F;code&gt; in &lt;code&gt;bb2&lt;&#x2F;code&gt; and use it there;&lt;&#x2F;li&gt;
&lt;li&gt;Use it in &lt;code&gt;bb3&lt;&#x2F;code&gt; as well in place of &lt;code&gt;v3&lt;&#x2F;code&gt;, since it has already been
computed and is thus memoized;&lt;&#x2F;li&gt;
&lt;li&gt;And thus generate &lt;em&gt;invalid SSA&lt;&#x2F;em&gt;, where a value is used on a path
where it is never computed!&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Perhaps we could hoist the computation to a &quot;common ancestor&quot; of all
of its uses instead. Here that would be &lt;code&gt;bb1&lt;&#x2F;code&gt;. But that creates yet
another problem: if control flows from &lt;code&gt;bb1&lt;&#x2F;code&gt; to &lt;code&gt;bb4&lt;&#x2F;code&gt;, then we will
have computed the value and never used it -- in supposedly optimized
code! This is sometimes called a &quot;partial redundancy&quot;: a computation
that is sometimes unused, depending on control flow. We would like to
avoid this if possible.&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that this problem exactly corresponds to &lt;em&gt;common
subexpression elimination&lt;&#x2F;em&gt; (CSE), which aims to find one place to
compute a value possibly used multiple times. The usual approach in
SSA code, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Value_numbering&quot;&gt;global value
numbering&lt;&#x2F;a&gt; (GVN),
solves the problem by reasoning about &lt;em&gt;scopes&lt;&#x2F;em&gt;, where a &quot;scope&quot; is the
region in which a value has already been computed. The intuition is
that at any given use, we can cast a &quot;shadow&quot; downward and remove
redundant uses but only in that shadow. So in our example program, if
&lt;code&gt;bb1&lt;&#x2F;code&gt; computed &lt;code&gt;v2&lt;&#x2F;code&gt; then we could reuse it in &lt;code&gt;bb2&lt;&#x2F;code&gt; and &lt;code&gt;bb3&lt;&#x2F;code&gt;; but
because it occurs independently in two subtrees with no common
ancestor that computes it, we do nothing; we &lt;em&gt;duplicate it&lt;&#x2F;em&gt; (re-elaborate it).&lt;&#x2F;p&gt;
&lt;p&gt;SSA &quot;scopes&quot; -- regions in which a value can be used -- are defined by
the dominance relation, and so we can work with a domtree traversal to
implement the needed behavior. Concretely, we can do a domtree
preorder traversal; we can keep the &lt;code&gt;elaborated&lt;&#x2F;code&gt; map but separate it
into scope &quot;overlays&quot;, and push a new overlay for each subtree. This
formalizes the &quot;shadow&quot; intuition above. We call this &lt;em&gt;scoped
elaboration&lt;&#x2F;em&gt;. Pseudo-code follows:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def find_in_scope(value, scope):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if value in scope.map:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return scope.map[value]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  elif scope.parent:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return find_in_scope(value, scope.parent)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return None&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate_value(value, bb, before, scope):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if find_in_scope(value, scope):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate_domtree(bb, scope):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  demand_based_elaboration(bb, scope)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for child in domtree.children(bb):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    subscope = { map = {}, parent = scope }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    elaborate_domtree(child, subscope)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def elaborate(func):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  root_scope = { map = {}, parent = None }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  elaborate_domtree(func.entry, root_scope)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;scoped_hash_map.rs&quot;&gt;real
implementation&lt;&#x2F;a&gt;
of our scoped hashmap takes advantage of the fact that keys will not
overlap between overlay layers (because once defined, a value will not
be re-defined in a lower layer), and this enables us to have true O(1)
rather than O(depth) lookup using some tricks with a &lt;code&gt;layer&lt;&#x2F;code&gt; number
and &lt;code&gt;generation&lt;&#x2F;code&gt;-per-layer (see implementation for
details!). Nevertheless, the semantics are the same as above.&lt;&#x2F;p&gt;
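&lt;p&gt;To make that trick concrete, here is a sketch (my simplified model, not
the exact real implementation): each entry remembers the depth and
generation at which it was inserted, and popping a level bumps that
level&#x27;s generation, so stale entries are ignored without walking a
parent chain:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def insert(map, key, value):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  map.table[key] = (map.depth, map.generation[map.depth], value)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def lookup(map, key):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if key in map.table:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (depth, gen, value) = map.table[key]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if gen == map.generation[depth]:  # entry from a still-live layer?&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      return value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return None&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def push_level(map):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  map.depth = map.depth + 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def pop_level(map):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  map.generation[map.depth] = map.generation[map.depth] + 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  map.depth = map.depth - 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;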
&lt;p&gt;As we foreshadowed above, just as the problem is closely related to
CSE and GVN, scoped elaboration is as well. In fact, the approach of
tracking a definition-within-scope for scopes that correspond to
subtrees in the domtree, given a preorder traversal on the domtree, is
exactly how &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;465913eb2c91998c99ae9222e47f8e9f9a88a546&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;simple_gvn.rs&quot;&gt;Cranelift&#x27;s old
implementation&lt;&#x2F;a&gt;
works as well. We even borrowed the scoped hashmap implementation!&lt;&#x2F;p&gt;
&lt;p&gt;A few more observations are in order. First, it&#x27;s fairly interesting
that we sometimes &lt;em&gt;re-elaborate&lt;&#x2F;em&gt; a node into multiple dom subtrees;
why is this? Does this introduce inefficiency (e.g. in code size) or
is it the best we can do?&lt;&#x2F;p&gt;
&lt;p&gt;The duplication is, in my opinion, best seen as a &lt;em&gt;dual&lt;&#x2F;em&gt; of the
canonicalization. The original code may have multiple copies of a pure
computation in multiple paths, with no common ancestor that computes
that value. When translating to sea-of-nodes, we will canonicalize
that computation, so we can optimize it once. But then when returning
to the original linearized IR, we may need to restore the &lt;em&gt;original&lt;&#x2F;em&gt;
duplication if there truly was no (non-redundancy-producing)
optimization opportunity. Additionally, and very importantly: we
should never elaborate a value in more than one place unless it &lt;em&gt;also&lt;&#x2F;em&gt;
appeared more than once in the original program. So we should not
grow the program size beyond the original.&lt;&#x2F;p&gt;
&lt;p&gt;Another interesting observation is that by driving elaboration by
demand (from the roots in the side-effecting CFG skeleton), we do
dead-code elimination (DCE) of the pure operations &lt;em&gt;for free&lt;&#x2F;em&gt;. Their
existence in the sea of nodes may cost us some compile time if we
spend effort to optimize them (only to throw them away later); but
anything that becomes dead &lt;em&gt;because of&lt;&#x2F;em&gt; rewrites in sea-of-nodes will
then naturally disappear from the final result.&lt;&#x2F;p&gt;
&lt;p&gt;A third observation is that elaboration gives us a central location to
control when and where code is placed in the final program. In other
words, there is room for us to add &lt;em&gt;heuristics&lt;&#x2F;em&gt; beyond the simplest
version of the algorithm described above. For example: we stated that
we did not want to introduce any partial redundancies. But for
correctness, we don&#x27;t &lt;em&gt;need&lt;&#x2F;em&gt; to adhere to this: our only real
restriction is that a pure computation cannot happen before its
arguments are computed (i.e., we have to obey dataflow dependencies).
So, for example, if we have the &lt;em&gt;loop nest&lt;&#x2F;em&gt; (structure of loops in the
program) available, and a pure computation within a loop does not use
any values that are computed within that loop, we know it is
&lt;em&gt;loop-invariant&lt;&#x2F;em&gt; and we may choose to elaborate it before the loop
begins (into the &quot;preheader&quot;), in a transform known as &lt;em&gt;loop-invariant
code motion&lt;&#x2F;em&gt; (LICM). This is redundant if the loop executes zero
iterations, but most loops execute at least once; and performing a
loop-invariant computation only once can be a huge efficiency
improvement.&lt;&#x2F;p&gt;
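&lt;p&gt;A minimal sketch of such a heuristic, assuming a precomputed loop nest
with preheaders (the names here are illustrative, not Cranelift&#x27;s actual
API): hoist a pure node to the preheader of the outermost enclosing loop
that defines none of its arguments:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def licm_target(node, bb):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  target = bb&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  loop = innermost_loop(bb)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  while loop is not None:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if any(defined_in_loop(arg, loop) for arg in node.args):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      break  # an argument is computed inside this loop: stop hoisting&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    target = loop.preheader&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    loop = parent_loop(loop)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return target&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;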
&lt;p&gt;In the other direction -- pushing computation downward rather than
upward -- we could choose to implement &lt;em&gt;rematerialization&lt;&#x2F;em&gt; by
strategically &lt;em&gt;forgetting&lt;&#x2F;em&gt; a value in the already-elaborated scope and
recomputing it at a new use. Why would we do this? Perhaps it is
cheaper to recompute than to &lt;em&gt;thread the original value through the
program&lt;&#x2F;em&gt;. For example, constant values are very cheap to &quot;compute&quot;
(typically 1 or 2 instructions) but burning a machine register to keep
a constant across a long function can be expensive.&lt;&#x2F;p&gt;
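&lt;p&gt;As a sketch (a hypothetical heuristic, not the exact one Cranelift
uses), elaboration could consult a predicate before reusing an
already-in-scope value:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def use_or_remat(value, use_bb, scope):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  existing = find_in_scope(value, scope)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if existing is None:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return None  # fall back to normal elaboration&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if is_cheap_to_recompute(value) and far_from(existing, use_bb):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # &quot;forget&quot; the prior copy and recompute locally&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return elaborate_fresh_copy(value, use_bb)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return existing&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;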
&lt;p&gt;There is a lot of room for heuristic &lt;em&gt;code scheduling&lt;&#x2F;em&gt; within
elaboration as well (LICM and rematerialization can be seen as
scheduling too, but here I mean the order that operations are
linearized within the block they are otherwise elaborated into). For a
modern
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Out-of-order_execution&quot;&gt;out-of-order&lt;&#x2F;a&gt;
CPU, this may not matter too much to the hardware -- but it &lt;em&gt;may&lt;&#x2F;em&gt;
matter to the register allocator, because reordering instructions
changes the &quot;interference graph&quot;, or the way that different live
register values compete for finite resources (hardware
registers). E.g., pushing an instruction that uses many values for the
last time &quot;earlier&quot; (to eliminate the need to store those values) is
great; but this minimization is not always straightforward.  In fact,
ordering instructions that define and use values to minimize the
coloring count for the resulting live-range interference graph is an
NP-complete problem. So it goes, too often, in compiler engineering!&lt;&#x2F;p&gt;
&lt;p&gt;Despite the complexities that may arise in combining many heuristics,
these three dimensions -- LICM, rematerialization, and code scheduling
for register pressure -- are an interesting high-dimensional cost
optimization problem and one that we still haven&#x27;t fully solved (see
e.g. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;6159&quot;&gt;#6159&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;6260&quot;&gt;#6260&lt;&#x2F;a&gt; and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;8959&quot;&gt;#8959&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;optimizing-pure-expression-nodes-rewrite-framework&quot;&gt;Optimizing Pure Expression Nodes: Rewrite Framework&lt;&#x2F;h3&gt;
&lt;p&gt;We&#x27;ve covered the transitions into and out of the
sea-of-nodes-with-CFG program representation. We&#x27;ve seen how merely
this translation gives us GVN (deduplication), DCE, LICM, and
rematerialization &quot;for free&quot; (not really free, but falling out as a
natural consequence of the algorithms). But we still haven&#x27;t covered
one of the most classical sets of optimizations: algebraic (and other)
rewrites from one expression to another equivalent one (e.g., &lt;code&gt;x+0&lt;&#x2F;code&gt; to
&lt;code&gt;x&lt;&#x2F;code&gt;). How can we do this on the sea-of-nodes?&lt;&#x2F;p&gt;
&lt;p&gt;In principle, the answer is as &quot;simple&quot; as: build the logic that
pattern-matches the &quot;left-hand side&quot; of a rewrite (the part that we
have a &quot;better&quot; equivalent expression for), and then replaces it with
the &quot;right-hand side&quot;. That is, in &lt;code&gt;x + 0 -&amp;gt; x&lt;&#x2F;code&gt;, the left-hand side is
&lt;code&gt;x + 0&lt;&#x2F;code&gt; and the right-hand side is &lt;code&gt;x&lt;&#x2F;code&gt;. Such a framework is highly
amenable to a &lt;em&gt;domain-specific language&lt;&#x2F;em&gt; to express these rewrites:
ideally one doesn&#x27;t want to write code that manually iterates through
nodes to find these patterns. Fortunately for us, in the Cranelift
project we have the &lt;em&gt;ISLE&lt;&#x2F;em&gt; (instruction-selection and
lowering-expressions) DSL
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;27&quot;&gt;RFC&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;isle&#x2F;docs&#x2F;language-reference.md&quot;&gt;language
reference&lt;&#x2F;a&gt;,
&lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;01&#x2F;20&#x2F;cranelift-isle&#x2F;&quot;&gt;blog post&lt;&#x2F;a&gt;). I originally designed
ISLE in the context of instruction lowering, as the name implies, but
I was careful to keep a separation between the core language and its
&quot;prelude&quot; binding it to a particular environment. Hence we could adapt
it fairly easily to rewrite a graph of Cranelift IR operators as
well. The idea is that, as in instruction lowering, for mid-end
optimizations we invoke an ISLE constructor (entry point) on a
&lt;em&gt;particular&lt;&#x2F;em&gt; node and the ruleset produces a possibly better node.&lt;&#x2F;p&gt;
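&lt;p&gt;For a flavor of what such a rule looks like -- the syntax and extractor
names below are approximate, not copied verbatim from Cranelift&#x27;s actual
ruleset -- an &lt;code&gt;x + 0 -&amp;gt; x&lt;&#x2F;code&gt; rewrite might be written as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; Illustrative ISLE-style rule (approximate syntax): rewrite an&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; integer add of x and the constant 0 to just x.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(rule (simplify (iadd ty x (iconst ty (u64_from_imm64 0))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      x)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;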
&lt;p&gt;That gives us the logic for one expression, but there is still an open
question of how to apply these rewrites: to which nodes, in what order,
and how to manage or update any uses of a node when that node is
rewritten.&lt;&#x2F;p&gt;
&lt;p&gt;The two general design axes one might consider are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Eager or deferred: do we apply rewrites to a node as soon as it
exists, or apply them later (perhaps as some sort of batch-rewrite)?&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Single-rewrite or fixpoint loop: do we rewrite a node only once, or
apply rewrite rules again to the result of a rewrite? Also, if the
operand of a node is rewritten, do we (and how do we) rewrite users
of that node as well, since more tree-matching patterns may now
apply to the new subtree?&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;It is clear that different answers to these questions could lead to
different efficiency-quality tradeoffs: most obviously, applying
rewrites in a fixpoint should produce better code at the cost of
longer compile time. But also, it seems possible that either eager or
deferred rewrite processing could win, depending on the workload and
particular rules: batching (hence, deferred until one bulk pass) often
leads to efficiency advantages (see the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;3434304&quot;&gt;egg
paper&lt;&#x2F;a&gt; and discussion below!),
but also, deferral may require additional bookkeeping vs. eagerly
rewriting before making use of the (soon to be stale) original value.&lt;&#x2F;p&gt;
&lt;p&gt;For the overall design that we have described so far, there turns out
to be a fairly clear optimal answer, surprisingly: because we build an
acyclic sea-of-nodes, as long as we keep it acyclic during rewrites,
we should be able to do a single rewrite pass rather than a
fixpoint. And, to make that single pass work, we rewrite eagerly, as
soon as we create a node; then use the final rewritten version of that
node for any uses of the original value. Because we visit defs before
uses and do rewrites immediately at the def, we never need to update
(and re-canonicalize!) nodes after creation.&lt;&#x2F;p&gt;
&lt;p&gt;An aside is in order: while it is fairly clear why the
sea-of-nodes-with-CFG is initially acyclic -- because SSA permits
dataflow cycles only through block-parameters &#x2F; phi-nodes, and those
remain in the CFG, which we don&#x27;t &quot;look through&quot; when applying
rewrites -- it is less clear why rewrites should &lt;em&gt;maintain&lt;&#x2F;em&gt;
acyclicity, especially in the face of hashconsing, which may &quot;tie the
knot&quot; of a cycle if we&#x27;re not careful. The answer lies in the previous
paragraph: once we create a node, we never update it. That&#x27;s it! We&#x27;ve
now maintained acyclicity, by construction.&lt;&#x2F;p&gt;
&lt;p&gt;Perhaps surprisingly as well, this rewrite process can be &lt;em&gt;fused&lt;&#x2F;em&gt; with
the translation pass into the sea-of-nodes itself. So we can amend the
above &lt;code&gt;canonicalize&lt;&#x2F;code&gt; to&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def canonicalize(basic_block):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for inst in basic_block:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if is_pure(inst):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      if inst in hashcons_map:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        inst = rewrite(inst)          # NEW&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        nodes.push(inst)              # add to the sea-of-nodes&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        hashcons_map[inst] = inst.value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;i.e., simply add the rewrite rule application at the place we create
nodes, and hashcons based on the final version of the instruction.&lt;&#x2F;p&gt;
&lt;p&gt;Now, note that this is not quite complete yet: &lt;code&gt;inst = rewrite(inst)&lt;&#x2F;code&gt;
is doing some heavy lifting, and is actually a bit too simplistic, in
the sense that it implies that a rewrite rule can only ever rewrite
to &lt;em&gt;one&lt;&#x2F;em&gt; instruction on the right-hand side. This isn&#x27;t quite right:
for example, one may want a DeMorgan rewrite rule &lt;code&gt;~(x &amp;amp; y) -&amp;gt; ~x | ~y&lt;&#x2F;code&gt;. The right-hand side includes three operator nodes (instructions):
two bitwise-NOTs and the OR that uses them. What if &lt;code&gt;x&lt;&#x2F;code&gt; or &lt;code&gt;y&lt;&#x2F;code&gt; in this
pattern also match a subexpression that can be simplified with some
logic rule?&lt;&#x2F;p&gt;
&lt;p&gt;There seem to be two general answers: create the original right-hand
side nodes un-rewritten and later apply rewrites, or immediately and
&lt;em&gt;recursively&lt;&#x2F;em&gt; rewrite. As we observed above, deferral requires
additional bookkeeping and re-canonicalization as a node&#x27;s inputs
change, so we choose the recursive approach. So, concretely, given
&lt;code&gt;~((a &amp;amp; b) &amp;amp; (c &amp;amp; d))&lt;&#x2F;code&gt; and the one rewrite rule above, we would:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Encounter the top-level &lt;code&gt;~&lt;&#x2F;code&gt;, and try to match the rewrite rule&#x27;s
left-hand side. It would match with bindings &lt;code&gt;x = (a &amp;amp; b)&lt;&#x2F;code&gt; and &lt;code&gt;y = (c &amp;amp; d)&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Apply the right-hand side &lt;code&gt;~x | ~y&lt;&#x2F;code&gt; bottom-up, building nodes and rewriting
them as we go:
&lt;ul&gt;
&lt;li&gt;First, &lt;code&gt;~x&lt;&#x2F;code&gt;. This creates &lt;code&gt;~(a &amp;amp; b)&lt;&#x2F;code&gt;, which recursively fires the
rule, which results in &lt;code&gt;(~a | ~b)&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;Then, &lt;code&gt;~y&lt;&#x2F;code&gt;. This creates &lt;code&gt;~(c &amp;amp; d)&lt;&#x2F;code&gt;, again recursively firing the
rule, which results in &lt;code&gt;(~c | ~d)&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;We then create the top-level node on the right-hand side,
resulting in &lt;code&gt;(~a | ~b) | (~c | ~d)&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;One needs to limit the recursion if there is any concern that rule
chain depths may not be statically bounded or easily analyzable, but
otherwise this yields the correct answer in a single pass without the
need to track users of a node to later rewrite and recanonicalize it.&lt;&#x2F;p&gt;
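&lt;p&gt;Concretely, the node-creation path might look like the following sketch
(names are illustrative; the fuel check bounds chained rule applications):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;MAX_REWRITE_DEPTH = 5  # fuel: bound on recursive rule chaining&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def create_node(op, args, depth):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  node = make_node(op, args)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if node in hashcons_map:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return hashcons_map[node]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if depth != MAX_REWRITE_DEPTH:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    node = rewrite(node, depth + 1)  # may itself call create_node&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  nodes.push(node)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  hashcons_map[node] = node.value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return node.value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;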
&lt;p&gt;And that&#x27;s the whole pipeline: we now have a way to optimize code by
translating to sea-of-nodes-with-CFG, applying rewrites as we go, then
translating back to classical SSA CFG. In the process we&#x27;ve achieved
all the goals we set out with: GVN, LICM, DCE, rematerialization, and
algebraic rewrites.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;e-graphs-representing-many-possible-rewrites&quot;&gt;E-graphs: Representing Many Possible Rewrites&lt;&#x2F;h2&gt;
&lt;p&gt;So far, we&#x27;ve described a system that has &lt;em&gt;zero or one&lt;&#x2F;em&gt; deterministic
rewrite for any given node; this is analogous to a classical compiler
pipeline that destructively updates instructions&#x2F;operators. This is
great for rewrite rules like &lt;code&gt;x+0 -&amp;gt; x&lt;&#x2F;code&gt;: the right-hand side is
unambiguously better if it is &quot;smaller&quot; (rewrites a whole expression
into only one of its parts). This is also fine when instructions have
clear and very distinct costs, such as integer divide (typically tens
of cycles or more even on modern CPUs) by a constant &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;opts&#x2F;div_const.rs&quot;&gt;converted into
magic wrapping
multiplies&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;But what about cases where the benefit of a rewrite is less clear, or
depends on context, or depends on how it may or may not be able to
compose with or enable other rewrites in a given program?&lt;&#x2F;p&gt;
&lt;p&gt;For example, consider the classical example from the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;pdf&#x2F;10.1145&#x2F;3434304&quot;&gt;2021 paper on
egg, an e-graph
framework&lt;&#x2F;a&gt;: if we have the
expression &lt;code&gt;(x * 2) &#x2F; 2&lt;&#x2F;code&gt; in our program, we would expect that to
simplify to &lt;code&gt;x&lt;&#x2F;code&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;. To implement this simplification, we might have a
general rewrite rule &lt;code&gt;(x * k) &#x2F; k -&amp;gt; x&lt;&#x2F;code&gt;. But we might also,
separately, have a rewrite rule that &lt;code&gt;(x * 2^k) -&amp;gt; (x &amp;lt;&amp;lt; k)&lt;&#x2F;code&gt;, i.e.,
convert a multiplication into a left-shift operation. If we performed
this latter rewrite eagerly, the former rewrite rule might never match.&lt;&#x2F;p&gt;
&lt;p&gt;(Now, you might complain that we could &lt;em&gt;also&lt;&#x2F;em&gt; convert the divide into
a right-shift, then we have another rewrite rule that simplifies &lt;code&gt;(x &amp;lt;&amp;lt; k) &amp;gt;&amp;gt; k -&amp;gt; x&lt;&#x2F;code&gt;. In this particular example, that might be
reasonable. But (i) that required careful thinking about canonical
forms, where multiplies&#x2F;divides by powers-of-2 are always
canonicalized down to shifts, and (ii) this same fortunate behavior
might not exist for all rulesets.)&lt;&#x2F;p&gt;
&lt;p&gt;In general, we also have a question at the rule-application level: if
multiple rules apply, which do we take? In the above example, we would
need some prioritization scheme to (say) apply
strength-reduction rules to convert to shifts before we examine
divide-of-multiply. That&#x27;s an extra layer of heuristic engineering
that must be considered when designing the optimizer.&lt;&#x2F;p&gt;
&lt;p&gt;Onto the scene, then, comes a new data structure: the &lt;em&gt;e-graph&lt;&#x2F;em&gt;, or
&lt;em&gt;equivalence graph&lt;&#x2F;em&gt;, which is a kind of sea-of-nodes
program&#x2F;expression representation that can represent &lt;em&gt;many different
equivalent forms of a program at once&lt;&#x2F;em&gt;. The key idea is that, rather
than have a single expression node as a referent for any value, we
have an e-class (equivalence class) that contains many e-nodes, and we
can pick any of these e-nodes to compute the value.&lt;&#x2F;p&gt;
&lt;p&gt;The idea is a sort of &lt;em&gt;principled&lt;&#x2F;em&gt; approach to the optimization
problem: let&#x27;s model the state space explicitly, and then pick the
best result objectively. Typically one uses the result of an e-graph
by &quot;extracting&quot; one possible representation of the program according
to a cost metric. (More on this below, but a simple cost metric could
be a static number per operator kind, plus cost of inputs.)&lt;&#x2F;p&gt;
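&lt;p&gt;For instance, extraction with that simple cost metric can be sketched
as a memoized minimum over each e-class (assuming an acyclic e-graph so
the recursion terminates):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def best_cost(eclass):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  # memoized: pick the cheapest e-node in each e-class&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if eclass not in best:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    best[eclass] = min(&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        op_cost(n.op) + sum(best_cost(arg) for arg in n.args)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        for n in eclass.nodes)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return best[eclass]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;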
&lt;p&gt;The magic of e-graphs is how they can &lt;em&gt;compress&lt;&#x2F;em&gt; a very large
combinatorial space of equivalent programs into a small data
structure. A detailed exploration of how this works is beyond the
scope of this blog post (please read the egg paper: it&#x27;s very good!)
but a very short intuitive summary might be something like:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Ensuring that all value uses point to an &lt;em&gt;e-class&lt;&#x2F;em&gt; rather than a
particular node will propagate knowledge of equivalences to
maximally many places. That is, if we know that &lt;code&gt;op1 v1, v2&lt;&#x2F;code&gt; is
equivalent to &lt;code&gt;op2 v3, v4&lt;&#x2F;code&gt;, all users of the &lt;code&gt;op1 v1, v2&lt;&#x2F;code&gt; expression
should automatically get the knowledge propagated that they can use
any form. This knowledge propagation is the essence of &quot;equality
saturation&quot; that e-graphs enable.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;A strong regime of canonicalization and &quot;re-interning&quot;
(re-hashconsing), which the egg paper calls &quot;rebuilding&quot;, ensures
that such information is maximally propagated. Basically, when we
discover that the &lt;code&gt;op1&lt;&#x2F;code&gt; and &lt;code&gt;op2&lt;&#x2F;code&gt; expressions above are equivalent,
we re-process all users of both &lt;code&gt;op1&lt;&#x2F;code&gt; and &lt;code&gt;op2&lt;&#x2F;code&gt;, looking for more
follow-on consequences. Merging those two might in turn cause other
expressions to be equivalent or other rewrite rules to fire.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;practical-efficiency-of-classical-e-graphs&quot;&gt;Practical Efficiency of Classical E-graphs&lt;&#x2F;h3&gt;
&lt;p&gt;The two problems that arise with a &quot;classical e-graph&quot; (by which I
include the 2021 egg paper&#x27;s batched-rebuilding formulation) are
&lt;em&gt;blowup&lt;&#x2F;em&gt; -- that is, too many rewrite rules apply and the e-graph
becomes too large -- and &lt;em&gt;data-structure inefficiency&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The blowup problem is easier to understand: if we allow for
representing many different forms of the program, maybe we will
represent too many, and run out of memory and processing time. It is
often hard to control how rules will compose and lead to blowup, as
well: each rewrite rule may seem reasonable in isolation, but the
transitive closure of all possible programs under a well-developed set
of equivalences can be massive. So practical applications of e-graphs
usually need some kind of meta&#x2F;strategy driver layer that uses &quot;fuel&quot;
to bound effort, and&#x2F;or selectively applies rewrites where they are
likely to lead to better outcomes. Even then, this operating regime
often has compile-times measured in seconds or worse. This may be
appropriate for certain kinds of optimization problems where
compilation happens once or rarely and the quality of the outcome is
extremely important (e.g., hardware design), but not for a fast
compiler like Cranelift.&lt;&#x2F;p&gt;
&lt;p&gt;We can protect against such outcomes with careful heuristics, though,
and the possibility of allowing for objective choice of the best
possible expression is still very tempting. So in my initial
experiments, I applied the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;egg&quot;&gt;egg crate&lt;&#x2F;a&gt;
to the problem and eventually, with custom tweaks, managed to get
e-graph roundtripping to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;27#issuecomment-1176689988&quot;&gt;23%
overhead&lt;&#x2F;a&gt;
-- with no rewrites applied. That&#x27;s not bad at first glance but it
proposes to replace an optimization pipeline that itself takes only
10% of compile-time, and we haven&#x27;t yet added the rewrites to the
23%. (And the 23% came after a good amount of data-structure
engineering to reduce storage; the initial overhead was over 2x.)&lt;&#x2F;p&gt;
&lt;p&gt;In profiling the optimizer&#x27;s execution, the overheads were occurring
more or less in &lt;em&gt;building the e-graph itself&lt;&#x2F;em&gt; (that is, cache misses
throughout the code transcribing IR to the e-graph). And what does the
e-graph contain? Per e-class, it contains a &quot;parent pointer&quot; list: we
need to track users of every e-class so that we can re-canonicalize
them during the &quot;rebuild&quot; step when e-classes are merged (a new
equivalence is discovered). And, even more fundamentally, it stores
e-nodes separately from e-classes, which is an essential element of
the idea but means that we have (at least) two different entities for
each value, even when most e-classes have only one e-node.&lt;&#x2F;p&gt;
&lt;p&gt;Is there any way to simplify the data structures so that we don&#x27;t have
to store so many different bits for one value?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;insight-1-implicit-e-graph-in-the-ssa-ir&quot;&gt;Insight #1: Implicit E-graph in the SSA IR&lt;&#x2F;h3&gt;
&lt;p&gt;The first major insight that enabled efficient implementation of an
e-graph in Cranelift was that we could &lt;em&gt;redefine the existing IR into
an implicit e-graph&lt;&#x2F;em&gt;, without copying over the whole function body
into an e-graph and back, thus avoiding the compile-time penalty of
this data movement. (Data movement can be very expensive when the main
loops of a program are otherwise fairly optimized! It is best to keep
and operate on data in-place whenever possible.)&lt;&#x2F;p&gt;
&lt;p&gt;We start with a sea-of-nodes-with-CFG, where we have an IR with SSA
values not placed in basic blocks. We can already build this
&quot;in-place&quot; in Cranelift&#x27;s IR, CLIF, by removing existing SSA
definitions from the CFG but keeping their data in the data-flow graph
(DFG) data structures.&lt;&#x2F;p&gt;
&lt;p&gt;Then, to allow for multi-representation in an e-graph, the idea is to
discard the separation between e-classes and e-nodes, and instead
define a new kind of IR node that is a &lt;em&gt;union&lt;&#x2F;em&gt; node. Rather than two
index spaces, for e-nodes and e-classes, we have only one index space,
the SSA value space. An SSA value is either an ordinary operator
result or a block parameter (as before), or a &lt;em&gt;union&lt;&#x2F;em&gt; of two other SSA
values. Any arbitrary e-class can then be represented via a binary
tree of union nodes. We don&#x27;t need to change anything about operator
arguments to make use of this representation: operators already refer
to value numbers, and an e-class of multiple e-nodes (defined by the
&quot;top&quot; union node in its union tree) already has a value number.&lt;&#x2F;p&gt;
&lt;p&gt;The coolest thing about this representation is: once we have a
sea-of-nodes, it is &lt;em&gt;already implicitly an e-graph&lt;&#x2F;em&gt;, with &quot;trivial&quot;
(one-member) e-classes for each e-node. Thus, the lift from
sea-of-nodes to e-graph is a no-op -- the best (and cheapest) kind of
compile-time pass. We only pay for multi-representation when we use
that capability, creating union nodes as needed.&lt;&#x2F;p&gt;
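&lt;p&gt;As a sketch (with hypothetical constructor names; the actual layout of
Cranelift&#x27;s DFG differs in detail), the single value space might look
like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# One index space: every Value is an SSA value number, defined by&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# exactly one of:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#   Result(inst, n)      -- the n&#x27;th result of an instruction&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#   BlockParam(block, n) -- as before&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#   Union(a, b)          -- &quot;equivalent to both a and b&quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;#&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# An eclass with members {x, y, z} is just a tree of unions:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v_xy  = union x, y&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v_xyz = union v_xy, z&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# and all subsequent uses refer to v_xyz, the &quot;top&quot; of the tree.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;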
&lt;h3 id=&quot;insight-2-acyclicity-with-eager-rewrites&quot;&gt;Insight #2: Acyclicity with Eager Rewrites&lt;&#x2F;h3&gt;
&lt;p&gt;The other aspect of the classical e-graph data structure&#x27;s cost has to
do with its need to &lt;em&gt;rebuild&lt;&#x2F;em&gt;, and in order to do so, to track all
uses of an e-class (its &quot;parents&quot; in egg&#x27;s terminology). Cranelift
does not keep bidirectional use-def links, and the binary tree of
union nodes would make them still more complex to track.&lt;&#x2F;p&gt;
&lt;p&gt;In trying to address this cost, I started with a somewhat radical
question: what would happen if we &lt;em&gt;never&lt;&#x2F;em&gt; rebuilt (to propagate
equalities)? How much &quot;optimization goodness&quot; would that give up?&lt;&#x2F;p&gt;
&lt;p&gt;If one (i) builds an e-graph then (ii) applies rewrite rules to find
better versions of nodes, adding to e-classes, then the answer is that
this would hardly work at all: this would mean that all users of a
value would see only its initial form and never its rewrites. The
rewritten forms would float in the sea-of-nodes, and union-nodes
joining them to the original forms would exist, but no users would
actually refer to those union nodes.&lt;&#x2F;p&gt;
&lt;p&gt;Instead, what is needed is to apply rewrites &lt;em&gt;eagerly&lt;&#x2F;em&gt;. When we create
a new node in the sea-of-nodes, we apply all rewrites immediately,
then join those rewrites with the original form with union nodes. The
&quot;top&quot; of that union tree is then the value number used as the
&quot;optimized form&quot; of that original value, referenced by all subsequent
uses.&lt;&#x2F;p&gt;
&lt;p&gt;The union-node representation plays a key part in this story: it acts
as an &lt;em&gt;immutable data structure&lt;&#x2F;em&gt; in a sense, where we always append
new knowledge and union it with existing values, and refer to that
&quot;newer version&quot; of an e-class; but we never go back and update
existing references.&lt;&#x2F;p&gt;
&lt;p&gt;This has a very nice implication for the graph structure of the sea of
nodes as well: it preserves acyclicity! Classical e-graphs, in their
rebuild step, can create cycles even when the input is acyclic because
they can condense nodes arbitrarily. But when we eagerly rewrite, then
freeze, we can never &quot;tie the knot&quot; and create a cycle.&lt;&#x2F;p&gt;
&lt;p&gt;This acyclicity is important because it permits a &lt;em&gt;single pass&lt;&#x2F;em&gt; for
the rewrites. In fact, taking our sea-of-nodes build algorithm above
as a baseline, we can add eager rewriting as a very small change: when
we apply rewrites, we build a &quot;union-node spine&quot; to join all rewritten
forms, rather than destructively taking only the rewritten form.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def canonicalize_and_rewrite(basic_block):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for inst in basic_block:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if is_pure(inst):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      if inst in hashcons_map:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        optimized = rewrites(inst)                     # NEW&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        union = join_with_union_nodes(inst, optimized) # NEW&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        optimized_form[inst.def] = union&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       # ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;All of these aspects work together and cannot really be separated:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Union nodes allow for a cheap, pay-as-you-go representation for
e-classes, without a two-level data structure (nodes and classes)
and without parent pointers.&lt;&#x2F;li&gt;
&lt;li&gt;Eager rewriting, applied as we build the e-graph (sea of nodes),
allows for a single-pass algorithm and ensures all members of the
e-class are present before it is &quot;sealed&quot; by union nodes and
referenced by uses.&lt;&#x2F;li&gt;
&lt;li&gt;Acyclicity, present in the input (because of SSA), is preserved by
the append-only, immutable nature of union nodes, and permits eager
rewriting to work in a single pass.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Note that here we are glossing over &lt;em&gt;recursive&lt;&#x2F;em&gt; rewrites. Due to space
constraints I will only outline the problem and solution briefly: the
right-hand side of a rewrite rule application (&lt;code&gt;rewrites&lt;&#x2F;code&gt; above) will
produce nodes that themselves may be able to trigger further
rewrites. Rather than leave this to another iteration of a rewrite
loop, as a classical e-graph driver might do, we want to eagerly
rewrite this right-hand side as well before establishing any uses of
it. So we recursively re-invoke &lt;code&gt;rewrites&lt;&#x2F;code&gt;; this also occurs within the
right-hand side of rules as &lt;em&gt;pieces&lt;&#x2F;em&gt; of the final expression are
created. This recursion is tightly bounded (in levels and in
total rewrite invocations per top-level loop iteration) to prevent
blowup.&lt;&#x2F;p&gt;
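&lt;p&gt;A sketch of the bounded recursion (the limit names here are
illustrative, not Cranelift&#x27;s actual constants):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def rewrites(inst, depth=0, fuel=[MAX_REWRITES]):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if depth &amp;gt;= MAX_DEPTH or fuel[0] == 0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return []                  # out of fuel: keep only the original form&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  results = []&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for rhs in matching_rules(inst):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fuel[0] -= 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    results.append(rhs)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # eagerly rewrite the right-hand side itself, before any&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # use of it is established&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    results.extend(rewrites(rhs, depth + 1, fuel))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if fuel[0] == 0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      break&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return results&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;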
&lt;p&gt;Finally, we are also glossing over details of how we apply our
pattern-matching&#x2F;rewrite DSL, ISLE, to the rewrite problem when
&lt;em&gt;multiple&lt;&#x2F;em&gt; rewrites are now permitted. In brief, we extended the
language to permit &quot;multi-extractors&quot; and &quot;multi-constructors&quot;: rather
than matching only one rule, and disambiguating by priority, we take
all applicable rules. The
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;27&quot;&gt;RFC&lt;&#x2F;a&gt; has more
details.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-extraction-problem&quot;&gt;The Extraction Problem&lt;&#x2F;h3&gt;
&lt;p&gt;So we now have a way to represent &lt;em&gt;multiple&lt;&#x2F;em&gt; expressions as
alternatives to compute the same value. How do we compile this
program? It surely wouldn&#x27;t make sense to compile &lt;em&gt;all&lt;&#x2F;em&gt; of these
expressions: they produce the same bits, so we only need one. Which
one do we pick?&lt;&#x2F;p&gt;
&lt;p&gt;This is the &lt;em&gt;extraction problem&lt;&#x2F;em&gt;, and it is both easy to state and
deceptively hard (in fact, NP-hard): choose the &lt;em&gt;easiest&lt;&#x2F;em&gt; (cheapest)
expression to compute any given value.&lt;&#x2F;p&gt;
&lt;p&gt;Why is this &lt;em&gt;hard&lt;&#x2F;em&gt;? First, let&#x27;s construct the case where it&#x27;s easy.
Let&#x27;s say that we have one root expression (say, returned from a
function) with all pure operators. This forms a tree of choices: each
eclass lets us choose one enode to compute it, and that enode has
arguments that themselves refer to eclasses with choices.&lt;&#x2F;p&gt;
&lt;p&gt;Given this &lt;em&gt;tree&lt;&#x2F;em&gt; of choices, with every choice independent, we can
pick the best choice for each subtree, and compute the cost of any
given expression node as best-cost-of-args plus that own node&#x27;s cost
to compute. In more formal algorithmic terms, that is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Optimal_substructure&quot;&gt;&lt;em&gt;optimal
substructure&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, as soon as we permit &lt;em&gt;references to shared nodes&lt;&#x2F;em&gt; (a
DAG rather than a tree), this nice structure evaporates. To see why,
consider: we could have two eclasses we wish to compute&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v0 = union v10, v11&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;v1 = union v10, v12&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;with computations (not shown) &lt;code&gt;v10&lt;&#x2F;code&gt; that costs 10 units to compute,
and &lt;code&gt;v11&lt;&#x2F;code&gt; and &lt;code&gt;v12&lt;&#x2F;code&gt; that each cost 7 units to compute. The optimal
choice at each subproblem is to choose the cheaper computation (&lt;code&gt;v11&lt;&#x2F;code&gt;
or &lt;code&gt;v12&lt;&#x2F;code&gt;), but the program would actually be more globally optimal if
we computed only &lt;code&gt;v10&lt;&#x2F;code&gt; (cost of 10 total). A solver that tries to
recognize this would either process each root (&lt;code&gt;v0&lt;&#x2F;code&gt; and &lt;code&gt;v1&lt;&#x2F;code&gt;) one at a
time and &quot;backtrack&quot; at some point once it sees the additional use, or
somehow build a shared representation of the problem, which is no
longer deconstructed in a way that permits sub-problem solutions to
compose.&lt;&#x2F;p&gt;
&lt;p&gt;In fact, the extraction problem is &lt;em&gt;NP-hard&lt;&#x2F;em&gt;. To see why, I will show
a simple linear-time reduction (mapping) from a known NP-hard problem,
weighted set-cover, to eclass extraction.&lt;&#x2F;p&gt;
&lt;p&gt;Take each weighted set &lt;code&gt;S_j&lt;&#x2F;code&gt; with weight &lt;code&gt;w_j&lt;&#x2F;code&gt; and elements &lt;code&gt;S_j = { x_1, x_2, ... }&lt;&#x2F;code&gt;. Add an enode &lt;code&gt;N_j&lt;&#x2F;code&gt;, with self-cost
(not including args) &lt;code&gt;w_j&lt;&#x2F;code&gt;, and no arguments. Then for each element
&lt;code&gt;x_i&lt;&#x2F;code&gt; in the universe (the union of all sets&#x27; elements), define an
eclass &lt;code&gt;C_i&lt;&#x2F;code&gt;. Then for
each set-element edge (for each &lt;code&gt;i&lt;&#x2F;code&gt;, &lt;code&gt;j&lt;&#x2F;code&gt; such that &lt;code&gt;x_i ∈ S_j&lt;&#x2F;code&gt;), add
an enode to &lt;code&gt;C_i&lt;&#x2F;code&gt; with opaque zero-cost operator &lt;code&gt;SetElt_ij(y)&lt;&#x2F;code&gt; where
&lt;code&gt;y&lt;&#x2F;code&gt; is the singleton eclass containing the set node &lt;code&gt;N_j&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Performing an optimal (lowest-cost) extraction, with all eclasses
&lt;code&gt;C_i&lt;&#x2F;code&gt; taken as roots, will compute the lowest-weight set cover: the
choice of enode in each eclass &lt;code&gt;C_i&lt;&#x2F;code&gt; encodes which set we choose to
cover element &lt;code&gt;x_i&lt;&#x2F;code&gt;. Thus, because egraph extraction with shared
structure can compute the solution to an NP-hard problem (weighted set
cover), egraph extraction with shared structure is NP-hard.&lt;&#x2F;p&gt;
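&lt;p&gt;As a sketch, the reduction can be written out directly (hypothetical
&lt;code&gt;enode&lt;&#x2F;code&gt; and &lt;code&gt;eclass&lt;&#x2F;code&gt; constructors; any egraph with
explicit per-node costs would do):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def reduce_set_cover(sets):   # sets: list of (weight, elements) pairs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  # one zero-argument enode per set, with the set&#x27;s weight as its cost&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  set_node = [enode(op=f&quot;Set_{j}&quot;, args=[], cost=w)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              for j, (w, _) in enumerate(sets)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  universe = set().union(*(elems for _, elems in sets))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  roots = []&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for x in universe:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # one zero-cost member per set containing x; choosing it during&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    # extraction means &quot;cover x with set j&quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    members = [enode(op=f&quot;SetElt_{x}_{j}&quot;, args=[set_node[j]], cost=0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               for j, (_, elems) in enumerate(sets) if x in elems]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    roots.append(eclass(members))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return roots   # extracting all roots yields a lowest-weight cover&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;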
&lt;p&gt;OK, but we want a fast compiler. What do we do?&lt;&#x2F;p&gt;
&lt;p&gt;The classical compiler-literature answer to this problem -- seen over
and over in a 50-year history -- is &quot;solve a simpler approximation
problem&quot;. Register allocation, for example, is filled with simplified
problem models (linear scan, no live-range splitting, ...) that
reduce the decision space and allow for a simpler algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;In our case, we solve the extraction problem with a simplifying
choice: we will not try to account for shared substructure and the way
that it complicates accounting of cost. In other words, we&#x27;ll ignore
shared substructure, pretending that each use of a subtree counts that
subtree&#x27;s cost anew. For each enode, having computed the cost of each
of its arguments, we can compute its own cost easily as the sum of its
arguments&#x27; costs plus its own computation cost; and for each eclass, we can
pick the minimum-cost enode. That&#x27;s it!&lt;&#x2F;p&gt;
&lt;p&gt;We implement this with a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Dynamic_programming&quot;&gt;dynamic
programming&lt;&#x2F;a&gt;
algorithm: we do a toposort of the aegraph (which can always be done,
because it&#x27;s acyclic), then process nodes from leaves upward,
accumulating cost and picking minima at each subproblem. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;dd2dd8d9f0a0a06c34e364716d58acf67236ba6a&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;egraph&#x2F;elaborate.rs#L307-L391&quot;&gt;This is a
single
pass&lt;&#x2F;a&gt;
and is a relatively fast and straightforward algorithm.&lt;&#x2F;p&gt;
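&lt;p&gt;In pseudocode, the extraction pass looks something like this (a sketch
of the idea, not the actual Cranelift code linked above):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def extract(aegraph):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  best = {}                     # value -&amp;gt; (cost, chosen enode)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  for v in toposort(aegraph):   # leaves first; exists because acyclic&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if is_union(v):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      # an eclass&#x27;s cost is the minimum over its members&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      best[v] = min(best[v.left], best[v.right], key=lambda c: c[0])&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    else:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      # ignore sharing: each use of an argument pays its full cost&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      cost = op_cost(v) + sum(best[arg][0] for arg in v.args)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      best[v] = (cost, v)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return best&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;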
&lt;p&gt;After the Dagstuhl seminar in January, I had an ongoing discussion
with collaborators Alexa VanHattum and Nick Fitzgerald about whether
we could do better here. Alexa and Nick both prototyped a bunch of
interesting alternatives: dynamically updating (shortcutting to zero)
costs when subtrees become used (&quot;sunk-cost&quot; accounting), computing
costs by doing full top-down traversals rather than bottom-up dynamic
programming (and then mixing in memoization somehow), trying to
account for sharing by doing DP but &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;12230&quot;&gt;tracking the full set of covered
leaves&lt;&#x2F;a&gt;, and
some other things. This was an interesting exploration but in the end
we didn&#x27;t find anything that looked better in the compile-time &#x2F;
execution-time tradeoff space. We have an &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;12156&quot;&gt;issue tracking
this&lt;&#x2F;a&gt; and
more ideas are always welcome, of course.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;other-aspects&quot;&gt;Other Aspects&lt;&#x2F;h3&gt;
&lt;p&gt;There are two other aspects of our aegraph implementation that I don&#x27;t
have space to go into in this post:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;There is an interesting problem that arises with respect to the
domtree and SSA invariants when different values are merged together
with a union node and some of them have wider &quot;scope&quot; than
others. For example, via store-to-load forwarding we may know that a
load instruction produces a constant &lt;code&gt;0&lt;&#x2F;code&gt;; so we might have a union
node with &lt;code&gt;iconst 0&lt;&#x2F;code&gt;. The load can only happen at its current
location, but &lt;code&gt;iconst 0&lt;&#x2F;code&gt; can be computed anywhere. A user of this
eclass should be able to pick either value (said another way:
extraction should not be load-bearing for correctness). If the user
is within the dominance subtree under the load, then all is fine,
but if not, e.g. if some other user of &lt;code&gt;iconst 0&lt;&#x2F;code&gt; elsewhere in the
function errantly happened upon the eclass-neighbor load
instruction, we might get an invalid program.&lt;&#x2F;p&gt;
&lt;p&gt;There are many ways one might be tempted to solve this, but in the
end we landed on an &quot;available block&quot; analysis that runs as we build
nodes. For every node, we record the &quot;highest&quot; block in the domtree
at which it can be computed: function entry for pure
zero-argument nodes, the current block for any impure node, and
otherwise the lowest block in the domtree among the available blocks
of all arguments. (Claim: the available blocks of all args of a node
will form an ancestor path in the domtree; one will always exist that
is dominated by all others. This follows from the properties of SSA.)
Then when we insert into the hashcons map, we insert at the level
that the final union is available.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We also have an important optimization that we call &lt;code&gt;subsume&lt;&#x2F;code&gt;. This
is an identity operator that wraps a value returned by a rewrite
rule. It is not required for correctness, but its semantics are: if
any value in an eclass is marked &quot;subsume&quot;, the subsuming values &lt;em&gt;erase&lt;&#x2F;em&gt;
all other members of the eclass. Usually, only one subsuming rule
will match (but this, also, is not necessary for correctness).&lt;&#x2F;p&gt;
&lt;p&gt;The usual use-case is for rules that have clear &quot;directionality&quot;: it
is always better to say &lt;code&gt;2&lt;&#x2F;code&gt; than &lt;code&gt;(iadd 1 1)&lt;&#x2F;code&gt;, so let&#x27;s go ahead and
shrink the eclass so that all further matching, and eventual
extraction, is more efficient.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
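&lt;p&gt;The available-block analysis described above amounts to something like
the following (illustrative pseudocode; the real analysis runs
incrementally as nodes are built):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;def available_block(node):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if not is_pure(node):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return node.current_block   # impure: pinned to its location&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  if not node.args:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    return entry_block          # pure, no args: available anywhere&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  # by SSA, the args&#x27; available blocks lie on one ancestor path in the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  # domtree; the node is available at the lowest (most dominated) one&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return lowest_in_domtree(available_block(a) for a in node.args)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;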
&lt;h2 id=&quot;evaluation&quot;&gt;Evaluation&lt;&#x2F;h2&gt;
&lt;p&gt;So how does all of this actually work? Do aegraphs benefit Cranelift&#x27;s
strength as a compiler -- its ability to optimize code, its efficiency
in doing so quickly, or both?&lt;&#x2F;p&gt;
&lt;p&gt;This is the part where I offer a somewhat surprising conclusion: the
tl;dr of this post is that I believe the &lt;em&gt;sea-of-nodes-with-CFG&lt;&#x2F;em&gt;
aspect of this mid-end works great, but the &lt;em&gt;aegraph itself&lt;&#x2F;em&gt; -- the
ability to represent multiple options for one value -- may not (yet?)
be pulling its weight. It doesn&#x27;t really hurt much either, so maybe
it&#x27;s a reasonable capability to keep around. But in any case, it&#x27;s an
interesting conclusion and we&#x27;ll dig more into it below.&lt;&#x2F;p&gt;
&lt;p&gt;The main interesting evaluation is a two-dimensional comparison of
&lt;em&gt;compile time&lt;&#x2F;em&gt; -- that is, how long Cranelift takes to compile code --
on the X-axis, versus &lt;em&gt;execution time&lt;&#x2F;em&gt; -- that is, how long the
resulting code takes to execute -- on the Y-axis. This forms a
tradeoff space: it may be good to spend a little more time to compile
if the resulting code runs faster (or vice-versa), for example. Of
course, reducing both is best. One point may be &quot;strictly better&quot; than
another if it reduces both -- then there is no tradeoff, because one
would always choose the configuration that both compiles faster and
produces better code. (One can then find the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Pareto_front&quot;&gt;Pareto
frontier&lt;&#x2F;a&gt; of points that
form a set in which none is strictly better than another -- these are
all &quot;valid configuration points&quot; that one may rationally choose
depending on one&#x27;s goals.)&lt;&#x2F;p&gt;
&lt;p&gt;Below we have a compile-time vs. execution-time plot for a number of
configurations of Cranelift, compiling and running the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;sightglass&#x2F;&quot;&gt;Sightglass
benchmark suite&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;No optimizations enabled;&lt;&#x2F;li&gt;
&lt;li&gt;The (on by default) aegraph-based optimization pipeline, as
described in this post, with several variants (below);&lt;&#x2F;li&gt;
&lt;li&gt;A &quot;classical optimization pipeline&quot; that does &lt;em&gt;not&lt;&#x2F;em&gt; form a
sea-of-nodes-with-CFG at all; instead, it applies exactly the same
rewrite rules, but in-place, and interleaves with classical GVN and
LICM passes;&lt;&#x2F;li&gt;
&lt;li&gt;Variants of the aegraphs pipeline and classical pipeline with the
whole mid-end repeated 2 or 3 times (to test whether code continues
to get better).&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Here&#x27;s the main result:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2026-04-09-scatterplot-web.svg&quot; alt=&quot;Figure: compile time vs. execution time&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A few conclusions are in order. First, the aegraph pipeline does
generate better code than the classical pipeline. This objective
result is &quot;mission accomplished&quot; with respect to the aegraph effort&#x27;s
original motivation: we wanted to allow optimization passes to
interact more finely and optimize more completely. Note in particular
that repeating the classical pipeline multiple times does &lt;em&gt;not&lt;&#x2F;em&gt; get
the same result; we could not have obtained the ~2% speedup without
building a new optimization framework.&lt;&#x2F;p&gt;
&lt;p&gt;Second, though, there is clearly a Pareto frontier that includes &quot;no
optimizations&quot; and &quot;classical pipeline&quot; as well as the aegraph
variants: each takes more compilation time than the previous. In other
words, moving from a classical compiler pipeline to the design
described here, we spend about 7-8% more compile time. Notably, this
is &lt;em&gt;not&lt;&#x2F;em&gt; the result that we had when we first built the aegraphs
implementation in 2023 and switched over -- at that time, we were more
or less at parity. This is likely a result of the growth of the body
of rewrite rules over the intervening three years.&lt;&#x2F;p&gt;
&lt;p&gt;To get a better picture of how aegraph&#x27;s various design choices
matter, let&#x27;s zoom into the area in the red ellipse above, which
contains multiple &lt;em&gt;variants&lt;&#x2F;em&gt; of the aegraphs pipeline:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&quot;aegraph&quot;: Exactly as described in this post, and default Cranelift
configuration;&lt;&#x2F;li&gt;
&lt;li&gt;&quot;no multivalue (eager pick)&quot;: sea-of-nodes-with-CFG, without union
nodes; i.e., not actually representing more than one equivalent
value in an eclass. Instead, after evaluating rewrite rules, we pick
the best option and use that one option (destructively replacing the
original);&lt;&#x2F;li&gt;
&lt;li&gt;&quot;no rematerialization&quot;: testing the effect of this aspect of the
elaboration algorithm;&lt;&#x2F;li&gt;
&lt;li&gt;&quot;no subsume&quot;: testing this efficiency tweak of the rewrite-rule
application.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Here&#x27;s the plot:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2026-04-09-scatterplot-aegraph-web.svg&quot; alt=&quot;Figure: aegraphs variants&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;One can see that there are some definite tradeoffs, &lt;em&gt;but&lt;&#x2F;em&gt; looking
closely at the axis scales, these effects are very small. In
particular, moving from sea-of-nodes-with-CFG to true aegraph (taking
all rewritten values, and picking the best in a principled way with
cost-based extraction) nets us ~0.1% execution-time improvement, at
~0.005% compile-time cost. That&#x27;s more-or-less in the noise.&lt;&#x2F;p&gt;
&lt;p&gt;Supporting that conclusion is the statistic that the average eclass
size after rewriting is 1.13 enodes: in other words, very few cases
with our ruleset and benchmark corpus actually result in more than one
option.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, the most interesting question in my view: does the &lt;em&gt;eager&lt;&#x2F;em&gt;
aspect of aegraphs -- applying rewrite rules right away, and never
going back to &quot;fill in&quot; other equivalences -- matter? In other words,
does skipping equality saturation take the egraph goodness out of an
egraph(-alike)?&lt;&#x2F;p&gt;
&lt;p&gt;We can measure this, too: I instrumented our implementation to track
when a subtree of an eclass is &lt;em&gt;not&lt;&#x2F;em&gt; chosen by extraction, and then
any node in that subtree is later actually elaborated (in other words,
when we use a suboptimal choice because we could not see an equality
in the &quot;wrong&quot; direction). This should only happen if, in theory, our
rules rewrite &lt;code&gt;f&lt;&#x2F;code&gt; to &lt;code&gt;g&lt;&#x2F;code&gt; where &lt;code&gt;cost(g) &amp;gt; cost(f)&lt;&#x2F;code&gt;, and we don&#x27;t have
a rewrite &lt;code&gt;g&lt;&#x2F;code&gt; to &lt;code&gt;f&lt;&#x2F;code&gt;: then a user of &lt;code&gt;g&lt;&#x2F;code&gt; might never directly get a
rewrite of &lt;code&gt;f&lt;&#x2F;code&gt; eagerly, but a later coincidentally-occurring &lt;code&gt;f&lt;&#x2F;code&gt; might
rewrite onto &lt;code&gt;g&lt;&#x2F;code&gt; (but we&#x27;ll never propagate that equality into the
original users of &lt;code&gt;g&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that, in all of our benchmarks, with ~4 million value
nodes created overall, this happens two (2) times. Both instances
occur in &lt;code&gt;spidermonkey.wasm&lt;&#x2F;code&gt; (a large benchmark that consists of the
SpiderMonkey JS engine, compiled to WebAssembly, then run through
Wasmtime+Cranelift), and occur due to an ireduce-of-iadd rewrite rule
that violates this move-toward-lower-cost principle (explicitly, in
the name of simplicity). Overall, we conclude that the eager rewrites
are effective &lt;em&gt;as long as&lt;&#x2F;em&gt; the ruleset is designed with &lt;em&gt;optimization&lt;&#x2F;em&gt;
(rather than mere exploration of all equivalent expressions) in mind.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;discussion&quot;&gt;Discussion&lt;&#x2F;h2&gt;
&lt;p&gt;The most surprising conclusion in all of the data was, for me, that
aegraphs (per se) -- multi-value representations -- don&#x27;t seem to
matter. What?! That was the entire point of the project, and (proper)
e-graphs have shown great promise in other application areas.&lt;&#x2F;p&gt;
&lt;p&gt;I think the main reason for this is that our workload is somewhat
&quot;small&quot; in a combinatorial possibility-space sense: we are (i)
compiling workloads that are often optimized already (as Wasm modules)
before hitting the Cranelift compilation pipeline, and (ii) applying a
set of rewrite rules that, while large and growing (hundreds of
rules), explicitly do &lt;em&gt;not&lt;&#x2F;em&gt; include identities like associativity and
commutativity, or arbitrary algebraic identities, that do not
&quot;simplify&quot; somehow. In other words, if we&#x27;re generally applying
rewrites that look more like simple, obvious &quot;cleanups&quot;, we would
expect that we don&#x27;t hold a &quot;superposition&quot; of multiple good
expression options very often.&lt;&#x2F;p&gt;
&lt;p&gt;Given that it doesn&#x27;t cost us &lt;em&gt;that&lt;&#x2F;em&gt; much compile time to keep
aegraphs around, though, maybe this is... fine? Having the
&lt;em&gt;capability&lt;&#x2F;em&gt; to do principled cost-based extraction is great, versus
having to think about whether a rewrite rule should exist. We still do
try to be careful not to introduce rules that are &lt;em&gt;never&lt;&#x2F;em&gt; productive,
of course.&lt;&#x2F;p&gt;
&lt;p&gt;And, further into the future, one could imagine that workloads with
more optimization opportunity could cause more interesting situations
to occur within the aegraph, leading to more emergent composition in
the rewrites.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;future-directions&quot;&gt;Future Directions&lt;&#x2F;h2&gt;
&lt;p&gt;There are a bunch of directions we could (and should) take this in the
future. In terms of evaluation: finding the &quot;corner of the use-case
domain&quot; where aegraphs truly shine is still an open question. More
concretely: if we evaluate Cranelift with new and different workloads,
and&#x2F;or pile on more rewrite rules, do we get to a point where the
classical benefit of &quot;multi-representation with cost-based extraction&quot;
pays off in a conventional compiler? I don&#x27;t know!&lt;&#x2F;p&gt;
&lt;p&gt;There is also still a lot of room to improve the core algorithms:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Better extraction, as mentioned above: something that accounts for
shared substructure would be great, as long as we don&#x27;t have to pay
the NP-hard cost for it. Maybe there&#x27;s a nice approximation
algorithm that&#x27;s better than our current dynamic-programming
approach.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We&#x27;d like to be able to handle more rewrites that alter the &lt;em&gt;CFG
skeleton&lt;&#x2F;em&gt; as well. Right now, we have a separate ISLE entry-point
that allows for destructive rewriting of skeleton instructions
(thanks to my colleague Nick Fitzgerald for building this!). Beyond
that, maybe we could remove redundant block parameters (phi nodes), for
example; and&#x2F;or maybe we could fold branches; and&#x2F;or maybe we could
apply path-sensitive knowledge to values when used in certain
control-flow contexts (&lt;code&gt;x=1&lt;&#x2F;code&gt; in the dominance subtree under the
&quot;true&quot; branch for &lt;code&gt;if x==1, goto ...&lt;&#x2F;code&gt;). My former colleague Jamey
Sharp wrote up a few excellent, in-depth issues on these topics in
our tracker
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;5623&quot;&gt;#5623&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;6109&quot;&gt;#6109&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;6129&quot;&gt;#6129&lt;&#x2F;a&gt;)
and I think there is a lot of potential here.&lt;&#x2F;p&gt;
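&lt;p&gt;For instance, a path-sensitive rewrite might look like the
following, sketched in CLIF-like pseudocode (not something Cranelift
does today):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;block0(v0: i32):
  v1 = icmp_imm eq v0, 1
  brif v1, block1, block2

block1:
  ;; dominated by the taken edge of the branch, so here v0 == 1,
  ;; and uses of v0 could be rewritten to iconst.i32 1, enabling
  ;; constant folding downstream:
  v2 = imul_imm v0, 8   ;; could become v2 = iconst.i32 8
  ...&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;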
&lt;p&gt;(The full version of this is, again, something like
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1912.05036&quot;&gt;RVSDG&lt;&#x2F;a&gt;: representing control
flow in the node language seems like the most principled option to
express all useful forms of control-flow rewrites. Jamey also has a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;jameysharp&#x2F;optir&quot;&gt;prototype called
&lt;code&gt;optir&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; for this.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;It would be interesting to experiment with incorporating our
&lt;em&gt;lowering backend rules&lt;&#x2F;em&gt; into the aegraph somehow: they are a rich,
fruitful target-specific database of natural &quot;costs&quot; for various
operations. For example, on AArch64 we can fold shifts and extends
into (some) arithmetic operations for &quot;free&quot;; maybe this alters the
extraction choices we make. Or likewise for the various odd corners
of addressing modes on each architecture.&lt;&#x2F;p&gt;
&lt;p&gt;The simple version of this idea is to incorporate lowering rules as
rewrites, and make the egraph&#x27;s node language a union of CLIF and
the machine&#x27;s instruction set. But maybe there&#x27;s something better we
could do instead, allowing multi-extractors to see the aegraph
eclasses directly and keeping various VCode sequences. I need to
write up more of my ideas on this topic someday. Jamey also has more
thoughts on this in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;8529&quot;&gt;#8529&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
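&lt;p&gt;As a concrete (hypothetical) illustration of how target knowledge
could shift extraction choices:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;;; CLIF:
v2 = ishl_imm v1, 2
v3 = iadd v0, v2

;; The AArch64 backend can fold the shift into the add:
;;   add x3, x0, x1, lsl #2
;; so a target-aware cost model could price the ishl_imm at zero
;; when it feeds an iadd, changing which of several equivalent
;; forms the extractor should prefer.&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;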
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;I&#x27;m sure there are other things that could be done here too!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further Reading&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;I gave a talk about aegraphs at &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;pldi23.sigplan.org&#x2F;home&#x2F;egraphs-2023&quot;&gt;EGRAPHS
2023&lt;&#x2F;a&gt;:
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;pubs&#x2F;egraphs2023_aegraphs_slides.pdf&quot;&gt;slides&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;vimeo.com&#x2F;843540328&quot;&gt;re-recorded video&lt;&#x2F;a&gt; (the original was
not recorded).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;I gave a talk about aegraphs at the January 2026 &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.dagstuhl.de&#x2F;seminars&#x2F;seminar-calendar&#x2F;seminar-details&#x2F;26022&quot;&gt;Dagstuhl e-graphs
seminar&lt;&#x2F;a&gt;;
the &lt;a href=&quot;&#x2F;assets&#x2F;cfallin-aegraphs-dagstuhl-20260108.pdf&quot;&gt;slides&lt;&#x2F;a&gt; are a
heavily updated and amended version of the 2023 talk, with the
experiments&#x2F;data I presented here.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;There is a Cranelift RFC on aegraphs
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;27&quot;&gt;here&lt;&#x2F;a&gt;, and one on
ISLE (the rewrite DSL that we use to drive rewrites in the aegraph)
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;15&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The main PR that implemented the current form of aegraphs is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5382&quot;&gt;here&lt;&#x2F;a&gt;,
co-authored by my former colleague Jamey Sharp (this production
implementation was a fantastically fun and productive
pair-programming project!).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;acknowledgments&quot;&gt;Acknowledgments&lt;&#x2F;h2&gt;
&lt;p&gt;Thanks to many folks for discussion of the ideas around aegraphs
through the years: Nick Fitzgerald, Jamey Sharp, Trevor Elliott, Max
Willsey, Alexa VanHattum, Max Bernstein, and many others at the
Dagstuhl e-graphs seminar. None of them reviewed this post (it had
been languishing for too long already and I wanted to get it out), so
any errors herein are solely my own!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;Possibly with masking of the top bit if our IR semantics have
defined wrapping&#x2F;truncation behavior: &lt;code&gt;x &amp;amp; 0x7fff..ffff&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Exceptions in Cranelift and Wasmtime</title>
        <published>2025-11-06T00:00:00+00:00</published>
        <updated>2025-11-06T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2025/11/06/exceptions/"/>
        <id>https://cfallin.org/blog/2025/11/06/exceptions/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2025/11/06/exceptions/">&lt;p&gt;&lt;em&gt;Note: this post is also cross-posted to the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;&quot;&gt;Bytecode Alliance
blog&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;wasmtime-exceptions&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This is a blog post outlining the odyssey I recently took to implement
the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;webassembly&#x2F;exception-handling&quot;&gt;Wasm exception-handling
proposal&lt;&#x2F;a&gt; in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;wasmtime.dev&#x2F;&quot;&gt;Wasmtime&lt;&#x2F;a&gt;, the open-source
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;webassembly.org&#x2F;&quot;&gt;WebAssembly&lt;&#x2F;a&gt; engine for which I&#x27;m a core
team member&#x2F;maintainer, and its &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cranelift.dev&#x2F;&quot;&gt;Cranelift&lt;&#x2F;a&gt;
compiler backend.&lt;&#x2F;p&gt;
&lt;p&gt;When first discussing this work, I made an off-the-cuff estimate in
the Wasmtime biweekly project meeting that it would be &quot;maybe two
weeks on the compiler side and a week in Wasmtime&quot;. Reader, I need to
make a confession now: I was wrong and it was &lt;em&gt;not&lt;&#x2F;em&gt; a three-week
task. This work spanned from late March to August of this year
(roughly half-time, to be fair; I wear many hats). Let that be a
lesson!&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In this post we&#x27;ll first cover what exceptions are and why some
languages want them (and what other languages do instead) -- in
particular what the big deal is about (so-called) &quot;zero-cost&quot;
exception handling. Then we&#x27;ll see how Wasm has specified a
bytecode-level foundation that serves as a least common denominator
but also has some unique properties. We&#x27;ll then take a round trip
through what it means for a &lt;em&gt;compiler&lt;&#x2F;em&gt; to support exceptions -- the
control-flow implications, how one reifies the communication with the
unwinder, how all this intersects with the ABI, etc. -- before finally
looking at how Wasmtime puts it all together (and is careful to avoid
performance pitfalls and stay true to the intended performance of the
spec).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-exceptions&quot;&gt;Why Exceptions?&lt;&#x2F;h2&gt;
&lt;p&gt;Many readers will already be familiar with exceptions as they are
present in languages as widely varied as
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Python_(programming_language)&quot;&gt;Python&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Java_(programming_language)&quot;&gt;Java&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;JavaScript&quot;&gt;JavaScript&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;C%2B%2B&quot;&gt;C++&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lisp_(programming_language)&quot;&gt;Lisp&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;OCaml&quot;&gt;OCaml&lt;&#x2F;a&gt;, and many more. But let&#x27;s
briefly review so we can (i) be precise about what we mean by an exception,
and (ii) discuss &lt;em&gt;why&lt;&#x2F;em&gt; exceptions are so popular.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Exception_handling_(programming)&quot;&gt;Exception
handling&lt;&#x2F;a&gt;
is a mechanism for &lt;em&gt;nonlocal flow control&lt;&#x2F;em&gt;. In particular, most
flow-control constructs are &lt;em&gt;intraprocedural&lt;&#x2F;em&gt; (send control to other
code in the current function) and &lt;em&gt;lexical&lt;&#x2F;em&gt; (target a location that
can be known statically). For example, &lt;code&gt;if&lt;&#x2F;code&gt; statements and &lt;code&gt;loop&lt;&#x2F;code&gt;s
both work this way: they stay within the local function, and we know
exactly where they will transfer control. In contrast, exceptions are
(or can be) &lt;em&gt;interprocedural&lt;&#x2F;em&gt; (can transfer control to some point in
some other function) and &lt;em&gt;dynamic&lt;&#x2F;em&gt; (target a location that depends on
runtime state).&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;To unpack that a bit: an exception is &lt;em&gt;thrown&lt;&#x2F;em&gt; when we want to signal
an error or some other condition that requires &quot;unwinding&quot; the current
computation, i.e., backing out of the current context; and it is
&lt;em&gt;caught&lt;&#x2F;em&gt; by a &quot;handler&quot; that is interested in the particular kind of
exception and is currently &quot;active&quot; (waiting to catch that
exception). That handler can be in the current function, or in any
function that has called it. Thus, an exception throw and catch can
result in an abnormal, early return from a function.&lt;&#x2F;p&gt;
&lt;p&gt;One can understand the need for this mechanism by considering how
programs can handle errors. In some languages, such as Rust, it is
common to see function signatures of the form &lt;code&gt;fn foo(...) -&amp;gt; Result&amp;lt;T, E&amp;gt;&lt;&#x2F;code&gt;. The
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;std&#x2F;result&#x2F;enum.Result.html&quot;&gt;&lt;code&gt;Result&lt;&#x2F;code&gt;&lt;&#x2F;a&gt; type
indicates that &lt;code&gt;foo&lt;&#x2F;code&gt; normally returns a value of type &lt;code&gt;T&lt;&#x2F;code&gt;, but may
produce an error of type &lt;code&gt;E&lt;&#x2F;code&gt; instead. The key to making this ergonomic
is providing some way to &quot;short-circuit&quot; execution if an error is
returned, propagating that error upward: that is, Rust&#x27;s &lt;code&gt;?&lt;&#x2F;code&gt; operator,
for example, which turns into essentially &quot;if there was an error,
return that error from this function&quot;.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; This is quite conceptually
nice in many ways: why should error handling be different than any
other data flow in the program? Let&#x27;s describe the type of results to
include the possibility of errors; and let&#x27;s use normal control flow
to handle them. So we can write code like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; f&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Result&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u32&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Error&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  if&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; bad&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Err&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Error&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;::&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;new&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;  Ok&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; g&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Result&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u32&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Error&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; The `?` propagates any error to our caller, returning early.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; result&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; f&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;?&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;  Ok&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;result&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; +&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt; 1&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and we don&#x27;t have to do anything special in &lt;code&gt;g&lt;&#x2F;code&gt; to propagate errors
from &lt;code&gt;f&lt;&#x2F;code&gt; further, other than use the &lt;code&gt;?&lt;&#x2F;code&gt; operator.&lt;&#x2F;p&gt;
&lt;p&gt;But there is a &lt;em&gt;cost&lt;&#x2F;em&gt; to this: it means that every error-producing
function has a larger return type, which might have ABI implications
(another return register at least, if not a stack-allocated
representation of the &lt;code&gt;Result&lt;&#x2F;code&gt; and the corresponding loads&#x2F;stores to
memory), and also, there is at least one conditional branch after
every call to such a function that checks if we need to handle the
error. The dynamic efficiency of the &quot;happy path&quot; (with no thrown
exceptions) is thus impacted. Ideally, we skip any cost unless an
error actually occurs (and then perhaps we accept slightly more cost
in that case, as tradeoffs often go).&lt;&#x2F;p&gt;
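&lt;p&gt;Concretely, the happy path of a call to such a function might
compile down to something like this (illustrative pseudo-assembly,
not actual compiler output):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;call  f                 ;; Result returned in two registers, say:
                        ;;   tag in rax, payload in rdx
test  rax, rax          ;; check the error tag...
jnz   .propagate_error  ;; ...and branch, after every single call
;; happy path continues here&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;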
&lt;p&gt;It turns out that this is possible with the &lt;em&gt;help of the language
runtime&lt;&#x2F;em&gt;. Consider what happens if we omit the &lt;code&gt;Result&lt;&#x2F;code&gt; return types
and error checks at each return. We will need to reach the code that
handles the error in some other way. Perhaps we can jump directly to
this code somehow?&lt;&#x2F;p&gt;
&lt;p&gt;The key idea of &quot;zero-cost exception handling&quot; is to get the compiler
to build side-tables to &lt;em&gt;tell us&lt;&#x2F;em&gt; where this code -- known as a
&quot;handler&quot; -- is. We can walk the callstack, visiting our caller and
its caller and onward, until we find a function that would be
interested in the error condition we are raising. This logic is
implemented with the help of these side-tables and some code in the
language runtime called the &quot;unwinder&quot; (because it &quot;unwinds&quot; the
stack). If no errors are raised, then none of this logic is executed
at runtime. And we no longer have our explicit checks for error
returns in the &quot;happy path&quot; where no errors occur. This is why
this style of error-handling is commonly called &quot;zero-cost&quot;:
more precisely, it is zero-cost when &lt;em&gt;no&lt;&#x2F;em&gt; errors occur, but the
unwinding in case of error can still be expensive.&lt;&#x2F;p&gt;
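&lt;p&gt;In pseudocode, the unwinder&#x27;s job at a throw looks roughly like
this (heavily simplified):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;pc = address of the throw
frame = current stack frame
loop:
  entry = handler_table.lookup(pc)  ;; side-table built by the compiler
  if entry matches the thrown exception kind:
    restore registers for frame; jump to entry.handler_pc
  else:
    pc = return address stored in frame  ;; visit our caller next
    frame = caller of frame&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;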
&lt;p&gt;This is the status quo for exception-handling implementations in most
production languages: for example, in the C++ world, exception
handling is commonly implemented via the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;itanium-cxx-abi.github.io&#x2F;cxx-abi&#x2F;abi-eh.html&quot;&gt;Itanium C++
ABI&lt;&#x2F;a&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#4&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;, which
defines a comprehensive set of tables emitted by the compiler and a
complex dance between the system unwinding library and
compiler-generated code to find and transfer control to
handlers. Handler tables and stack unwinders are common in interpreted
and just-in-time (JIT)-compiled language implementations, too: for
example, SpiderMonkey has &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;firefox-main&#x2F;rev&#x2F;50a34d25155fd70628ee69c7d68a2509c0e3445d&#x2F;js&#x2F;src&#x2F;vm&#x2F;StencilEnums.h#18&quot;&gt;try
notes&lt;&#x2F;a&gt;
on its bytecode (so named for &quot;try blocks&quot;) and a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;firefox-main&#x2F;rev&#x2F;a5316cedc669bcec09efae23521e0af6b9d3d257&#x2F;js&#x2F;src&#x2F;jit&#x2F;JitFrames.cpp#691&quot;&gt;HandleException&lt;&#x2F;a&gt;
function that &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;firefox-main&#x2F;rev&#x2F;a5316cedc669bcec09efae23521e0af6b9d3d257&#x2F;js&#x2F;src&#x2F;jit&#x2F;JitFrames.cpp#751-845&quot;&gt;walks stack
frames&lt;&#x2F;a&gt;
to find a handler.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-wasm-exception-handling-spec&quot;&gt;The Wasm Exception-Handling Spec&lt;&#x2F;h2&gt;
&lt;p&gt;The WebAssembly specification now (since version 3.0) has &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;webassembly&#x2F;exception-handling&quot;&gt;exception
handling&lt;&#x2F;a&gt;. This
proposal was a long time in the making by various folks in the
standards, toolchain and browser worlds, and the CG (standards group)
has now merged it into the spec and included it in the
recently-released &quot;Wasm 3.0&quot; milestone. If you&#x27;re already familiar
with the proposal, you can skip over this section to the Cranelift-
and Wasmtime-specific bits below.&lt;&#x2F;p&gt;
&lt;p&gt;First: let&#x27;s discuss &lt;em&gt;why&lt;&#x2F;em&gt; Wasm needs an extension to the bytecode
definition to support exceptions. As we described above, the key idea
of zero-cost exception handling is that an unwinder visits stack
frames and looks for handlers, transferring control directly to the
first handler it finds, outside the normal function return
path. Because the call stack is &lt;em&gt;protected&lt;&#x2F;em&gt;, or not directly readable
or writable from Wasm code (part of Wasm&#x27;s &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_integrity&quot;&gt;control-flow
integrity&lt;&#x2F;a&gt;
aspect), an unwinder that works this way necessarily must be a
privileged part of the Wasm runtime itself. We can&#x27;t implement it in
&quot;userspace&quot; because there is no way for Wasm bytecode to transfer
control directly back to a distant caller, aside from a chain of
returns. This missing functionality is what the extension to the
specification adds.&lt;&#x2F;p&gt;
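&lt;p&gt;To make the &quot;chain of returns&quot; workaround concrete: a toolchain
could emulate unwinding by checking a flag after every call, sketched
below in Wasm-like pseudocode (the &lt;code&gt;$exception_pending&lt;&#x2F;code&gt;
global is hypothetical):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;(call $f)
(if (global.get $exception_pending)  ;; emulated unwind: test a flag
    (then (return)))                 ;; and pop one frame at a time&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;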
&lt;p&gt;The implementation comes down to only three opcodes (!), and some new
types in the bytecode-level type system. (In other words -- given the
length of this post -- it&#x27;s deceptively simple.) These opcodes are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;try_table&lt;&#x2F;code&gt;, which wraps an inner body, and specifies &lt;em&gt;handlers&lt;&#x2F;em&gt; to
be active during that body. For example:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(block $b1    ;; defines a label for a forward edge to the end of this block&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (block $b2  ;; likewise, another label&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (try_table&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      (catch $tag1 $b1) ;; exceptions with tag `$tag1` will be caught by code at $b1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      (catch_all $b2)   ;; all other exceptions will be caught by code at $b2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      body...)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this example, if an exception is thrown from within the code in
&lt;code&gt;body&lt;&#x2F;code&gt;, and it matches one of the specified tags (more below!),
control will transfer to the location defined by the end of the
given block. (This is the same as other control-flow transfers in
Wasm: for example, a branch &lt;code&gt;br $b1&lt;&#x2F;code&gt; also jumps to the end of
&lt;code&gt;$b1&lt;&#x2F;code&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;This construct is the single all-purpose &quot;catch&quot; mechanism, and is
powerful enough to directly translate typical &lt;code&gt;try&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;catch&lt;&#x2F;code&gt; blocks
in most programming languages with exceptions.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;throw&lt;&#x2F;code&gt;: an instruction to directly throw a new exception. It
carries the tag for the exception, like: &lt;code&gt;throw $tag1&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;throw_ref&lt;&#x2F;code&gt;, used to rethrow an exception that has already been
caught and is held by reference (more below!).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;And that&#x27;s it! We implement those three opcodes and we are &quot;done&quot;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;payloads&quot;&gt;Payloads&lt;&#x2F;h3&gt;
&lt;p&gt;That&#x27;s not the whole story, of course. Ordinarily a source language
will offer the ability to carry some &lt;em&gt;data&lt;&#x2F;em&gt; as part of an exception:
that is, the error condition is not just one of a static set of kinds
of errors, but contains some fields as well. (E.g.: not just &quot;file not
found&quot;, but &quot;file not found: $PATH&quot;.)&lt;&#x2F;p&gt;
&lt;p&gt;One could build this on top of a bytecode-level exception-throw
mechanism that only had throw&#x2F;catch with static tags, with the help of
some global state, but that would be cumbersome; instead, the Wasm
specification offers &lt;em&gt;payloads&lt;&#x2F;em&gt; on each exception. For full
generality, this payload can actually take the form of a &lt;em&gt;list&lt;&#x2F;em&gt; of
values; i.e., it is a full product type (struct type).&lt;&#x2F;p&gt;
&lt;p&gt;We alluded to &quot;tags&quot; above but didn&#x27;t describe them in detail. These
tags are key to the payload definition: each tag is effectively a type
definition that specifies its list of payload value types as
well. (Technically, in the Wasm AST, a tag definition names a
&lt;em&gt;function type&lt;&#x2F;em&gt; with only parameters, no returns, which is a nice way
of reusing an existing entity&#x2F;concept.) Now we show how they are
defined with a sample module:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(module&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; ;; Define a &amp;quot;tag&amp;quot;, which serves to define the specific kind of exception&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; ;; and specify its payload values.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; (tag $t (param i32 i64))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; (func $f (param i32 i64)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       ;; Throw an exception, to be caught by whatever handler is &amp;quot;closest&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       ;; dynamically.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       (throw $t (local.get 0) (local.get 1)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; (func $g (result i32 i64)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       (block $b (result i32 i64)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;; Run a body below, with the given handlers (catch-clauses)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;; in-scope to catch any matching exceptions.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;; Here, if an exception with tag `$t` is thrown within the body,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;; control is transferred to the end of block `$b` (as if we had&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;; branched to it), with the payload values for that exception&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              ;; pushed to the operand stack.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              (try_table (catch $t $b)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         (call $f (i32.const 1) (i64.const 2)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              (i32.const 3)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              (i64.const 4))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Here we&#x27;ve defined one tag (the Wasm text format lets us attach a name
&lt;code&gt;$t&lt;&#x2F;code&gt;, but in the binary format it is only identified by its index, 0),
with two payload values. We can throw an exception with this tag given
values of these types (as in function &lt;code&gt;$f&lt;&#x2F;code&gt;) and we can catch it if we
specify a catch destination as the end of a block meant to return
exactly those types as well. Here, if function &lt;code&gt;$g&lt;&#x2F;code&gt; is invoked, an
exception carrying payload values &lt;code&gt;1&lt;&#x2F;code&gt; and &lt;code&gt;2&lt;&#x2F;code&gt; will be thrown and
then caught by the &lt;code&gt;try_table&lt;&#x2F;code&gt;; the results of
&lt;code&gt;$g&lt;&#x2F;code&gt; will be &lt;code&gt;1&lt;&#x2F;code&gt; and &lt;code&gt;2&lt;&#x2F;code&gt;. (The values &lt;code&gt;3&lt;&#x2F;code&gt; and &lt;code&gt;4&lt;&#x2F;code&gt; are present to allow
the Wasm module to validate, i.e. have correct types, but they are
dynamically unreachable because of the throw in &lt;code&gt;$f&lt;&#x2F;code&gt; and will not be
returned.)&lt;&#x2F;p&gt;
&lt;p&gt;This is an instance where Wasm, being a bytecode, can afford to
generalize a bit relative to real-metal ISAs and offer conveniences to
the Wasm producer (i.e., toolchain generating Wasm modules). In this
sense, it is a little more like a compiler IR. In contrast, most other
exception-throw ABIs have a fixed definition of payload, e.g., one or
two machine register-sized values. In practice some producers might
choose a small fixed signature for all exception tags anyway, but
there is no reason to impose such an artificial limit if there is a
compiler and runtime behind the Wasm in any case.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;unwind-cleanup-and-destructors&quot;&gt;Unwind, Cleanup, and Destructors&lt;&#x2F;h3&gt;
&lt;p&gt;So far, we&#x27;ve seen how Wasm&#x27;s primitives can allow for basic exception
throws and catches, but what about languages with scoped resources,
e.g. C++ with its destructors? If one writes something like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span&gt; Scoped&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;    Scoped&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;() {}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;    ~Scoped&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;() {&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; cleanup&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;void&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; f&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Scoped&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; s&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    throw&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; my_exception&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;then the &lt;code&gt;throw&lt;&#x2F;code&gt; should transfer control out of &lt;code&gt;f&lt;&#x2F;code&gt; and upward to
whatever handler matches, but the destructor of &lt;code&gt;s&lt;&#x2F;code&gt; still needs to run
and call &lt;code&gt;cleanup&lt;&#x2F;code&gt;. This is not quite a &quot;catch&quot; because we don&#x27;t want
to terminate the search: we aren&#x27;t actually handling the error
condition.&lt;&#x2F;p&gt;
&lt;p&gt;The usual approach to compiling such a program is to &quot;catch and
rethrow&quot;. That is, the program is lowered to something like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;try&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    throw&lt;&#x2F;span&gt;&lt;span&gt; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; catch_any&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;e&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;    cleanup&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    rethrow e&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where &lt;code&gt;catch_any&lt;&#x2F;code&gt; catches &lt;em&gt;any&lt;&#x2F;em&gt; exception propagating past this point
on the stack, and &lt;code&gt;rethrow&lt;&#x2F;code&gt; re-throws the same exception.&lt;&#x2F;p&gt;
&lt;p&gt;Wasm&#x27;s exception primitives provide exactly the pieces we need for
this: a &lt;code&gt;catch_all_ref&lt;&#x2F;code&gt; clause, which &lt;em&gt;catches all exceptions&lt;&#x2F;em&gt; and
&lt;em&gt;boxes the caught exception as a reference&lt;&#x2F;em&gt;; and a &lt;code&gt;throw_ref&lt;&#x2F;code&gt;
instruction, which &lt;em&gt;re-throws a previously-caught exception&lt;&#x2F;em&gt;.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#5&quot;&gt;5&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In actuality there is a two-by-two matrix of &quot;catch&quot; options: we can
&lt;code&gt;catch&lt;&#x2F;code&gt; a specific tag or &lt;code&gt;catch_all&lt;&#x2F;code&gt;; and we can catch and
immediately unpack the exception into its payload values (as we saw
above), or we can catch it as a reference. So we have &lt;code&gt;catch&lt;&#x2F;code&gt;,
&lt;code&gt;catch_ref&lt;&#x2F;code&gt;, &lt;code&gt;catch_all&lt;&#x2F;code&gt;, and &lt;code&gt;catch_all_ref&lt;&#x2F;code&gt;.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#6&quot;&gt;6&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
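&lt;p&gt;As a concrete sketch of the catch-and-rethrow pattern with these
primitives (the function names &lt;code&gt;$may_throw&lt;&#x2F;code&gt; and &lt;code&gt;$cleanup&lt;&#x2F;code&gt; here are
illustrative, not from a real module), a cleanup region can be written
in the Wasm text format roughly as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(block $done&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (block $unwind (result exnref)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (try_table (catch_all_ref $unwind)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      (call $may_throw))  ;; protected body&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (br $done))           ;; normal path: skip the cleanup&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ;; caught: the boxed exception reference is on the stack&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (call $cleanup)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (throw_ref))            ;; re-throw to continue unwinding&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;catch_all_ref&lt;&#x2F;code&gt; clause branches to &lt;code&gt;$unwind&lt;&#x2F;code&gt; with the caught
exception boxed as an &lt;code&gt;exnref&lt;&#x2F;code&gt; value, and &lt;code&gt;throw_ref&lt;&#x2F;code&gt; resumes the
search for a handler further up the stack.&lt;&#x2F;p&gt;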
&lt;h3 id=&quot;dynamic-identity-and-compositionality&quot;&gt;Dynamic Identity and Compositionality&lt;&#x2F;h3&gt;
&lt;p&gt;There is one final detail to the Wasm proposal, and in fact it&#x27;s the
part that I find the most interesting and unique. Given the above
introduction, and any familiarity with exception systems in other
language semantics and&#x2F;or runtime systems, one might expect that the
&quot;tags&quot; identifying kinds of exceptions and matching throws with
particular catch handlers would be static labels. In other words, if I
throw an exception with tag &lt;code&gt;$tA&lt;&#x2F;code&gt;, then the first handler for &lt;code&gt;$tA&lt;&#x2F;code&gt;
anywhere up the stack, from any module, should catch it.&lt;&#x2F;p&gt;
&lt;p&gt;However, one of Wasm&#x27;s most significant properties as a bytecode is
its emphasis on isolation. It has a distinction between static
&lt;em&gt;modules&lt;&#x2F;em&gt; and dynamic &lt;em&gt;instances&lt;&#x2F;em&gt; of those modules, and modules have
no &quot;static members&quot;: every entity (e.g., memory, table, or global
variable) defined by a module is replicated per instance of that
module. This creates a clean separation between instances and means
that, for example, one can freely reuse a common module (say, some
kind of low-level glue or helper module) with separate instances in
many places without them somehow communicating or interfering with
each other.&lt;&#x2F;p&gt;
&lt;p&gt;Consider what happens if we have an instance A that invokes some other
(dynamically provided) function reference which ultimately invokes a
callback in A. Say that the instance throws an exception from within
its callback in order to unwind all the way to its outer stack frames,
across the intermediate functions in some other Wasm instance(s):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                A.f   ---------call---------&amp;gt;   B.g   --------call---------&amp;gt;    A.callback&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 ^                                                                  v&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               catch $t                                                           throw $t&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 |                                                                  |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 `----------------------------&amp;lt;-------------------------------------&amp;#39;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The instance A expects that the exception that it throws from its
callback function to &lt;code&gt;f&lt;&#x2F;code&gt; is a &lt;em&gt;local&lt;&#x2F;em&gt; concern to that instance only,
and that B cannot interfere. After all, if the exception tag is
defined inside A, and Wasm preserves modularity, then B should not be
able to name that tag to catch exceptions by that tag, even if it also
uses exception handling internally. The two modules should not
interact: that is the meaning of modularity, and it permits us to
reason about each instance&#x27;s behavior locally, with the effects of
&quot;the rest of the world&quot; confined to imports and exports.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, if one designed a straightforward &quot;static&quot; tag-matching
scheme, this might not be the case if B were an instance of the same
module as A: in that case, if B also used a tag &lt;code&gt;$t&lt;&#x2F;code&gt; internally and
registered handlers for that tag, it could interfere with the desired
throw&#x2F;catch behavior, and violate modularity.&lt;&#x2F;p&gt;
&lt;p&gt;So the Wasm exception handling standard specifies that tags have
&lt;em&gt;dynamic instances&lt;&#x2F;em&gt; as well, just as memories, tables and globals
do. (Put in programming-language theory terms, tags are &lt;em&gt;generative&lt;&#x2F;em&gt;.)
Each instance of a module creates its own dynamic identities for the
statically-defined tags in those modules, and uses those dynamic
identities to tag exceptions and find handlers. This means that no
matter what instance B is, above, if instance A does not export its
tag &lt;code&gt;$t&lt;&#x2F;code&gt; for B to import, there is no way for B to catch the thrown
exception explicitly (it can still catch &lt;em&gt;all&lt;&#x2F;em&gt; exceptions, and it may
do so and rethrow to perform some cleanup). Local modular reasoning is
restored.&lt;&#x2F;p&gt;
&lt;p&gt;Once we have tags as dynamic entities, just like Wasm memories, we can
take the same approach that we do for the other entities to allow them
to be imported and exported. Thus, visibility of exception payloads
and ability for modules to catch certain exceptions is completely
controlled by the instantiation graph and the import&#x2F;export linking,
just as for all other Wasm storage.&lt;&#x2F;p&gt;
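&lt;p&gt;To make this concrete, here is a sketch of tag linking in the text
format (the module and export names are illustrative): one module
defines and exports a tag, and another imports it, so both refer to
the same &lt;em&gt;dynamic&lt;&#x2F;em&gt; tag identity and can match each other&#x27;s throws:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; Module A: defines and exports a tag.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(module&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (tag $t (export &quot;my-exn&quot;) (param i32)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; Module B: imports the tag from A; a catch for $t here&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; matches exceptions thrown with the dynamic tag instance&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;;; belonging to the particular instance of A it is linked to.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(module&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (import &quot;A&quot; &quot;my-exn&quot; (tag $t (param i32))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Without such an import, no tag in B can ever match A&#x27;s throws, no
matter how the modules are written.&lt;&#x2F;p&gt;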
&lt;p&gt;This is surprising (or at least was to me)! It creates some pretty
unique implementation challenges in the unwinder -- in essence, it
means that we need to know about instance identity for each stack
frame, not just static code location and handler list.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;compiling-exceptions-in-cranelift&quot;&gt;Compiling Exceptions in Cranelift&lt;&#x2F;h2&gt;
&lt;p&gt;Before we implement the primitives for exception handling in Wasmtime,
we need to support exceptions in our underlying compiler backend,
Cranelift.&lt;&#x2F;p&gt;
&lt;p&gt;Why should this be a compiler concern? What is special about
exceptions that makes them different from, say, new Wasm instructions
that implement additional mathematical operators (when we already have
many arithmetic operators in the IR), or Wasm memories (when we
already have loads&#x2F;stores in the IR)?&lt;&#x2F;p&gt;
&lt;p&gt;In brief, the complexities come in three flavors: new kinds of control
flow, fundamentally different than ordinary branches or calls in that
they are &quot;externally actuated&quot; (by the unwinder); a new facet of the
ABI (that we get to define!) that governs how the unwinder interacts
with compiled code; and interactions between the &quot;scoped&quot; nature of
handlers and inlining in particular. We&#x27;ll talk about each below.&lt;&#x2F;p&gt;
&lt;p&gt;Note that much of this discussion started with an
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;36&quot;&gt;RFC&lt;&#x2F;a&gt; for
Wasmtime&#x2F;Cranelift, which had been posted way back in August of 2024
by Daniel Hillerstrom with help from my colleague Nick Fitzgerald, and
was discussed then; many of the choices within were subsequently
refined as I discovered interesting nuances during implementation and
we talked them through.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;control-flow&quot;&gt;Control Flow&lt;&#x2F;h3&gt;
&lt;p&gt;There are a few ways to think about exception handlers from the point
of view of compiler &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Intermediate_representation&quot;&gt;IR (intermediate
representation)&lt;&#x2F;a&gt;.
First, let&#x27;s recognize that exception handling (i) is a form of
control flow, and (ii) has all the same implications for the various
compiler stages that other kinds of control flow do. For example, the register
allocator has to consider how to get registers into the right state
whenever control moves from one basic block to the next (&quot;edge
moves&quot;); exception catches are a new kind of edge, and so the regalloc
needs to be aware of that, too.&lt;&#x2F;p&gt;
&lt;p&gt;One could see every call or other opcode that could throw as having
regular control-flow edges to every possible handler that could
match. I&#x27;ll call this the &quot;regular edges&quot; approach. The upside is that
it&#x27;s pretty simple to retrofit: one &quot;only&quot; needs to add new kinds of
control-flow opcodes that have out-edges, but that&#x27;s already a kind of
thing that IRs have. The disadvantage is that, in functions with a lot
of possible throwing opcodes and&#x2F;or handlers, the overhead can get
quite high. And control-flow graph overhead is a bad kind of overhead:
many analyses&#x27; runtimes are heavily dependent on the edge and node (basic
block) counts, sometimes superlinearly.&lt;&#x2F;p&gt;
&lt;p&gt;The other major option is to build a kind of &lt;em&gt;implicit&lt;&#x2F;em&gt; new control
flow into the IR&#x27;s semantics. For example, one could lower the
source-language semantics of a &quot;try block&quot; down to regions in the IR,
with one set of handlers attached.  This is clearly more efficient
than adding out-edges from (say) every callsite within the try-block
to every handler in scope. On the other hand, it&#x27;s hard to overstate
how invasive this change would be. This means that &lt;em&gt;every&lt;&#x2F;em&gt; traversal
over IR, analyzing dataflow or reachability or any other property, has
to consider these new implicit edges anyway. In a large established
compiler like Cranelift, we can lean on Rust&#x27;s type system for a lot
of different kinds of refactors, but changing a fundamental invariant
goes beyond that: we would likely have a long tail of issues stemming
from such a change, and it would permanently increase the cognitive
overhead of making new changes to the compiler. In general we want to
trend toward a smaller, simpler core and compositional rather than
entangled complexity.&lt;&#x2F;p&gt;
&lt;p&gt;Thus, the choice is clear: in Cranelift we opted to introduce one new
instruction, &lt;code&gt;try_call&lt;&#x2F;code&gt;, that calls a function and catches (some)
exceptions.  In other words, there are now two possible kinds of
return paths: a normal return or (possibly one of many) exceptional
return(s). The handled exceptions and block targets are enumerated in
an &lt;em&gt;exception table&lt;&#x2F;em&gt;. Because there are control-flow edges stemming
from this opcode, it is a block terminator, like a conditional
branch. It looks something like (in Cranelift&#x27;s IR, CLIF):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function %f0(i32) -&amp;gt; i32, f32, f64 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    sig0 = (i32) -&amp;gt; f32 tail&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    fn0 = %g(i32) -&amp;gt; f32 tail&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block0(v1: i32):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v2 = f64const 0x1.0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; exception-catching callsite&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        try_call fn0(v1), block1(ret0, v2), [ tag0: block2(exn0), default: block3(exn0) ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; normal return path&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block1(v3: f32, v4: f64):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v5 = iconst.i32 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return v5, v3, v4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; exception handler for tag0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block2(v6: i64):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v7 = ireduce.i32 v6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v8 = iadd_imm.i32 v7, 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v9 = f32const 0x0.0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return v8, v9, v2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; exception handler for all other exceptions&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block3(v10: i64):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v11 = ireduce.i32 v10&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v12 = f32const 0x0.0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v13 = f64const 0x0.0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return v11, v12, v13&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There are a few aspects to note here. First, why are we only concerned
with calls? What about other sources of exceptions? This is an
important invariant in the IR: exception &lt;em&gt;throws&lt;&#x2F;em&gt; are &lt;em&gt;only externally
sourced&lt;&#x2F;em&gt;. In other words, if an exception has been thrown, if we go
deep enough into the callstack, we will find that that throw was
implemented by calling out into the runtime.  The IR itself has no
other opcodes that throw! This turns out to be sufficient: (i) we only
need to build what Wasmtime needs, here, and (ii) we can implement
Wasm&#x27;s throw opcodes as &quot;libcalls&quot;, or calls into the Wasmtime
runtime. So, within Cranelift-compiled code, exception throws always
happen at callsites. We can thus get away with adding only one opcode,
&lt;code&gt;try_call&lt;&#x2F;code&gt;, and attach handler information directly to that opcode.&lt;&#x2F;p&gt;
&lt;p&gt;The next characteristic of note is that handlers are ordinary basic
blocks.  This may not seem remarkable unless one has seen other
compiler IRs, such as LLVM&#x27;s, where exception handlers are definitely
special: they start with &quot;landing pad&quot; instructions, and cannot be
branched to as ordinary basic blocks. That might look something like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function %f() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; Callsite defining a return value `v0`, with normal&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; return path to `block1` and exception handler `block2`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v0 = try_call ..., block1, [ tag0: block2 ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; Normal return; use returned value.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return v0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block2 exn_handler: ;; Specially-marked block!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; Exception handler payload value.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v1 = exception_landing_pad&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This bifurcation of kinds of blocks (normal and exception handler) is
undesirable from our point of view: just as exceptional edges add a
new cross-cutting concern that every analysis and transform needs to
consider, so would new kinds of blocks with restrictions. It was an
explicit design goal (and we have tests that demonstrate it!) that the same
block can be both an ordinary block and a handler block -- not because
that would be common, necessarily (handlers usually do very different
things than normal code paths), but because it&#x27;s one less weird quirk
of the IR.&lt;&#x2F;p&gt;
&lt;p&gt;But then if handlers are normal blocks, the data flow question becomes
very interesting. An exception-catching call, unlike every other
opcode in our IR, has &lt;em&gt;conditionally-defined values&lt;&#x2F;em&gt;: that is, its
normal function return value(s) are available only if the callee
returns normally, and the &lt;em&gt;exception payload value(s)&lt;&#x2F;em&gt;, which are
passed in from the unwinder and carry information about the caught
exception, are available only if the callee throws an exception that
we catch. How can we ensure that these values are represented such
that they can only be used in valid ways? We can&#x27;t make them all
regular SSA definitions of the opcode: that would mean that all
successors (regular return and exceptional) get to use them, as in:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function %f() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; Callsite defining a return value `v0`, with normal return path&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; to `block1` and exception handler `block2`.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        v0 = try_call ..., block1, [ tag0: block2 ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; Use `v0` legally: it is defined on normal return.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return v0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; Oops! We use `v0` here, but the normal return value is undefined&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        ;; when an exception is caught and control reaches this handler block.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        return v0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is the reason that a compiler may choose to make handler blocks
special: by bifurcating the universe of blocks, one ensures that
normal-return and exceptional-return values are used only where
appropriate. Some compiler IRs reify exceptional return payloads via
&quot;landing pad&quot; instructions that must start handler blocks, just as
phis start regular blocks (in phi- rather than blockparam-based
SSA). But, again, this bifurcation is undesirable.&lt;&#x2F;p&gt;
&lt;p&gt;Our insight here, after &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;36&quot;&gt;a lot of
discussion&lt;&#x2F;a&gt;, was to
put the definitions where they belong: &lt;em&gt;on the edges&lt;&#x2F;em&gt;. That is,
regular returns are only defined once we know we&#x27;re following the
regular-return edge, and likewise for exception payloads. But we don&#x27;t
want to have special instructions that must be in the successor
blocks: that&#x27;s a weird distributed invariant and, again, likely to
lead to bugs when transforming IR. Instead, we leverage the fact that
we use &lt;em&gt;blockparam-based SSA&lt;&#x2F;em&gt; and we widen the domain of allowable
block-call arguments.&lt;&#x2F;p&gt;
&lt;p&gt;Whereas previously one might end a block like &lt;code&gt;brif v1, block2(v2, v3), block3(v4, v5)&lt;&#x2F;code&gt;, i.e. with blockparams assigned values in the
chosen successor via a list of value-uses in the branch, a block-call
argument may now be (i) an ordinary SSA value, (ii) a special &quot;normal
return value&quot; sentinel, or
(iii) a special &quot;exceptional return value&quot; sentinel. The latter two
are indexed because there can be more than one of each. So one can
write a block-call in a &lt;code&gt;try_call&lt;&#x2F;code&gt; as &lt;code&gt;block2(ret0, v1, ret1)&lt;&#x2F;code&gt;, which
passes the two return values of the call and a normal SSA value; or
&lt;code&gt;block3(exn0, exn1)&lt;&#x2F;code&gt;, which passes just the two exception payload
values.  We do have a new well-formedness check on the IR that ensures
that (i) normal returns are used only in the normal-return blockcall,
and exception payloads are used only in the handler-table blockcalls;
(ii) normal returns&#x27; indices are bounded by the signature; and (iii)
exception payloads&#x27; indices are bounded by the ABI&#x27;s number of
exception payload values; but all of these checks are local to the
instruction, not distributed across blocks. That&#x27;s nice, and conforms
with the way that all of our other instructions work, too. (Block-call
argument types are then checked against block-parameter types in the
successor block, but that happens the same as for any branch.) So we
have, repeating from above, a callsite like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    block1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        try_call fn0(v1), block2(ret0), [ tag0: block3(exn0, exn1) ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;with all of the desired properties: only one kind of block, explicit
control flow, and SSA values defined only where they are legal to use.&lt;&#x2F;p&gt;
&lt;p&gt;All of this may seem somewhat obvious in hindsight, but as attested by
the above GitHub discussions and Cranelift weekly meeting minutes, it
was far from clear when we started how to design all of this to
maximize simplicity and generality and minimize quirks and
footguns. I&#x27;m pretty happy with our final design: it feels like a
natural extension of our core blockparam-SSA control flow graph, and I
managed to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10510&quot;&gt;put it into the
compiler&lt;&#x2F;a&gt;
without too much trouble at all (well, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10502&quot;&gt;a
few&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10485&quot;&gt;PRs&lt;&#x2F;a&gt; and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10555&quot;&gt;associated&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10554&quot;&gt;fixes&lt;&#x2F;a&gt; to
Cranelift
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;214&quot;&gt;and&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;220&quot;&gt;regalloc2&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;216&quot;&gt;functionality&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;224&quot;&gt;and testing&lt;&#x2F;a&gt;;
and I&#x27;m sure I&#x27;ve missed a few).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;data-flow-and-abi&quot;&gt;Data Flow and ABI&lt;&#x2F;h3&gt;
&lt;p&gt;So we have defined an IR that can express exception handlers -- what
about the interaction between this function body and the unwinder? We
will need to define a different kind of semantics to nail down that
interface: in essence, it is a property of the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Application_Binary_Interface&quot;&gt;ABI (Application
Binary
Interface)&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;As mentioned above, exception-handling ABIs already exist for native
code, such as compiled C++. While we are certainly willing to draw
inspiration from native ABIs and align with them as much as makes
sense, in Wasmtime we already define our own ABI&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#7&quot;&gt;7&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;, and so we are
not necessarily constrained by existing standards.&lt;&#x2F;p&gt;
&lt;p&gt;In particular, there is a very good reason we would prefer not to: to
unwind to a particular exception handler, register state must be
restored as specified in the ABI, and the standard Itanium ABI
requires the usual callee-saved (&quot;non-volatile&quot;) registers on the
target ISA to be restored. But this requires (i) having the register
state at time of throw, and (ii) processing unwind metadata at each
stack frame as we walk up the stack, reading out values of saved
registers from stack frames. The latter is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;2710&quot;&gt;already
supported&lt;&#x2F;a&gt;
with a generic &quot;unwind pseudoinstruction&quot; framework I built four years
ago, but would still add complexity to our unwinder, and this
complexity would be load-bearing for correctness; and the former is
extremely difficult with Wasmtime&#x27;s normal runtime-entry
trampolines. So we instead choose to have a simpler exception ABI: all
&lt;code&gt;try_call&lt;&#x2F;code&gt;s, that is, callsites with handlers, clobber &lt;em&gt;all&lt;&#x2F;em&gt;
registers. This means that the compiler&#x27;s ordinary register-allocation
behavior will save all live values to the stack and restore them on
either a normal or exceptional return. We only have to restore the
stack (stack pointer and frame pointer registers) and redirect the
program counter (PC) to a handler.&lt;&#x2F;p&gt;
&lt;p&gt;The other aspect of the ABI that matters to the exception-throw
unwinder is exceptional payload. The native Itanium ABI specifies two
registers on most platforms (e.g.: &lt;code&gt;rax&lt;&#x2F;code&gt; and &lt;code&gt;rdx&lt;&#x2F;code&gt; on x86-64, or &lt;code&gt;x0&lt;&#x2F;code&gt;
and &lt;code&gt;x1&lt;&#x2F;code&gt; on aarch64) to carry runtime-defined payload; so for
simplicity, we adopt the same convention.&lt;&#x2F;p&gt;
&lt;p&gt;That&#x27;s all well and good; now how do we implement &lt;code&gt;try_call&lt;&#x2F;code&gt; with the
appropriate register-allocator behavior to conform to this? We already
have fairly complex ABI handling
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;abi.rs&quot;&gt;machine-independent&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;x64&#x2F;abi.rs&quot;&gt;and&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;aarch64&#x2F;abi.rs&quot;&gt;five&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;s390x&#x2F;abi.rs&quot;&gt;different&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;riscv64&#x2F;abi.rs&quot;&gt;architecture&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;pulley_shared&#x2F;abi.rs&quot;&gt;implementations&lt;&#x2F;a&gt;)
in Cranelift, but it follows a general pattern: we generate a single
instruction at the register-allocator level, and emit uses and defs
with fixed-register constraints. That is, we tell regalloc that
parameters must be in certain registers (e.g., &lt;code&gt;rdi&lt;&#x2F;code&gt;, &lt;code&gt;rsi&lt;&#x2F;code&gt;, &lt;code&gt;rcx&lt;&#x2F;code&gt;,
&lt;code&gt;rdx&lt;&#x2F;code&gt;, &lt;code&gt;r8&lt;&#x2F;code&gt;, &lt;code&gt;r9&lt;&#x2F;code&gt; on x86-64 System-V calling-convention platforms, or
&lt;code&gt;x0&lt;&#x2F;code&gt; up to &lt;code&gt;x7&lt;&#x2F;code&gt; on aarch64 platforms) and let it handle any necessary
moves. So in the simplest case, a call on aarch64 might look like this,
with register-allocator uses&#x2F;defs and constraints annotated:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;bl (call) v0 [def, fixed(x0)], v1 [use, fixed(x0)], v2 [use, fixed(x1)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It is not always this simple, however: calls are not actually always a
single instruction, and this turned out to be quite problematic for
exception-handling support. In particular, when values are returned in
memory, as the ABI specifies they must be when there are more return
values than registers, we add (or added, prior to this work!) load
instructions &lt;em&gt;after&lt;&#x2F;em&gt; the call to load the extra results from their
locations on the stack. So a callsite might generate instructions like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;bl v0 [def, fixed(x0)], ..., v7 [def, fixed(x7)] # first eight return values&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ldr v8, [sp]     # ninth return value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ldr v9, [sp, #8] # tenth return value&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and so on. This is problematic simply because we said that the
&lt;code&gt;try_call&lt;&#x2F;code&gt; was a terminator; and it is at the IR level, but no longer
at the regalloc level, and regalloc expects correctly-formed
control-flow graphs as well. So I had to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10502&quot;&gt;do a
refactor&lt;&#x2F;a&gt; to
merge these return-value loads into a single regalloc-level
pseudoinstruction, and in turn this cascaded into a few regalloc fixes
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;226&quot;&gt;allowing more than 256
operands&lt;&#x2F;a&gt; and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;214&quot;&gt;more aggressively splitting live-ranges to allow worst-case
allocation&lt;&#x2F;a&gt;,
plus a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;220&quot;&gt;fix to the live range-splitting
fix&lt;&#x2F;a&gt; and a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;216&quot;&gt;fuzzing
improvement&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;There is one final question that might arise when considering the
interaction of exception handling and register allocation in
Cranelift-compiled code. In Cranelift, we have an invariant that the
register allocator is allowed to insert &lt;em&gt;moves&lt;&#x2F;em&gt; between any two
instructions -- register-to-register, or loads or stores to&#x2F;from
spill-slots in the stack frame, or moves between different spill-slots
-- and indeed it does this whenever there is more state than fits in
registers. It also needs to insert &lt;em&gt;edge moves&lt;&#x2F;em&gt; &quot;between&quot; blocks,
because when jumping to another spot in the code, we might need the
register values in a differently-assigned configuration. When we have
an unwinder that jumps to a different spot in the code to invoke a
handler, we need to ensure that all the proper moves have executed so
the state is as expected.&lt;&#x2F;p&gt;
&lt;p&gt;The answer here turns out to be a careful argument that we don&#x27;t need
to do anything at all. (That&#x27;s the best kind of solution to a problem,
but only if one is correct!) The crux of the argument has to do with
critical edges. A critical edge is one from a block with multiple
successors to one with multiple predecessors: for example, in the graph&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   A    D&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F; \  &#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; B   C&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where A can jump to B or C, and D can also jump to C, then A-to-C is a
critical edge. The problem with critical edges is that there is
nowhere to put code that has to run on the transition from A to C (it
can&#x27;t go in A, because we may go to B or C; and it can&#x27;t go in C,
because we may have come from A or D). So the register allocator
prohibits them, and we &quot;split&quot; them when generating code by inserting
empty blocks (&lt;code&gt;e&lt;&#x2F;code&gt; below) on them:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   A    D&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &#x2F; \   |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; |   e  |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; |   \ &#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt; B    C&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
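The critical-edge rule is mechanical enough to sketch in code. Here is a minimal illustration (not Cranelift's or regalloc2's actual implementation): an edge is critical when its source has multiple successors and its target has multiple predecessors, and splitting inserts a fresh empty block on each such edge.

```rust
// Minimal sketch of critical-edge detection and splitting.
// Blocks are numbered 0..num_blocks; edges are (from, to) pairs.
// An illustration of the concept, not Cranelift's code.
fn split_critical_edges(num_blocks: usize, edges: Vec<(usize, usize)>) -> Vec<(usize, usize)> {
    let mut succs = vec![0usize; num_blocks];
    let mut preds = vec![0usize; num_blocks];
    for &(from, to) in &edges {
        succs[from] += 1;
        preds[to] += 1;
    }
    let mut next_block = num_blocks; // fresh block numbers for inserted `e` blocks
    let mut result = Vec::new();
    for (from, to) in edges {
        if succs[from] > 1 && preds[to] > 1 {
            // Critical edge: insert an empty block so edge moves have
            // a home that runs only on this particular transition.
            result.push((from, next_block));
            result.push((next_block, to));
            next_block += 1;
        } else {
            result.push((from, to));
        }
    }
    result
}
```

With the graph above (A=0, B=1, C=2, D=3), only the A-to-C edge is critical, and it gains a new block `e`.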
&lt;p&gt;The key insight is that a &lt;code&gt;try_call&lt;&#x2F;code&gt; always has more than one
successor as long as it has a handler (because it must always have a
normal return-path successor too)&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#8&quot;&gt;8&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;; and in this case, because we
split critical edges, the immediate successor block on the
exception-catch path has only one predecessor. So the register
allocator can always put its moves that have to run on catching an
exception in the successor (handler) block rather than the predecessor
block. Our rule for where to put edge moves prefers the successor
(block &quot;after&quot; the edge) unless it has multiple in-edges, so this was
already the case. The only thing we have to be careful about is to
record the address of the &lt;em&gt;inserted edge block&lt;&#x2F;em&gt;, if any (&lt;code&gt;e&lt;&#x2F;code&gt; above),
rather than the IR-level handler block (&lt;code&gt;C&lt;&#x2F;code&gt; above), in the handler
table.&lt;&#x2F;p&gt;
&lt;p&gt;And that&#x27;s pretty much it, as far as register allocation is concerned!&lt;&#x2F;p&gt;
&lt;p&gt;We&#x27;ve now covered the basics of Cranelift&#x27;s exception support. At this
point, having landed the compiler half but not the Wasmtime half, I
context-switched away for a bit, and in the meantime, bjorn3 picked
this support up right away as a means to add panic-unwinding support
to
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rust-lang&#x2F;rustc_codegen_cranelift&quot;&gt;&lt;code&gt;rustc_codegen_cranelift&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
the Cranelift-based Rust compiler backend. With &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10593&quot;&gt;a few small
changes&lt;&#x2F;a&gt; they
contributed, and a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10709&quot;&gt;followup edge-case
fix&lt;&#x2F;a&gt; and a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10609&quot;&gt;refactor&lt;&#x2F;a&gt;,
panic-unwinding support in &lt;code&gt;rustc_codegen_cranelift&lt;&#x2F;code&gt; was working. That
was very good intermediate validation that what I had built was usable
and relatively solid.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;exceptions-in-wasmtime&quot;&gt;Exceptions in Wasmtime&lt;&#x2F;h2&gt;
&lt;p&gt;We have a compiler that supports exceptions; we understand Wasm
exception semantics; let&#x27;s build support into Wasmtime! How hard could
it be?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;challenge-1-garbage-collection-interactions&quot;&gt;Challenge 1: Garbage Collection Interactions&lt;&#x2F;h3&gt;
&lt;p&gt;I started by sketching out the codegen for each of the three opcodes
(&lt;code&gt;try_table&lt;&#x2F;code&gt;, &lt;code&gt;throw&lt;&#x2F;code&gt;, and &lt;code&gt;throw_ref&lt;&#x2F;code&gt;). My mental model at the very
beginning of this work, having read but not fully internalized the
Wasm exception-handling proposal, was that I would be able to
implement a &quot;basic&quot; throw&#x2F;catch first, and then somehow build the
&lt;code&gt;exnref&lt;&#x2F;code&gt; objects later. And I had figured I could build &lt;code&gt;exnref&lt;&#x2F;code&gt;s in a
(in hindsight) somewhat hacky way, by aggregating values together in a
kind of tuple and creating a table of such tuples indexed by exnrefs,
just as Wasmtime does for externrefs.&lt;&#x2F;p&gt;
&lt;p&gt;This understanding quickly gave way to a deeper one when I realized a
few things:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Exception objects (exnrefs) can carry references to other GC objects
(that is, GC types can be part of the payload signature of an
exception), and GC objects can store exnrefs in fields. Hence,
exnrefs need to be traced, and can participate in GC cycles; this
either implies an additional collector on top of our GC collector
(ugh) or means that exception objects need to be on the GC heap
when GC is enabled.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We&#x27;ll need a host API to introspect and build exception objects, and
we already have nice host APIs for GC objects.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;There was a question &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11230&quot;&gt;in an extensively-discussed
PR&lt;&#x2F;a&gt; whether
we could build a cheap &quot;subset&quot; implementation that doesn&#x27;t mandate
the existence of a GC heap for storing exception objects. This would
be great in theory for guests that use exceptions for C-level
setjmp&#x2F;longjmp but use no other GC features. However, it&#x27;s a little
tricky for a few reasons. First, this would require the subset to
exclude &lt;code&gt;throw_ref&lt;&#x2F;code&gt; (so we don&#x27;t have to invent another kind of
exception object storage). But it&#x27;s not great to subset the spec --
and &lt;code&gt;throw_ref&lt;&#x2F;code&gt; is not just for GC guest languages, but also for
rethrows. Second, more generally, this is additional maintenance and
testing surface that we&#x27;d rather not have for now. Instead, we expect
that we can make GC cheap enough, and its growth heuristic smart
enough, that a &quot;frequent setjmp&#x2F;longjmp&quot; stress-test of exceptions (for
example) should live within a very small (e.g., few-kilobyte) GC heap,
essentially approximating the purpose-built storage. My colleague Nick
Fitzgerald (who built and is driving improvements to Wasmtime&#x27;s GC
support) wrote up &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;11256&quot;&gt;a nice
issue&lt;&#x2F;a&gt;
describing the tradeoffs and ideas we have.&lt;&#x2F;p&gt;
&lt;p&gt;All of that said, we&#x27;ll only build one exception object implementation
-- great! -- but it will have to be a new kind of GC object. This
spawned a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11230&quot;&gt;large
PR&lt;&#x2F;a&gt; to build
out exception objects first, prior to actual support for throwing and
catching them, with host APIs to allocate them and inspect their
fields. In essence, they are structs with immutable fields, a
less-exposed type lattice, and no subtyping.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;challenge-2-generative-tags-and-dynamic-identity&quot;&gt;Challenge 2: Generative Tags and Dynamic Identity&lt;&#x2F;h3&gt;
&lt;p&gt;So there I was, implementing the &lt;code&gt;throw&lt;&#x2F;code&gt; instruction&#x27;s libcall
(runtime implementation), and finally getting to the heart of the
matter: the unwinder itself, which walks stack frames to find a
matching exception handler.  This is the final bit of functionality
that ties it all together. We&#x27;re almost there!&lt;&#x2F;p&gt;
&lt;p&gt;But wait: check out that &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;webassembly.github.io&#x2F;spec&#x2F;core&#x2F;exec&#x2F;instructions.html#xref-syntax-instructions-syntax-instr-control-mathsf-throw-x&quot;&gt;spec
language&lt;&#x2F;a&gt;.
We load the &quot;tag address&quot; from the store in step 9: we allocate the
exception instance &lt;code&gt;{tag z.tags[x], fields val^n}&lt;&#x2F;code&gt;. What is this
&lt;code&gt;tags&lt;&#x2F;code&gt; array on the store (&lt;code&gt;z&lt;&#x2F;code&gt;) in the runtime semantics? Tags have
dynamic identity, not static identity! (This is the part where I
learned about the thing I described
&lt;a href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;blog&#x2F;2025&#x2F;11&#x2F;06&#x2F;exceptions&#x2F;#dynamic-identity-and-compositionality&quot;&gt;above&lt;&#x2F;a&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;This was a problem, because I had defined exception tables to
associate handlers with tags identified by integer (&lt;code&gt;u32&lt;&#x2F;code&gt;),
like most other entities in Cranelift IR. I had figured this would
be sufficient to let Wasmtime define indices (say: the index of the tag in
the module), and then we could compare static tag IDs.&lt;&#x2F;p&gt;
&lt;p&gt;Perhaps this is no problem: the static index defines the entity ID in
the module (defined or imported tag), and we can compare that and the
instance ID to see if a handler is a match. But how do we get the
instance ID from the stack frame?&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that Wasmtime didn&#x27;t have a way, because nothing had
needed that yet. (This deficiency had been noticed before when
implementing Wasm coredumps, but there hadn&#x27;t been enough reason or
motivation to fix it then.) So I &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;11285&quot;&gt;filed an
issue&lt;&#x2F;a&gt; with
a few ideas. We could add a new field in every frame storing the
instance pointer -- and in fact this is a simple version of what at
least one other production Wasm implementation, in the SpiderMonkey
web engine,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;firefox-main&#x2F;rev&#x2F;643d732886fe0de4e2a3eee3c5ed9bd0d47c77cf&#x2F;js&#x2F;src&#x2F;wasm&#x2F;WasmFrame.h#112-115&quot;&gt;does&lt;&#x2F;a&gt;
(though as described in that &lt;code&gt;[SMDOC]&lt;&#x2F;code&gt; comment, it only stores
instance pointers on transitions between frames of different
instances; this is enough for the unwinder when walking linearly up
the stack). But that would add overhead to &lt;em&gt;every&lt;&#x2F;em&gt; Wasm function (or
with SpiderMonkey&#x27;s approach, require adding trampolines between
instances, which would be a large change for Wasmtime), and exception
handling is still used somewhat rarely in practice.  Ideally we&#x27;d have
a &quot;pay-as-you-go&quot; scheme with as little extra complexity as possible.&lt;&#x2F;p&gt;
&lt;p&gt;Instead, I came up with an idea to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11321&quot;&gt;add &quot;dynamic context&quot; items to
exception handler
lists&lt;&#x2F;a&gt;. The
idea is that we inject an SSA value into the list and it is stored in
a stack location that is given in the handler table metadata, so the
stack-walker can find it. To Cranelift, this is some arbitrary opaque
value; Wasmtime will use it to store the raw instance pointer
(&lt;code&gt;vmctx&lt;&#x2F;code&gt;) for use by the unwinder.&lt;&#x2F;p&gt;
&lt;p&gt;This filled out the design to a more general state nicely: it is
symmetric with exception payload, in the sense that the compiled code
can communicate context or state &lt;em&gt;to&lt;&#x2F;em&gt; the unwinder as it reads the
frames, and the unwinder in turn can communicate data &lt;em&gt;to&lt;&#x2F;em&gt; the
compiled code when it unwinds.&lt;&#x2F;p&gt;
&lt;p&gt;It turns out -- though I didn&#x27;t intend this at all at the time -- that
this also nicely solves the &lt;em&gt;inlining problem&lt;&#x2F;em&gt;. In brief, we want all
of our IR to be &quot;local&quot;, not treating the function boundary specially;
this way, IR can be composed by the inliner without anything
breaking. Storing some &quot;current instance&quot; state for the whole function
will, of course, break when we inline a function from one module
(hence instance) into another!&lt;&#x2F;p&gt;
&lt;p&gt;Instead, we can give a nice operational semantics to handler tables
with dynamic-context items: the unwinder should read left-to-right,
updating its &quot;current dynamic context&quot; at each dynamic-context item,
and checking for a tag match at tag-handler items. Then the inliner
can &lt;em&gt;compose&lt;&#x2F;em&gt; exception tables: when a &lt;code&gt;try_call&lt;&#x2F;code&gt; callsite inlines a
function body as its callee, and that body itself has any other
callsites, we attach a handler table that simply concatenates the
exception table items.&lt;&#x2F;p&gt;
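As a sketch of these operational semantics (with hypothetical types; this is an illustration, not Cranelift's actual data structures): the unwinder scans items left-to-right, updating a "current context" at each dynamic-context item, and a tag handler matches only under the right context. Here I model a thrown exception's dynamic tag identity as the pair of its defining instance's context value and its tag index.

```rust
// Illustrative left-to-right handler-table matching with
// dynamic-context items. Names and types are hypothetical.

#[derive(Clone, Copy, PartialEq)]
struct Context(u64); // e.g. a raw instance (vmctx) pointer

enum Item {
    /// Switch the current dynamic context for subsequent tag items.
    Context(Context),
    /// A handler for a tag index, interpreted relative to the current context.
    Tag { tag: u32, handler: usize },
    /// A catch-all handler.
    CatchAll { handler: usize },
}

/// Find the first matching handler for a throw of tag `tag` defined by
/// instance `ctx`. A real implementation would resolve the handler's
/// (current context, tag index) pair to a dynamic tag identity and
/// compare identities; modeling identity as the (ctx, tag) pair keeps
/// the sketch small.
fn find_handler(items: &[Item], ctx: Context, tag: u32) -> Option<usize> {
    let mut current: Option<Context> = None;
    for item in items {
        match item {
            Item::Context(c) => current = Some(*c),
            Item::Tag { tag: t, handler } => {
                if current == Some(ctx) && *t == tag {
                    return Some(*handler);
                }
            }
            Item::CatchAll { handler } => return Some(*handler),
        }
    }
    None
}
```

Composition by concatenation falls out naturally: when the inliner splices a callee's items after the caller's, each segment carries its own context item, so inner handlers are checked against the callee's instance and outer handlers against the caller's.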
&lt;p&gt;It&#x27;s important, here, to point out another surprising fact about Wasm
semantics: we &lt;em&gt;cannot do certain optimizations&lt;&#x2F;em&gt; to resolve handlers
statically or optimize the handler list, or at least not naively,
without global program analysis to understand where tags come
from. For example, if we see a handler for tag 0 then one for tag 1,
and we see a throw for tag 1 directly inside the &lt;code&gt;try_table&lt;&#x2F;code&gt;&#x27;s body, we
cannot necessarily resolve it: tag 0 and tag 1 could be the same tag!&lt;&#x2F;p&gt;
&lt;p&gt;Wait, how can that be? Well, consider &lt;em&gt;tag imports&lt;&#x2F;em&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(module&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (import &amp;quot;test&amp;quot; &amp;quot;e0&amp;quot; (tag $e0))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (import &amp;quot;test&amp;quot; &amp;quot;e1&amp;quot; (tag $e1))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (func ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (try_table&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   (catch $e0 $b0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   (catch $e1 $b1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   (throw $e1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   (unreachable))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We could instantiate this module with the same dynamic tag instance
for both imports, in which case the first handler (to block &lt;code&gt;$b0&lt;&#x2F;code&gt;)
matches; or with separate tags, in which case block &lt;code&gt;$b1&lt;&#x2F;code&gt; matches. The only way
to win the optimization game is not to play -- we have to preserve the
original handler list.  Fortunately, that makes the compiler&#x27;s job
easier. We transcribe the &lt;code&gt;try_table&lt;&#x2F;code&gt;&#x27;s handlers directly to Cranelift
exception-handler tables, and those directly to metadata in the
compiled module, read in exactly that order by the unwinder&#x27;s
handler-matching logic.&lt;&#x2F;p&gt;
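To make the aliasing concrete, here is a hypothetical model (not Wasmtime's actual representation) in which each instantiated tag gets a store-level identity, and a module instance maps its static tag indices to those identities. Whether two static indices alias is only decided at instantiation time:

```rust
use std::collections::HashMap;

// Hypothetical model of dynamic tag identity: the store hands out
// tag ids at instantiation, and each instance maps its static tag
// indices to them. Not Wasmtime's actual representation.

type TagId = u32; // identity of an instantiated tag in the store

struct Instance {
    tags: HashMap<u32, TagId>, // static index -> dynamic identity
}

/// Two static tag indices alias iff they resolve to the same dynamic tag.
fn tags_alias(inst: &Instance, a: u32, b: u32) -> bool {
    inst.tags.contains_key(&a) && inst.tags.get(&a) == inst.tags.get(&b)
}
```

In the module above, a static analysis sees only indices `$e0` and `$e1`; only the instantiation-time map decides whether they name one tag or two.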
&lt;h3 id=&quot;challenge-3-rooting&quot;&gt;Challenge 3: Rooting&lt;&#x2F;h3&gt;
&lt;p&gt;Since exception objects are GC-managed objects, we have to ensure that
they are properly &lt;em&gt;rooted&lt;&#x2F;em&gt;: that is, any handles to these objects
outside of references inside other GC objects need to be known to the
GC so the objects remain alive (and so the references are updated in
the case of a moving GC).&lt;&#x2F;p&gt;
&lt;p&gt;Within a Wasm-to-Wasm exception throw scenario, this is fairly easy:
the references are rooted in the compiled code on either side of the
control-flow transfer, and the reference only briefly passes through
the unwinder. As long as we are careful to handle it with the
appropriate types, all will work fine.&lt;&#x2F;p&gt;
&lt;p&gt;Passing exceptions across the host&#x2F;Wasm boundary is another matter,
though. We support the full matrix of {host, Wasm} x {host, Wasm}
exception catch&#x2F;throw pairs: that is, exceptions can be thrown from
native host code called by Wasm (via a Wasm import), and exceptions
can be thrown out of Wasm code and returned as a kind of error to the
host code that invoked the Wasm. This works by boxing the exception
inside an &lt;code&gt;anyhow::Error&lt;&#x2F;code&gt; so we use Rust-style value-based error
propagation (via &lt;code&gt;Result&lt;&#x2F;code&gt; and the &lt;code&gt;?&lt;&#x2F;code&gt; operator) in host code.&lt;&#x2F;p&gt;
&lt;p&gt;What happens when we have a value inside the &lt;code&gt;Error&lt;&#x2F;code&gt; that holds an
exception object in the Wasmtime &lt;code&gt;Store&lt;&#x2F;code&gt;? How does Wasmtime know this
is rooted?&lt;&#x2F;p&gt;
&lt;p&gt;The answer in Wasmtime prior to recent work was to use one of two
kinds of external rooting wrappers: &lt;code&gt;Rooted&lt;&#x2F;code&gt; and
&lt;code&gt;ManuallyRooted&lt;&#x2F;code&gt;. Both wrappers hold an index into a table contained
inside the &lt;code&gt;Store&lt;&#x2F;code&gt;, and that table contains the actual GC
reference. This allows the GC to easily see the roots and update them.&lt;&#x2F;p&gt;
&lt;p&gt;The difference lies in the lifetime disciplines: &lt;code&gt;ManuallyRooted&lt;&#x2F;code&gt;
requires, as the name implies, manual unrooting; it has no &lt;code&gt;Drop&lt;&#x2F;code&gt;
implementation, and so easily creates leaks. &lt;code&gt;Rooted&lt;&#x2F;code&gt;, on the other
hand, has a LIFO (last-in first-out) discipline based on a &lt;code&gt;Scope&lt;&#x2F;code&gt;, an
RAII type created by the embedder (user) of Wasmtime. &lt;code&gt;Rooted&lt;&#x2F;code&gt; GC
references that escape that dynamic scope are unrooted, and will cause
an error (panic) at runtime if used. Neither of those behaviors is
ideal for a value type -- an exception -- that is &lt;em&gt;meant&lt;&#x2F;em&gt; to escape
scopes via &lt;code&gt;?&lt;&#x2F;code&gt;-propagation.&lt;&#x2F;p&gt;
&lt;p&gt;The design that we landed on, instead, takes a different and much
simpler approach: the &lt;code&gt;Store&lt;&#x2F;code&gt; has a single, explicit root slot for the
&quot;pending exception&quot;, and host code can set this and then return a
&lt;em&gt;sentinel value&lt;&#x2F;em&gt; (&lt;code&gt;wasmtime::ThrownException&lt;&#x2F;code&gt;) in the &lt;code&gt;Result&lt;&#x2F;code&gt;&#x27;s error
type (boxed up into an &lt;code&gt;anyhow::Error&lt;&#x2F;code&gt;). This easily allows
propagation to work as expected, with no unbounded leaks (there is
only one pending exception that is rooted) and no unrooted propagating
exceptions (because no actual GC reference propagates, only the
sentinel).&lt;&#x2F;p&gt;
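Schematically, the pattern looks like the following. The types here (&lt;code&gt;Store&lt;&#x2F;code&gt;, &lt;code&gt;ExnRef&lt;&#x2F;code&gt;, &lt;code&gt;ThrownException&lt;&#x2F;code&gt;) are simplified stand-ins rather than Wasmtime's real API surface; the point is the shape of the control flow: the GC reference parks in the store's single root slot while only the sentinel travels through the &lt;code&gt;Result&lt;&#x2F;code&gt;.

```rust
// Schematic model of the "single pending-exception root slot" design.
// All types are stand-ins, not Wasmtime's real ones.

#[derive(Debug, PartialEq)]
struct ExnRef(u32); // stand-in for a GC reference to an exception object

#[derive(Debug, PartialEq)]
struct ThrownException; // sentinel error value; carries no GC reference

#[derive(Default)]
struct Store {
    // The one root the GC always knows about for in-flight exceptions.
    pending_exception: Option<ExnRef>,
}

impl Store {
    fn throw(&mut self, exn: ExnRef) -> ThrownException {
        self.pending_exception = Some(exn);
        ThrownException
    }
    fn take_pending_exception(&mut self) -> Option<ExnRef> {
        self.pending_exception.take()
    }
}

// A host function propagates via value-based errors: the sentinel can
// flow through `Result` and `?` freely, while the GC reference stays
// rooted in the store the whole time.
fn host_import(store: &mut Store) -> Result<(), ThrownException> {
    let exn = ExnRef(42); // imagine allocating an exception object here
    Err(store.throw(exn))
}
```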
&lt;p&gt;As a side-quest, while thinking through this rooting dilemma, I also
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;11445&quot;&gt;realized&lt;&#x2F;a&gt;
that it &lt;em&gt;should&lt;&#x2F;em&gt; be possible to create an &quot;owned&quot; rooted reference
that behaves more like a conventional owned Rust value (e.g. &lt;code&gt;Box&lt;&#x2F;code&gt;);
hence &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11514&quot;&gt;&lt;code&gt;OwnedRooted&lt;&#x2F;code&gt; was born to replace
&lt;code&gt;ManuallyRooted&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;.
This type works without requiring access to the &lt;code&gt;Store&lt;&#x2F;code&gt; to unroot when
dropped; the key idea is to hold a refcount to a separate tiny
allocation that is used as a &quot;drop flag&quot;, and then have the store
periodically scan these drop-flags and lazily remove roots, with a
thresholding algorithm to give that scanning amortized linear-time
behavior.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#9&quot;&gt;9&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
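A stripped-down sketch may make the drop-flag idea concrete; &lt;code&gt;RootTable&lt;&#x2F;code&gt; and &lt;code&gt;OwnedRoot&lt;&#x2F;code&gt; here are illustrative stand-ins, not Wasmtime's types. Each handle shares a tiny flag allocation with the root table; &lt;code&gt;Drop&lt;&#x2F;code&gt; flips the flag without needing access to the store, and a periodic sweep (in the real design, gated by a threshold so the scanning cost is amortized) retains only live roots.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Stripped-down sketch of the drop-flag rooting scheme.
struct OwnedRoot {
    dropped: Arc<AtomicBool>,
}

impl Drop for OwnedRoot {
    fn drop(&mut self) {
        // No store access needed here: just flip the shared flag.
        self.dropped.store(true, Ordering::Release);
    }
}

struct RootTable {
    // (GC reference stand-in, drop flag shared with the handle).
    roots: Vec<(u32, Arc<AtomicBool>)>,
}

impl RootTable {
    fn new() -> Self {
        RootTable { roots: Vec::new() }
    }

    fn root(&mut self, gc_ref: u32) -> OwnedRoot {
        let flag = Arc::new(AtomicBool::new(false));
        self.roots.push((gc_ref, flag.clone()));
        OwnedRoot { dropped: flag }
    }

    /// Lazily remove dead roots. A real implementation runs this only
    /// after enough handles have accumulated since the last sweep,
    /// giving amortized linear-time behavior.
    fn sweep(&mut self) {
        self.roots.retain(|(_, flag)| !flag.load(Ordering::Acquire));
    }

    fn live_roots(&self) -> Vec<u32> {
        self.roots.iter().map(|(r, _)| *r).collect()
    }
}
```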
&lt;p&gt;Now that we have this, in theory, we could pass an
&lt;code&gt;OwnedRooted&amp;lt;ExnRef&amp;gt;&lt;&#x2F;code&gt; directly in the &lt;code&gt;Error&lt;&#x2F;code&gt; type to propagate
exceptions through host code; but the store-rooted approach is simple
enough, has a marginal performance advantage (no separate allocation),
and so I don&#x27;t see a strong need to change the API at the moment.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;life-of-an-exception-quick-walkthrough&quot;&gt;Life of an Exception: Quick Walkthrough&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we&#x27;ve discussed all the design choices, let&#x27;s walk through
the life of an exception throw&#x2F;catch, from start to finish. Let&#x27;s assume
a Wasm-to-Wasm throw&#x2F;catch for simplicity here.&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;First, the Wasm program is executing within a &lt;code&gt;try_table&lt;&#x2F;code&gt;, which
results in exception-handler &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;translate&#x2F;code_translator.rs#L609-L613&quot;&gt;catch blocks being
created&lt;&#x2F;a&gt;
for each handler case listed in the &lt;code&gt;try_table&lt;&#x2F;code&gt; instruction. The
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;translate&#x2F;code_translator.rs#L4325&quot;&gt;&lt;code&gt;create_catch_block&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
function generates code that invokes
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;func_environ&#x2F;gc&#x2F;enabled.rs#L415&quot;&gt;&lt;code&gt;translate_exn_unbox&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which reads out all of the fields from the exception object and
pushes them onto the Wasm operand stack in the handler path. This
handler block is registered in the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;translate&#x2F;stack.rs#L570&quot;&gt;&lt;code&gt;HandlerState&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which tracks the current lexical stack of handlers (and hands out
checkpoints so that when we pop out of a Wasm block-type operator,
we can pop the handlers off the state as well). These handlers are
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;translate&#x2F;stack.rs#L611&quot;&gt;provided as an
iterator&lt;&#x2F;a&gt;
which is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;translate&#x2F;code_translator.rs#L661&quot;&gt;passed to the &lt;code&gt;translate_call&lt;&#x2F;code&gt;
method&lt;&#x2F;a&gt;
and eventually ends up &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;func_environ.rs#L2366-L2379&quot;&gt;creating an exception
table&lt;&#x2F;a&gt;
on a &lt;code&gt;try_call&lt;&#x2F;code&gt; instruction. This &lt;code&gt;try_call&lt;&#x2F;code&gt; will invoke whatever
Wasm code is about to throw the exception.&lt;&#x2F;li&gt;
&lt;li&gt;Then, the Wasm program reaches a &lt;code&gt;throw&lt;&#x2F;code&gt; opcode, which is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;translate&#x2F;code_translator.rs#L621&quot;&gt;translated&lt;&#x2F;a&gt;
via
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;func_environ.rs#L2825&quot;&gt;&lt;code&gt;FuncEnvironment::translate_exn_throw&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
to a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;func_environ&#x2F;gc&#x2F;enabled.rs#L473-L482&quot;&gt;three-operation
sequence&lt;&#x2F;a&gt;
that fetches the current instance ID (via a libcall into the
runtime), allocates a new exception object with that instance ID and
a fixed tag number and fills in its slots with the given values
popped from the Wasm operand stack, and delegates to &lt;code&gt;throw_ref&lt;&#x2F;code&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;The &lt;code&gt;throw_ref&lt;&#x2F;code&gt; opcode implementation then invokes the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;cranelift&#x2F;src&#x2F;func_environ&#x2F;gc&#x2F;enabled.rs#L519&quot;&gt;&lt;code&gt;throw_ref&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
libcall.&lt;&#x2F;li&gt;
&lt;li&gt;This libcall is deceptively simple: its
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;libcalls.rs#L1695-L1707&quot;&gt;implementation&lt;&#x2F;a&gt;
sets the pending exception on the store, and returns a sentinel that
signals a pending exception. That&#x27;s it!&lt;&#x2F;li&gt;
&lt;li&gt;This works because the glue code for &lt;em&gt;all&lt;&#x2F;em&gt; libcalls processes errors
(via the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;traphandlers.rs#L152&quot;&gt;&lt;code&gt;HostResult&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
trait implementations) and eventually reaches &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;traphandlers.rs#L773-L786&quot;&gt;this
case&lt;&#x2F;a&gt;
which sees a pending exception sentinel and invokes
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;throw.rs#L15&quot;&gt;&lt;code&gt;compute_handler&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;. Now
we&#x27;re getting to the heart of the exception-throw implementation.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;compute_handler&lt;&#x2F;code&gt; walks the stack with
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;throw.rs#L45&quot;&gt;&lt;code&gt;Handler::find&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which itself is based on
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;stackwalk.rs#L248&quot;&gt;&lt;code&gt;visit_frames&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which does about what one would expect for code with a frame-pointer
chain: it walks the singly-linked list of frames. At each frame, the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;throw.rs#L51&quot;&gt;closure&lt;&#x2F;a&gt;
that &lt;code&gt;compute_handler&lt;&#x2F;code&gt; gave to &lt;code&gt;Handler::find&lt;&#x2F;code&gt; looks up the program
counter in that frame (which will be a return address, i.e., the
instruction after the call that created the next lower frame) using
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;module&#x2F;registry.rs#L74&quot;&gt;&lt;code&gt;lookup_module_by_pc&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
to find a &lt;code&gt;Module&lt;&#x2F;code&gt;, which itself has an
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;exception_table.rs#L225&quot;&gt;&lt;code&gt;ExceptionTable&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
(a parser for serialized metadata produced during compilation from
Cranelift metadata) that knows how to look up a PC within a
module. This will produce an &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;exception_table.rs#L310&quot;&gt;&lt;code&gt;Iterator&lt;&#x2F;code&gt; over
handlers&lt;&#x2F;a&gt;
which we &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;throw.rs#L63&quot;&gt;test in
order&lt;&#x2F;a&gt;
to see if any match. (The groups of exception-handler table items
that come out of Cranelift are post-processed
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;exception_table.rs#L144&quot;&gt;here&lt;&#x2F;a&gt;
to generate the tables that the above routines search.)&lt;&#x2F;li&gt;
&lt;li&gt;If we find a handler, that is, if &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;throw.rs#L108-L109&quot;&gt;the dynamic tag instance is the
same&lt;&#x2F;a&gt;
or &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;throw.rs#L67&quot;&gt;we reach a catch-all
handler&lt;&#x2F;a&gt;,
then we have an exception handler! We return the PC and SP to
restore
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;throw.rs#L112-L120&quot;&gt;here&lt;&#x2F;a&gt;,
computing SP via an FP-to-SP offset (i.e., the size of the frame),
which is fixed and included in the exception tables when we
construct them.&lt;&#x2F;li&gt;
&lt;li&gt;That action then becomes an &lt;code&gt;UnwindState::UnwindToWasm&lt;&#x2F;code&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;traphandlers.rs#L779&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;li&gt;
&lt;li&gt;This &lt;code&gt;UnwindToWasm&lt;&#x2F;code&gt; state then triggers &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;wasmtime&#x2F;src&#x2F;runtime&#x2F;vm&#x2F;traphandlers.rs#L913-L933&quot;&gt;this
case&lt;&#x2F;a&gt;
in the &lt;code&gt;unwind&lt;&#x2F;code&gt; libcall, which is invoked whenever any libcall
returns an error code; that eventually calls the no-return function
&lt;code&gt;resume_to_exception_handler&lt;&#x2F;code&gt;, which is a little function written in
inline assembly that does exactly what it says on the tin. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;arch&#x2F;x86.rs#L32-L34&quot;&gt;These
three
instructions&lt;&#x2F;a&gt;
set &lt;code&gt;rsp&lt;&#x2F;code&gt; and &lt;code&gt;rbp&lt;&#x2F;code&gt; to their new values, and jump to the new &lt;code&gt;rip&lt;&#x2F;code&gt;
(PC). The same stub exists for each of our four native-compilation
architectures (x86-64 above,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;arch&#x2F;aarch64.rs#L60-L62&quot;&gt;aarch64&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;arch&#x2F;riscv64.rs#L29-L31&quot;&gt;riscv64&lt;&#x2F;a&gt;,
and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;8e22ff89f6affe4f79fcebdb416d0ab401d43c97&#x2F;crates&#x2F;unwinder&#x2F;src&#x2F;arch&#x2F;s390x.rs#L32-L33&quot;&gt;s390x&lt;&#x2F;a&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#10&quot;&gt;10&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;).
That transfers control to the catch-block created above, and the
Wasm continues running, unboxing the exception payload and running
the handler!&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
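The handler search at the heart of this walkthrough can be modeled in a few lines. The sketch below is a simplified, hypothetical model (the names `Frame`, `HandlerEntry`, and `find_handler` are invented for illustration; the real logic lives in Wasmtime's `unwinder` crate and the `throw.rs` closures linked above): walk the frame-pointer chain, and for each frame's return PC, scan a table of handler entries, computing the resume SP from the fixed FP-to-SP offset on a match.

```rust
// Hypothetical, simplified model of exception-handler search. The real
// implementation walks actual frames and parses serialized exception
// tables; here both are plain slices, and names are invented.

#[derive(Clone, Copy)]
struct Frame {
    fp: usize,        // frame pointer for this frame
    return_pc: usize, // PC of the instruction after the call
}

#[derive(Clone, Copy)]
struct HandlerEntry {
    call_pc: usize,    // return PC of the `try_call` site
    handler_pc: usize, // where to resume on catch
    fp_to_sp: usize,   // fixed frame size: SP = FP - fp_to_sp
    tag: Option<u32>,  // None = catch-all handler
}

/// Returns (resume PC, resume SP) for the first matching handler, if any.
fn find_handler(
    frames: &[Frame],
    table: &[HandlerEntry],
    thrown_tag: u32,
) -> Option<(usize, usize)> {
    for frame in frames {
        for entry in table {
            // A catch-all matches any tag; a tag-specific handler must
            // match the thrown exception's tag exactly.
            let tag_matches = entry.tag.map_or(true, |t| t == thrown_tag);
            if entry.call_pc == frame.return_pc && tag_matches {
                return Some((entry.handler_pc, frame.fp - entry.fp_to_sp));
            }
        }
    }
    None // no handler in any frame: the exception escapes to the host
}
```

The returned pair is exactly what the architecture-specific `resume_to_exception_handler` stub consumes: a PC to jump to and an SP to restore (with FP restored from the matched frame).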
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;So we have Wasm exception handling now! For all of the interesting
design questions we had to work through, the end was pretty
anticlimactic. I landed &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11326&quot;&gt;the final
PR&lt;&#x2F;a&gt;, and
after a follow-up cleanup PR
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11467&quot;&gt;1&lt;&#x2F;a&gt;) and
some fuzzbug fixes
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11500&quot;&gt;1&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11507&quot;&gt;2&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11530&quot;&gt;3&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11531&quot;&gt;4&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11535&quot;&gt;5&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11564&quot;&gt;6&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11554&quot;&gt;7&lt;&#x2F;a&gt;) having
mostly to do with null-pointer handling and other edge cases in the
type system, plus one interaction with tail-calls (and a
separate&#x2F;pre-existing &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11689&quot;&gt;s390x ABI
bug&lt;&#x2F;a&gt; that it
uncovered), it has been basically stable. We pretty quickly got a few
user reports:
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;#narrow&#x2F;channel&#x2F;217117-cranelift&#x2F;topic&#x2F;Running.20lua.205.2E1.20wasi&#x2F;near&#x2F;535368031&quot;&gt;here&lt;&#x2F;a&gt;
it was reported as working for a Lua interpreter whose setjmp&#x2F;longjmp
inside Wasm is implemented on top of exceptions, and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;#narrow&#x2F;channel&#x2F;217126-wasmtime&#x2F;topic&#x2F;WebAssembly.20exceptions.20proposal.20is.20now.20implemented&#x2F;near&#x2F;536299383&quot;&gt;here&lt;&#x2F;a&gt;
it enabled Kotlin-on-Wasm to run and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;#narrow&#x2F;channel&#x2F;217126-wasmtime&#x2F;topic&#x2F;WebAssembly.20exceptions.20proposal.20is.20now.20implemented&#x2F;near&#x2F;543748960&quot;&gt;pass a large
testsuite&lt;&#x2F;a&gt;.
Not bad!&lt;&#x2F;p&gt;
&lt;!--
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;220: +82 -84
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;223: +5 -6
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;224: +16 -19
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10510: +4199 -423
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10609: +109 -100
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10709: +49 -5
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;226: +26 -10
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;221: +1 -1
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10571: +36 -24
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;225: +1 -1
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10590: +19 -5
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;227: +1 -1
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10747: +84 -5
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10748: +84 -5
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10919: +1472 -347
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11230: +2490 -191
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;231: +93 -12
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11321: +1771 -322
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11326: +2593 -523
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11467: +269 -227
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11500: +246 -116
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11507: +15 -1
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11511: +6 -2
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11514: +758 -531
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11530: +12 -2
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11531: +1 -0
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11533: +31 -20
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11535: +31 -1
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11554: +2 -0
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11564: +29 -5
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10485: +303 -203
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;212: +1 -2
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10502: +1338 -767
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;214: +7 -3
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;216: +56 -8
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10554: +15 -26
  https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10555: +13 -6
--&gt;
&lt;p&gt;All told, this took 37 PRs with a diff-stat of &lt;code&gt;+16264 -4004&lt;&#x2F;code&gt; (16KLoC
total) -- certainly not the &quot;small-to-medium-sized&quot; project I had
initially optimistically expected, but I&#x27;m happy we were able to build
it out and get it to a stable state relatively easily. It was a
rewarding journey in a different way than a lot of my past work
(mostly on the Cranelift side) -- where many of my past projects have
been very open-ended design or even research questions, here we
had the high-level shape already and all of the work was in designing
high-quality details and working out all the interesting interactions
with the rest of the system. I&#x27;m happy with how clean the IR design
turned out in particular, and I don&#x27;t think it would have done so
without the really excellent continual discussion with the rest of the
Cranelift and Wasmtime contributors (thanks to Nick Fitzgerald and
Alex Crichton in particular here).&lt;&#x2F;p&gt;
&lt;p&gt;As an aside: I am happy to see how, aside from use-cases for Wasm
exception handling, the exception support in Cranelift itself has been
useful too.  As mentioned above, &lt;code&gt;cg_clif&lt;&#x2F;code&gt; picked it up almost as soon
as it was ready; but then, as an unexpected and pleasant surprise,
Alex subsequently &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11592&quot;&gt;rewrote Wasmtime&#x27;s trap
unwinding&lt;&#x2F;a&gt; to
use Cranelift exception handlers in our entry trampolines rather than
a setjmp&#x2F;longjmp, as the latter have longstanding semantic
questions&#x2F;issues in Rust. This took &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;11629&quot;&gt;one more
intrinsic&lt;&#x2F;a&gt;,
which I implemented after discussing with Alex how best to expose
exception handler addresses to custom unwind logic without the full
exception unwinder, but was otherwise a pretty direct application of
&lt;code&gt;try_call&lt;&#x2F;code&gt; and our exception ABI.  General building blocks prove
generally useful, it seems!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to Alex Crichton and Nick Fitzgerald for providing feedback on
a draft of this post!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;To explain myself a bit, I underestimated the interactions of
exception handling with garbage collection (GC); I hadn&#x27;t
realized yet that &lt;code&gt;exnref&lt;&#x2F;code&gt;s were a full first-class value and
would need to be supported including in the host API. Also, it
turns out that exceptions can cross the host&#x2F;guest boundary, and
goodness knows that gets really fun really fast. I was &lt;em&gt;only&lt;&#x2F;em&gt;
off by a factor of two on the compiler side at least!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;From an implementation perspective, the dynamic, interprocedural
nature of exceptions is what makes them far more interesting,
and involved, than classical control flow such as conditionals,
loops, or calls! This is why we need a mechanism that involves
runtime data structures, &quot;stack walks&quot;, and lookup tables,
rather than simply generating a jump to the right place: the
target of an exception-throw can only be computed at runtime,
and we need a convention to transfer control with &quot;payload&quot; to
that location.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;For those so inclined, this is a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Monad_(functional_programming)&quot;&gt;&lt;em&gt;monad&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;,
and e.g. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Haskell&quot;&gt;Haskell&lt;&#x2F;a&gt;
implements the ability to have &quot;result or error&quot; types that
return from a sequence early via
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hackage.haskell.org&#x2F;package&#x2F;base-4.21.0.0&#x2F;docs&#x2F;Data-Either.html#t:Either&quot;&gt;&lt;code&gt;Either&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
explicitly describing the concept as such. The &lt;code&gt;?&lt;&#x2F;code&gt; operator
serves as the &quot;bind&quot; of the monad: it connects an
error-producing computation with a use of the non-error value,
returning the error directly if one is given instead.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
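The same short-circuiting shape is easy to see in a toy Rust example (not Wasmtime code; the functions here are invented for illustration): each fallible step either hands a value to the next step or returns the error immediately via `?`.

```rust
// Toy illustration of `?` as the "bind" of the error monad: each step
// produces a value for the next step, or short-circuits the whole
// sequence with the error. Function names are invented.

fn parse(s: &str) -> Result<i32, String> {
    s.trim().parse::<i32>().map_err(|e| e.to_string())
}

fn halve(n: i32) -> Result<i32, String> {
    if n % 2 == 0 { Ok(n / 2) } else { Err(format!("{n} is odd")) }
}

fn pipeline(s: &str) -> Result<i32, String> {
    let n = parse(s)?; // on Err, return it from `pipeline` immediately
    let h = halve(n)?; // likewise
    Ok(h + 1)
}
```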
&lt;div class=&quot;footnote-definition&quot; id=&quot;4&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;So named for the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;IA-64&quot;&gt;Intel Itanium
(IA-64)&lt;&#x2F;a&gt;, an instruction-set
architecture that happened to be the first ISA where this scheme was
implemented for C++, and is now essentially dead (before its time! woefully
misunderstood!) but for that legacy...&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;5&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;5&lt;&#x2F;sup&gt;
&lt;p&gt;It&#x27;s worth briefly noting here that the Wasm exception handling
proposal went through a somewhat twisty journey, with an earlier
variant (now called &quot;legacy exception handling&quot;) that shipped in
some browsers but was never standardized, and that handled rethrows
in a different way. In particular, that proposal did not offer
first-class exception object references that could be rethrown;
instead, it had an explicit &lt;code&gt;rethrow&lt;&#x2F;code&gt; instruction. I wasn&#x27;t
around for the early debates about this design, but in my
opinion, providing first-class exception object references that
can be plumbed around via ordinary dataflow is far nicer. It
also permits a simpler implementation, as long as one literally
implements the semantics by always allocating an exception
object.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#11&quot;&gt;11&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;6&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;6&lt;&#x2F;sup&gt;
&lt;p&gt;To be precise, because it may be a little surprising:
&lt;code&gt;catch_ref&lt;&#x2F;code&gt; pushes both the payload values &lt;em&gt;and&lt;&#x2F;em&gt; the exception
reference onto the operand stack at the handler destination. In
essence, the rule is: tag-specific variants always unpack the
payloads; and &lt;em&gt;also&lt;&#x2F;em&gt;, &lt;code&gt;_ref&lt;&#x2F;code&gt; variants always push the exception
reference.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;7&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;7&lt;&#x2F;sup&gt;
&lt;p&gt;In particular, we have defined our own ABI in Wasmtime to allow
universal tail calls between any two signatures to work, as
required by the Wasm tail-calling opcodes. This ABI, called
&quot;&lt;code&gt;tail&lt;&#x2F;code&gt;&quot;, is based on the standard System V calling convention
but differs in that the callee cleans up any stack arguments.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;8&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;8&lt;&#x2F;sup&gt;
&lt;p&gt;It&#x27;s not compiler hacking without excessive trouble from
edge-cases, of course, so we had one &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10709&quot;&gt;interesting
bug&lt;&#x2F;a&gt;
from the &lt;em&gt;empty handler-list&lt;&#x2F;em&gt; case, which means we have to force
edge-splitting anyway for all &lt;code&gt;try_call&lt;&#x2F;code&gt;s.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;9&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;9&lt;&#x2F;sup&gt;
&lt;p&gt;Of course, while doing this, I managed to create
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;security&#x2F;advisories&#x2F;GHSA-vvp9-h8p2-xwfc&quot;&gt;CVE-2025-61670&lt;&#x2F;a&gt;
in the C&#x2F;C++ API by a combination of (i) a simple typo in the C
FFI bindings (&lt;code&gt;as&lt;&#x2F;code&gt; vs. &lt;code&gt;from&lt;&#x2F;code&gt;, which is important when
transferring ownership!) and (ii) not realizing that the C++
wrapper does not properly maintain single ownership. We didn&#x27;t
have ASAN tests, so I didn&#x27;t see this upfront; Alex discovered
the issue while updating the Python bindings (which quickly
found the leak) and managed the CVE. Sorry and thanks!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;10&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;10&lt;&#x2F;sup&gt;
&lt;p&gt;It turns out that even three lines of assembly are hard to get
right: the s390x variant &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;10973&quot;&gt;had a
bug&lt;&#x2F;a&gt;
where we got the register constraints wrong (GPR 0 is special on
s390x, and a branch-to-register can only take GPR 1--15; we
needed a different constraint to represent that) and had a
miscompilation as a result. Thanks to our resident s390x
compiler hacker Ulrich Weigand for tracking this down.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;11&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;11&lt;&#x2F;sup&gt;
&lt;p&gt;Of course, always boxing exceptions is not the only way to
implement the proposal. It should be possible to &quot;unbox&quot;
exceptions and skip the allocation, carrying payloads directly
through some other engine state, if they are not caught as
references. We haven&#x27;t implemented this optimization in Wasmtime,
and we expect the allocation performance for small exception
objects to be adequate for most use-cases.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Compilation of JavaScript to Wasm, Part 3: Partial Evaluation</title>
        <published>2024-08-28T00:00:00+00:00</published>
        <updated>2024-08-28T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2024/08/28/weval/"/>
        <id>https://cfallin.org/blog/2024/08/28/weval/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2024/08/28/weval/">&lt;p&gt;&lt;em&gt;This is the final post of a three-part series covering my work on
&quot;fast JS on Wasm&quot;; the &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;first
post&lt;&#x2F;a&gt; covered PBL, a portable
interpreter that supports inline caches, the &lt;a href=&quot;&#x2F;blog&#x2F;2024&#x2F;08&#x2F;27&#x2F;aot-js&#x2F;&quot;&gt;second
post&lt;&#x2F;a&gt; covered ahead-of-time compilation in
general terms, and this post discusses how we actually build the
ahead-of-time compiler backends. Please read the first two posts for
useful context!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;In the last post, we covered how one might perform ahead-of-time
compilation of JavaScript to low-level code -- WebAssembly bytecode --
in high-level terms. We discussed the two kinds of bytecode present in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;spidermonkey.dev&#x2F;&quot;&gt;SpiderMonkey&lt;&#x2F;a&gt;: bytecode for JavaScript
function bodies (as a 1-to-1 translation from the source JS) and
bytecode for the inline cache (IC) bodies that implement each operator
in those function bodies. We outlined the low-level code that the AOT
compilation of each bytecode would produce. But how do we actually
perform that compilation?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;direct-single-pass-compiler-style&quot;&gt;Direct (Single-Pass) Compiler Style&lt;&#x2F;h2&gt;
&lt;p&gt;It would be straightforward enough to implement a direct compiler from
each bytecode (JS and CacheIR) to Wasm. The most direct kind of
compiler is sometimes called a &quot;template compiler&quot; or (in the context
of JIT engines) a &quot;template JIT&quot;: for each source bytecode, emit a
fixed sequence of target code. This is a fairly common technique, and
if one examines the compiler tiers of the well-known JIT compilers
today, one finds implementations such as
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;43d431ff148b331b463fcf61e99c176e3d3c0fb4&#x2F;js&#x2F;src&#x2F;jit&#x2F;BaselineCodeGen.cpp#2764-2782&quot;&gt;this&lt;&#x2F;a&gt;
(implementation of &lt;code&gt;StrictEq&lt;&#x2F;code&gt; opcode in SpiderMonkey&#x27;s JS opcode
baseline compiler, emitting two pops, an IC invocation, and a push) or
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;2389dcceaad666ad1fb6acfd1520bbb4ea5a85e4&#x2F;winch&#x2F;codegen&#x2F;src&#x2F;visitor.rs#L1889-L1902&quot;&gt;this&lt;&#x2F;a&gt;
(implementation of a ternary &lt;code&gt;select&lt;&#x2F;code&gt; Wasm opcode in Wasmtime&#x27;s Winch
baseline compiler, emitting three pops, a compare, a conditional move,
and a push). SpiderMonkey already has baseline compiler
implementations for &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;jit&#x2F;BaselineCodeGen.cpp&quot;&gt;JS
opcodes&lt;&#x2F;a&gt;
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;jit&#x2F;BaselineCacheIRCompiler.cpp&quot;&gt;CacheIR
opcodes&lt;&#x2F;a&gt;,
abstracted over the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;43d431ff148b331b463fcf61e99c176e3d3c0fb4&#x2F;js&#x2F;src&#x2F;jit&#x2F;MacroAssembler.h#53&quot;&gt;MacroAssembler&lt;&#x2F;a&gt;,
allowing for reasonably easy retargeting to different ISAs. Why not
port these backends to produce Wasm bytecode?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;porting-spidermonkey-jit-to-wasm&quot;&gt;Porting SpiderMonkey JIT to Wasm?&lt;&#x2F;h2&gt;
&lt;p&gt;This approach &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bugzilla.mozilla.org&#x2F;show_bug.cgi?id=1863986&quot;&gt;has actually been
taken&lt;&#x2F;a&gt;, more or
less performing a &quot;direct port&quot; to Wasm by emitting Wasm bytecode
&quot;in-process&quot; (inside the Wasm module) and then using a special
hostcall interface to add this as a new callable function. This works
well enough where the hostcall approach is acceptable, but as I
discussed &lt;a href=&quot;&#x2F;blog&#x2F;2024&#x2F;08&#x2F;27&#x2F;aot-js&#x2F;#unique-characteristics-of-a-wasm-platform&quot;&gt;last
time&lt;&#x2F;a&gt;,
a few factors conspire against a direct port (i.e., the ISA target is
only part of the problem):&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For operational reasons (ahead-of-time deployment and instant start)
and security reasons (less attack surface, statically-knowable
code), Wasm-based platforms often disallow runtime code
generation. The &quot;add a new function&quot; hostcall&#x2F;hook might simply not
be available.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We might be running very short-lived request handlers or similar,
each starting from an identical snapshot: thus, there is no time to
&quot;warm up&quot;, generate appropriate Wasm bytecode at runtime, and load
and run it, even though execution may still be bottlenecked on (many
instances of) these short-lived request handlers.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;There is another practical reason as well. JITs are famously hard to
develop and debug because (i) the code that is actually running on the
system does not exist anywhere in the source -- it is ephemeral -- and
(ii) it is often tightly integrated with various low-level ABI details
and system invariants which can be easy to get wrong. On native
platforms, the difficulty of debugging subtle bugs in Firefox (and its
SpiderMonkey JIT engine) led Mozilla engineers to develop
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;rr-project.org&#x2F;&quot;&gt;rr&lt;&#x2F;a&gt;, the well-known open-source reversible
(time-travel) debugger. Now imagine developing a JIT that runs within
a Wasm module with runtime codegen hooks, &lt;em&gt;without&lt;&#x2F;em&gt; a state-of-the-art
debugging capability; in fact, in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;wasmtime.dev&quot;&gt;Wasmtime&lt;&#x2F;a&gt;,
which I help to develop, source-level debugging is improving slowly
but no debugger for Wasm bytecode-level execution exists at all
yet. (We hope to get there someday.) If I am to have any hope of
success, it seems like I will need to find another way to nail down
the correct semantics of the compiler&#x27;s output -- either by testing
and debugging in another (native) build somehow, or building better
tooling.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-new-hope-single-source-of-semantics&quot;&gt;A New Hope: Single Source of Semantics&lt;&#x2F;h2&gt;
&lt;p&gt;Here I come back to &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;PBL&lt;&#x2F;a&gt;: I
already have an
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;vm&#x2F;PortableBaselineInterpret.cpp&quot;&gt;interpreter&lt;&#x2F;a&gt;
that I painstakingly developed for both the CacheIR bodies of IC stubs
and the JS bytecode that invokes them, faithfully upholding all the
invariants of baseline-level execution in SpiderMonkey. In addition,
work on PBL can proceed in a native build as well -- in fact much of
my debugging was with the venerable &lt;code&gt;rr&lt;&#x2F;code&gt;, just as with any
native-platform SpiderMonkey development. PBL is &quot;just&quot; a portable C++
program, and so all of the work developing it on any platform then
transfers right to the Wasm build as well. PBL embodies a lot of
hard-won work encoding the correct semantics for each opcode,
sometimes not well-documented in the native backends -- for example,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;43d431ff148b331b463fcf61e99c176e3d3c0fb4&#x2F;js&#x2F;src&#x2F;vm&#x2F;PortableBaselineInterpret.cpp#4300-4302&quot;&gt;this
tidbit&lt;&#x2F;a&gt;
(far too much time to discover that wrinkle!), or &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;43d431ff148b331b463fcf61e99c176e3d3c0fb4&#x2F;js&#x2F;src&#x2F;vm&#x2F;PortableBaselineInterpret.cpp#1303-1304&quot;&gt;this
one&lt;&#x2F;a&gt;. What&#x27;s
more, these semantics sometimes change, or new opcodes are added.&lt;&#x2F;p&gt;
&lt;p&gt;Ideally we do not add more locations that encode these semantics than
absolutely necessary -- once per tier is already quite a lot. Can we
somehow develop (or -- major foreshadowing here -- &lt;em&gt;automatically
derive&lt;&#x2F;em&gt;) our compiler from all of the semantics written in direct
style in interpreter code?&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s take a look once more at (slightly simplified) implementations
of an &quot;integer add&quot; opcode on a hypothetical interpreted stack
machine, and then a baseline compiler implementation of that opcode
where the operand stack uses the native machine stack&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Interpreter                    | &#x2F;&#x2F; Compiler&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;...                               | ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;case Opcode::Add: {               | void visit_Add(Assembler&amp;amp; asm) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  uint32_t lhs = stack.pop();     |   asm.pop(r1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  uint32_t rhs = stack.pop();     |   asm.pop(r2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  uint32_t result = lhs + rhs;    |   asm.add(r1, r2);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  stack.push(result);             |   asm.push(r1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  break;                          | }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If you squint enough, you might almost forget whether you&#x27;re looking
at the interpreter (&lt;em&gt;directly&lt;&#x2F;em&gt; executing the semantics) or the
compiler (&lt;em&gt;indirectly&lt;&#x2F;em&gt; executing the semantics, by emitting
code). This correspondence is not accidental, and the observation that
doing-the-thing and emitting-code-to-do-the-thing are so close is at
the heart of what is sometimes called staged programming, with (again
given sufficient squinting) practical examples in Lisp macros,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.cambridge.org&#x2F;core&#x2F;books&#x2F;abs&#x2F;twolevel-functional-languages&#x2F;contents&#x2F;4DA77E9B08CF24C00221555FC241D1C9&quot;&gt;&quot;two-level functional
languages&quot;&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;258993.259019&quot;&gt;MetaML&lt;&#x2F;a&gt;, and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;2184319.2184345&quot;&gt;more
modern takes&lt;&#x2F;a&gt; that
strive to provide as transparent a syntax as possible. There are even
proposals to leverage this correspondence directly when writing JIT
compilers (see
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;blog.mozilla.org&#x2F;javascript&#x2F;2017&#x2F;10&#x2F;20&#x2F;holyjit-a-new-hope&#x2F;&quot;&gt;HolyJit&lt;&#x2F;a&gt;),
writing down the semantics once to provide both the interpreter and
the compiler tier.&lt;&#x2F;p&gt;
&lt;p&gt;Most prominently in this space, of course, is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;3062341.3062381&quot;&gt;GraalVM&lt;&#x2F;a&gt; (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;labs.oracle.com&#x2F;pls&#x2F;apex&#x2F;f?p=LABS:0::APPLICATION_PROCESS%3DGETDOC_INLINE:::DOC_ID:1014&quot;&gt;alternate
PDF&lt;&#x2F;a&gt;),
a production-grade JIT compiler framework that takes &lt;em&gt;interpreters&lt;&#x2F;em&gt;
for arbitrary languages and &lt;em&gt;partially evaluates&lt;&#x2F;em&gt; the interpreter
code itself, together with the user&#x27;s program, to produce compiled code. This
mind-bending technique is known as the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Partial_evaluation#Futamura_projections&quot;&gt;first Futamura
projection&lt;&#x2F;a&gt;. (This
is the approach we will take too, with some significant
&lt;a href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;blog&#x2F;2024&#x2F;08&#x2F;28&#x2F;weval&#x2F;#graalvm&quot;&gt;differences&lt;&#x2F;a&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;How does this work?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-first-futamura-projection&quot;&gt;The First Futamura Projection&lt;&#x2F;h2&gt;
&lt;p&gt;Let&#x27;s say that we have a function &lt;code&gt;f(x, y)&lt;&#x2F;code&gt;. If we take &lt;code&gt;x&lt;&#x2F;code&gt; as some
constant &lt;code&gt;C&lt;&#x2F;code&gt;, we should be able to derive a function &lt;code&gt;f_C(y) = f(C, y)&lt;&#x2F;code&gt; that eliminates all occurrences of &lt;code&gt;x&lt;&#x2F;code&gt; (i.e., &lt;code&gt;x&lt;&#x2F;code&gt; is no longer a
free variable). This is &quot;just&quot; constant propagation, or &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Partial_application&quot;&gt;partial
application&lt;&#x2F;a&gt;, of
the function.&lt;&#x2F;p&gt;
&lt;p&gt;A little more concretely, let&#x27;s take &lt;code&gt;f&lt;&#x2F;code&gt; to be an interpreter, with
inputs &lt;code&gt;x&lt;&#x2F;code&gt;, the program to be interpreted, and &lt;code&gt;y&lt;&#x2F;code&gt;, the input that the
interpreted program computes on. &lt;code&gt;f(x, y)&lt;&#x2F;code&gt; is then the result of
running the program. If we could find &lt;code&gt;f_C(y)&lt;&#x2F;code&gt;, then &lt;code&gt;f_C&lt;&#x2F;code&gt; would be,
somehow, a &lt;em&gt;combination&lt;&#x2F;em&gt; of the interpreter and the program it
interprets, merged together: a compiled program.&lt;&#x2F;p&gt;
&lt;p&gt;Futamura, in his &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;repository.kulib.kyoto-u.ac.jp&#x2F;dspace&#x2F;handle&#x2F;2433&#x2F;103401&quot;&gt;seminal
paper&lt;&#x2F;a&gt;,
calls this combination a &quot;projection&quot; and describes three levels of
projection. The first is as above: combining an interpreter with its
program. The second and third projections are far more exotic:
combining the partial evaluator itself with an interpreter, producing
a compiler (that can then be applied to a program); or combining the
partial evaluator with itself as input, producing a compiler-compiler
(that can then be applied to an interpreter, producing a compiler,
that can then be applied to an input program, producing compiled
output). Mind-bending stuff.&lt;&#x2F;p&gt;
&lt;p&gt;I can now tell you what &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;weval&quot;&gt;weval&lt;&#x2F;a&gt;, the
WebAssembly partial evaluator, is: it is a tool that derives the first
Futamura projection of an interpreter inside a snapshot of a
WebAssembly module together with its input (bytecode).&lt;&#x2F;p&gt;
&lt;p&gt;Alright, but how?! How does one &quot;combine&quot; an interpreter with its
input?&lt;&#x2F;p&gt;
&lt;p&gt;The thinking that gives rise to the Futamura projections is very
&lt;em&gt;algebraic&lt;&#x2F;em&gt; in nature. In conventional algebra (with addition and
multiplication over the reals, for example, as taught to students
everywhere), we have a notion of &quot;plugging in&quot; certain constants and
simplifying the expression. Given &lt;code&gt;z = 2x + 3y + 4&lt;&#x2F;code&gt;, we can hold &lt;code&gt;x&lt;&#x2F;code&gt;
constant, say &lt;code&gt;x = 10&lt;&#x2F;code&gt;, then &quot;partially evaluate&quot; &lt;code&gt;z = 2*10 + 3y + 4 = 3y + 24&lt;&#x2F;code&gt;. We have produced an output (expression) that fully
incorporates the new knowledge and has no further references to the
input (variable) &lt;code&gt;x&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;An interpreter and its interpreted program are far, far more complex
than an algebraic expression and a numeric constant, of course. Let&#x27;s
explore in a bit more detail how one might &quot;plug in&quot; a program to an
interpreter and simplify the result.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;practical-first-futamura-projection-via-context-specialization&quot;&gt;Practical First Futamura Projection via Context Specialization&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;a-naive-approach-constant-propagation&quot;&gt;A Naive Approach: Constant Propagation&lt;&#x2F;h3&gt;
&lt;p&gt;We might start with the observation that, in the compiler-optimization
sphere, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Constant_folding&quot;&gt;constant
folding&lt;&#x2F;a&gt; looks a lot
like the &quot;plug in a value and simplify&quot; mechanism that we know from
algebra. In essence, what we want to say is: take an interpreter,
specify that the program it interprets is a &lt;em&gt;constant&lt;&#x2F;em&gt;, and (somehow)
propagate the consequences of that through the code.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s take a look at an interpreter body that interprets a &lt;em&gt;single
opcode&lt;&#x2F;em&gt; first:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;switch&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt;pc&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Add&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;    uint32_t&lt;&#x2F;span&gt;&lt;span&gt; lhs &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; stack&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span&gt;sp&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;++&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;        &#x2F;&#x2F; stack pop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;    uint32_t&lt;&#x2F;span&gt;&lt;span&gt; rhs &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; stack&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span&gt;sp&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;++&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;        &#x2F;&#x2F; stack pop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;    uint32_t&lt;&#x2F;span&gt;&lt;span&gt; result &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; lhs &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;+&lt;&#x2F;span&gt;&lt;span&gt; rhs&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;       &#x2F;&#x2F; operation logic&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;    stack&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;--&lt;&#x2F;span&gt;&lt;span&gt;sp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; result&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;              &#x2F;&#x2F; stack push&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword z-punctuation z-terminator&quot;&gt;    break;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Sub&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Cmp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Jmp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s take &lt;code&gt;pc&lt;&#x2F;code&gt; to be a constant, and furthermore assume that the
&lt;em&gt;memory that &lt;code&gt;pc&lt;&#x2F;code&gt; points to&lt;&#x2F;em&gt; (the bytecode) is constant; and for this
example, assume that &lt;code&gt;pc[0] == Opcode::Add&lt;&#x2F;code&gt;. Let&#x27;s also say that &lt;code&gt;sp&lt;&#x2F;code&gt;
starts at &lt;code&gt;1024&lt;&#x2F;code&gt;. Then we can &quot;constant fold&quot; the whole interpreter by
(i) simplifying the switch-statement, which is now operating on a
&lt;em&gt;constant&lt;&#x2F;em&gt; input (the opcode), and (ii) propagating through any other
constants we know based on initial state, such as &lt;code&gt;sp&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; at pc[0]: Opcode::Add, with initial sp == 1024.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  uint32_t&lt;&#x2F;span&gt;&lt;span&gt; lhs &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; stack&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;1024&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  uint32_t&lt;&#x2F;span&gt;&lt;span&gt; rhs &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; stack&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;1025&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  uint32_t&lt;&#x2F;span&gt;&lt;span&gt; result &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; lhs &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;+&lt;&#x2F;span&gt;&lt;span&gt; rhs&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  stack&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;1025&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; result&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  sp &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt; 1025&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If one squints appropriately, one could claim that we &quot;compiled&quot; the
&lt;code&gt;Add&lt;&#x2F;code&gt; opcode here: we converted the generic interpreter, with
switch-cases for every opcode and operand-stack accesses parameterized
on current stack depth, to code that is specific to the actual program
we&#x27;re interpreting.&lt;&#x2F;p&gt;
&lt;p&gt;So is that it? Can we do a first Futamura projection by holding the
&quot;bytecode&quot; memory constant, and doing constant folding (including
branch folding)? It can&#x27;t be that easy, right?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-problem-lattices-and-loops&quot;&gt;The Problem: Lattices and Loops&lt;&#x2F;h3&gt;
&lt;p&gt;Indeed, it quickly becomes much more difficult. Consider what happens
as we widen our scope from programs of &lt;em&gt;one&lt;&#x2F;em&gt; opcode to programs of
&lt;em&gt;two&lt;&#x2F;em&gt; opcodes (!). We&#x27;ll have to update our interpreter to include a
loop, and an update to &lt;code&gt;pc&lt;&#x2F;code&gt; after it &quot;fetches&quot; each opcode:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;while&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; (&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-language&quot;&gt;true&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  switch&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; (&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt;pc&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;++&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Add&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Sub&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Cmp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Jmp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;* ... *&#x2F;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A compiler&#x27;s constant-propagation&#x2F;folding pass, and other passes that
compute &lt;em&gt;properties of program values&lt;&#x2F;em&gt; and then mutate the program
accordingly, are usually built around an iterated dataflow fixpoint
solver&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;. To modify a program at compile time, we must work with
properties that are always true -- even when a particular expression
or variable in the program might have many different values during
execution. For example, a loop body might increment an &lt;code&gt;index&lt;&#x2F;code&gt;
variable, giving a different value for each iteration of the loop, so
we cannot conclude that the value is constant. We need a way of
merging different possibilities together to find properties that are
true in all cases.&lt;&#x2F;p&gt;
&lt;p&gt;The constant-propagation analysis works by computing properties that
lie in a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lattice_(order)&quot;&gt;lattice&lt;&#x2F;a&gt;, a
mathematical structure with values that allow for &quot;merging&quot; (the
&quot;meet&quot; function) and with certain properties around that merging that
make it well-behaved, so that we always arrive at the same single
solution. Typically, on a lattice used by a
constant-propagation&#x2F;folding analysis, if we see in one instance that
&lt;code&gt;x&lt;&#x2F;code&gt; is constant &lt;code&gt;K1&lt;&#x2F;code&gt;, and in another instance that &lt;code&gt;x&lt;&#x2F;code&gt; is constant
&lt;code&gt;K2&lt;&#x2F;code&gt;, then we have to move &quot;down the lattice&quot; and &quot;merge&quot; to a fact
that claims nothing at all: &quot;&lt;code&gt;x&lt;&#x2F;code&gt; has some arbitrary non-constant value
at runtime&quot;. This is sometimes called the &quot;meet-over-all-paths&quot;
solution, because we merge possible behavior over all control-flow
paths that could lead to some program point.&lt;&#x2F;p&gt;
&lt;p&gt;Consider the &lt;code&gt;pc&lt;&#x2F;code&gt; variable in this loop: on the first iteration, it
begins at the start of the bytecode buffer (offset &lt;code&gt;0&lt;&#x2F;code&gt;). After one
opcode, it lies at offset &lt;code&gt;1&lt;&#x2F;code&gt;. And so on. Can a constant-propagation
pass &quot;bake in&quot; a single constant value for &lt;code&gt;pc&lt;&#x2F;code&gt; that will be valid for
every iteration of the loop? No: in fact, it is not constant.&lt;&#x2F;p&gt;
&lt;p&gt;Likewise for the bytecode pointed-to by &lt;code&gt;pc&lt;&#x2F;code&gt;: in the first iteration
of the loop, it may be &lt;code&gt;Add&lt;&#x2F;code&gt;, and in the second iteration, something
else; there is no &lt;em&gt;single&lt;&#x2F;em&gt; constant opcode that we can propagate into
the &lt;code&gt;switch&lt;&#x2F;code&gt; statement to simplify the whole interpreter body down to
the specific code for each opcode.&lt;&#x2F;p&gt;
&lt;p&gt;The essence of the problem here is that the first Futamura projection
of an interpreter needs to &lt;em&gt;somehow be aware of the interpreter
loop&lt;&#x2F;em&gt;. In essence, it needs to &quot;unroll&quot; the loop: it needs to do a
&lt;em&gt;separate&lt;&#x2F;em&gt; constant propagation for each opcode.&lt;&#x2F;p&gt;
&lt;p&gt;One might be tempted to build a transform that &quot;traces&quot; across edges,
including the interpreter loop
backedge(s). &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;pypy.org&#x2F;&quot;&gt;PyPy&lt;&#x2F;a&gt;&#x27;s metatracing works something
like this. One runs into two issues, though: (i) the compilation is
&quot;loop-centric&quot; (rather than complete compilation of each
function&#x2F;method), and (ii) merge-points are tricky. Consider an
if-else sequence with two sides that eventually jump back to the same
&lt;code&gt;pc&lt;&#x2F;code&gt;. Do we keep &quot;tracing&quot; the path on each side or do we detect
reconvergence and stitch the resulting code back together? That is, if
we have the bytecode&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-bytecode-cfg-web.svg&quot; alt=&quot;Figure: control-flow if-else diamond with bytecode&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;we might naively &quot;follow the control flow&quot; in the interpreter,
generating one path that is &lt;code&gt;pc=0,1,2,3,5,6,7&lt;&#x2F;code&gt; and another that is
&lt;code&gt;pc=0,1,4,5,6,7&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-bytecode-cfg-traces-web.svg&quot; alt=&quot;Figure: control-flow if-else diamond with bytecode, with traces along control-flow paths&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Note that we have duplicated the &quot;tail&quot; (&lt;code&gt;pc=5&lt;&#x2F;code&gt; onward) and this code
growth can become exponential if we have, say, a tree of conditionals.&lt;&#x2F;p&gt;
&lt;p&gt;The property that we want is that the CFG of the original bytecode is
somehow reflected into the compiled code: take the original
&lt;em&gt;structure&lt;&#x2F;em&gt; of the bytecode, for each opcode simulate the execution of
one iteration of the interpreter loop (on that opcode), and stitch it
together. So given the initial &lt;em&gt;interpreter loop&lt;&#x2F;em&gt; (left) and &lt;em&gt;bytecode
being interpreted&lt;&#x2F;em&gt; (right) here:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-interpreter-and-bytecode-cfg-web.svg&quot; alt=&quot;Figure: interpreter CFG and bytecode CFG&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;we want something like the following, where each interpreted opcode is
initially &quot;unrolled&quot; into a whole copy of the interpreter loop:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-bytecode-cfg-full-of-interpreters-web.svg&quot; alt=&quot;Figure: bytecode CFG filled with copies of interpreter CFG&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This might seem a bit wild at first -- a whole bunch of copies of the
entire interpreter? -- until one sees that this &lt;em&gt;reduces the problem
to a previously-solved one above&lt;&#x2F;em&gt;, namely, how to specialize an
interpreter for a single opcode. We can take each &lt;em&gt;copy&lt;&#x2F;em&gt; of the
interpreter loop and constant-propagate and branch-fold it. The effect
is as if we surgically plucked the implementations of each opcode out
of the middle of the interpreter CFG and built a compiled version of
the bytecode with them:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-bytecode-cfg-full-of-specialized-interpreters-web.svg&quot; alt=&quot;Figure: bytecode CFG with specialized copies of interpreter CFG&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;To make this work, it still seems like we would need some sort of
&quot;visibility&quot; into the workings of the interpreter: we would need to
tell the transform tool, the first Futamura projector, about our
bytecode and its structure, and the interpreter loop that is meant to
operate on it, so it could perform this careful surgery.&lt;&#x2F;p&gt;
&lt;p&gt;GraalVM solves this problem in a very simple and direct way: the
interpreter needs to be written in terms of GraalVM AST classes, and
given uses of those classes, the transform can &quot;pick out&quot; the right
parts of the logic, while otherwise being aware of the overall CFG of
the interpreted program &lt;em&gt;directly&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;I considered but rejected such an approach: I did not want to build a
tool that would require rewriting an existing interpreter, or deep
surgery to rebuild it in terms of the framework&#x27;s new
abstractions. Such a transform would be very error-prone and hinder
adoption. Is there another way to get the transform to &quot;see&quot; the
iterations of the interpreter loop separately?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;contexts&quot;&gt;Contexts&lt;&#x2F;h3&gt;
&lt;p&gt;The breakthrough for weval came when I realized that &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Pointer_analysis#Context-Insensitive,_Flow-Insensitive_Algorithms&quot;&gt;&lt;em&gt;context
sensitivity&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;,
a standard tool in static analysis (e.g. pointer&#x2F;alias analysis),
could allow us to escape the tyranny of meet-over-all-paths collapsing
our carefully-curated constant values into a mush of &quot;runtime&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;The key idea is to make &lt;code&gt;pc&lt;&#x2F;code&gt; itself the context. The constant-folding
analysis otherwise &lt;em&gt;doesn&#x27;t know anything about interpreters&lt;&#x2F;em&gt;; it just
has an intrinsic that means &quot;please process the code that flows from
this point in a different context&quot;, and it keeps the &quot;constant value&quot;
state &lt;em&gt;separately&lt;&#x2F;em&gt; for each context.&lt;&#x2F;p&gt;
&lt;p&gt;Given this intrinsic, we can &quot;simply&quot; update the context whenever we
update &lt;code&gt;pc&lt;&#x2F;code&gt;; so the loop looks something like this:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;update_context&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;pc&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; intrinsic visible to the analysis, otherwise a no-op&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;while&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; (&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-language&quot;&gt;true&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  switch&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; (&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;*&lt;&#x2F;span&gt;&lt;span&gt;pc&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    case&lt;&#x2F;span&gt;&lt;span&gt; Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;Add&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;: {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;      &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      pc&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-punctuation z-terminator&quot;&gt;++;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;      update_context&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;pc&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; intrinsic visible to the analysis, otherwise a no-op&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword z-punctuation z-terminator&quot;&gt;      break;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;               &#x2F;&#x2F; loop backedge&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Let&#x27;s walk through a constant-folding example. We start with &lt;code&gt;pc&lt;&#x2F;code&gt; at
the beginning of the bytecode (for simplicity let&#x27;s say &lt;code&gt;pc=0&lt;&#x2F;code&gt;, though
it&#x27;s actually somewhere in the heap snapshot), and we know that &lt;code&gt;*pc&lt;&#x2F;code&gt;
is &lt;code&gt;Opcode::Add&lt;&#x2F;code&gt;. So in the &lt;em&gt;context&lt;&#x2F;em&gt; of &lt;code&gt;pc=0&lt;&#x2F;code&gt;, we can know that &lt;code&gt;pc&lt;&#x2F;code&gt;
at the top of the loop starts as a constant. We see the &lt;code&gt;switch&lt;&#x2F;code&gt;, we
branch-fold because the opcode at &lt;code&gt;*pc&lt;&#x2F;code&gt; is a constant (because &lt;code&gt;pc&lt;&#x2F;code&gt; is
a constant and it points to constant memory). We trace through the
implementation of &lt;code&gt;Opcode::Add&lt;&#x2F;code&gt; (and disregard all the other
switch-cases). We increment &lt;code&gt;pc&lt;&#x2F;code&gt; and see a loop backedge -- isn&#x27;t this
where the constant-value analysis sees that the &lt;code&gt;pc=0&lt;&#x2F;code&gt; case (this
iteration) and the &lt;code&gt;pc=1&lt;&#x2F;code&gt; case (the next iteration) &quot;meet&quot; and we
can&#x27;t conclude it&#x27;s a constant at all?&lt;&#x2F;p&gt;
&lt;p&gt;No! Because just before the loop backedge, we &lt;em&gt;updated the
context&lt;&#x2F;em&gt;. As we trace through the code and track constant values, and
follow control-flow edges and propagate that state to their
destinations, we are now in the context of &lt;code&gt;pc=1&lt;&#x2F;code&gt;. We reach the loop
header again, but in a &lt;em&gt;new context&lt;&#x2F;em&gt;, so nothing collapses to &quot;not
constant&quot; &#x2F; &quot;runtime&quot;; &lt;code&gt;pc&lt;&#x2F;code&gt; is actually a known constant
everywhere. In other words, we go from this analysis situation:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-constant-prop-loop-web.svg&quot; alt=&quot;Figure: constant-propagation attempting to find a constant PC in an interpreter loop&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;where the analysis &lt;em&gt;correctly&lt;&#x2F;em&gt; determines that PC is not constant (the
backedge carries a value &lt;code&gt;pc=1&lt;&#x2F;code&gt; into the same block that receives
&lt;code&gt;pc=0&lt;&#x2F;code&gt; from the entry point, so we can only conclude &lt;code&gt;Unknown&lt;&#x2F;code&gt;), to
this one:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2024-08-22-constant-prop-loop-ctx-web.svg&quot; alt=&quot;Figure: constant-propagation finds constant PCs given context-sensitivity&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;em&gt;contexts&lt;&#x2F;em&gt; are thus the way in which we do the code duplication
implied earlier (a separate copy of the interpreter body for each PC
location, then simplify). However, note that they are completely
driven by the intrinsics in the code that the interpreter developer
writes: the analysis &lt;em&gt;does not know about the interpreter loop&lt;&#x2F;em&gt;
otherwise.&lt;&#x2F;p&gt;
&lt;p&gt;The overall analysis loop processes &lt;code&gt;(context, block)&lt;&#x2F;code&gt; &lt;em&gt;tuples&lt;&#x2F;em&gt;, and
carries abstract state (known constant, or &lt;code&gt;Unknown&lt;&#x2F;code&gt;) per &lt;code&gt;(context, SSA value)&lt;&#x2F;code&gt; &lt;em&gt;tuple&lt;&#x2F;em&gt;. When we perform the transform, we duplicate each
original basic block into a basic block per context, but only on
demand, when the block is reached in some context. Branches in each
block resolve to the appropriate target block in the appropriate
(possibly updated) context. This is key: the interpreter &lt;em&gt;backedge&lt;&#x2F;em&gt; in
the original CFG becomes an edge from &quot;tail block in context i&quot; to
&quot;loop header block in context i+1&quot;; it becomes an edge to the next
opcode&#x27;s code in the compiled function body. This may be a forward
edge or a backedge, depending on the original bytecode CFG&#x27;s shape.&lt;&#x2F;p&gt;
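&lt;p&gt;The bookkeeping for this can be sketched as follows (illustrative types and a toy fixed-size table, not weval&#x27;s actual data structures): every block lookup is keyed on a &lt;code&gt;(context, block)&lt;&#x2F;code&gt; pair, and a specialized copy is created on first use, so a backedge taken in context &lt;code&gt;i&lt;&#x2F;code&gt; resolves to a fresh copy of the loop header in context &lt;code&gt;i+1&lt;&#x2F;code&gt; rather than merging back into the same header:&lt;&#x2F;p&gt;

```cpp
// Illustrative sketch of context-keyed block duplication. A real
// implementation would use a hash map; a linear-scan table keeps
// this sketch free of library dependencies.
struct Key {
    unsigned long long context; // e.g. the interpreter pc
    unsigned block;             // original basic-block id
};

struct Entry {
    Key key;
    unsigned specialized; // id of the specialized copy of the block
};

static Entry entries[1024];
static int n_entries = 0;
static int next_id = 0;

// Return the specialized block for (context, block), creating it on
// first use. Branch targets are rewritten through this lookup, so
// the same original block gets a distinct copy per context.
unsigned get_or_create(unsigned long long context, unsigned block) {
    for (int i = 0; i != n_entries; i++) {
        if (entries[i].key.context == context) {
            if (entries[i].key.block == block) {
                return entries[i].specialized;
            }
        }
    }
    Entry e;
    e.key.context = context;
    e.key.block = block;
    e.specialized = (unsigned)next_id;
    next_id = next_id + 1;
    entries[n_entries] = e;
    n_entries = n_entries + 1;
    return e.specialized;
}
```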
&lt;p&gt;One can see this as a sort of tracing through the control flow of the
interpreter, but with one crucial difference: contexts serve as a way
to &lt;em&gt;reconnect&lt;&#x2F;em&gt; merge points. We thus don&#x27;t get straight-line traces;
rather, when we encounter a point in the interpreter that we could
branch to &lt;code&gt;pc=K1&lt;&#x2F;code&gt; or &lt;code&gt;pc=K2&lt;&#x2F;code&gt;, we see a context update to &lt;code&gt;K1&lt;&#x2F;code&gt; or &lt;code&gt;K2&lt;&#x2F;code&gt;
and a branch to &lt;code&gt;entry&lt;&#x2F;code&gt;; we emit in the specialized function body an
edge to the &lt;code&gt;(K1, entry)&lt;&#x2F;code&gt; or &lt;code&gt;(K2, entry)&lt;&#x2F;code&gt; block, which is the point
in the compiled code that corresponds to the start of the &quot;copy of the
interpreter for that opcode&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;We thus get &lt;em&gt;exactly the behavior we outlined as our desired endpoint
above&lt;&#x2F;em&gt;, where the bytecode&#x27;s CFG gets populated with interpreter cases
for each opcode, and it &quot;just falls out&quot; of context sensitivity. This
is mind-blowing (at least, to me!).&lt;&#x2F;p&gt;
&lt;p&gt;An animation of a worked example for a simple two-opcode loop is
available in my talk
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=_T3s6-C38JI&amp;amp;t=27m06s&quot;&gt;here&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;generic-mechanism-vs-interpreter-specific-transform&quot;&gt;Generic Mechanism vs. Interpreter-Specific Transform&lt;&#x2F;h3&gt;
&lt;p&gt;Seen a certain way, the use of constant-folding plus contexts seems a
little circuitous at first: &lt;code&gt;pc&lt;&#x2F;code&gt; at the top of the loop is a constant
&lt;code&gt;k&lt;&#x2F;code&gt; when we&#x27;re in context &lt;code&gt;pc=k&lt;&#x2F;code&gt;; isn&#x27;t that more complex than a
mechanism that &quot;just&quot; understands interpreter PCs directly? Well,
perhaps on the surface, but actually the implications of the
context-sensitivity are fairly deep and make much of the rest of the
analysis &quot;just work&quot; as well. In particular, &lt;em&gt;all&lt;&#x2F;em&gt; values in the loop
body get their own PC-specific context to specialize in, and
context-sensitivity is a fairly mechanical change to make to a
standard dataflow analysis, whereas something &lt;em&gt;specific&lt;&#x2F;em&gt; to
interpreters would likely be a lot more fragile and complex to
maintain.&lt;&#x2F;p&gt;
&lt;p&gt;One way that the robustness of this &quot;compositional&quot; approach becomes
more clear is when considering (and trying to convince oneself of) the
&lt;em&gt;correctness&lt;&#x2F;em&gt; of the approach. An extremely important property of the
&quot;context&quot; is that it is, with respect to correctness (i.e., semantic
equivalence to the original interpreter), &lt;em&gt;purely heuristic&lt;&#x2F;em&gt;. We could
decide to compute some arbitrary value and enter that context at any
point; we could get the PC &quot;wrong&quot;, miss an update somewhere, etc. The
worst that will happen is that we have two different constant values
of &lt;code&gt;pc&lt;&#x2F;code&gt; merge at the loop header, and we can no longer branch-fold and
specialize the interpreter loop body for an opcode. The degenerate
behavior is that we get a copy of the interpreter loop body (with
runtime control flow remaining) for every opcode. That&#x27;s not great as
an optimization, and of course we don&#x27;t want it, but it is &lt;em&gt;still
correct&lt;&#x2F;em&gt;. There is nothing that the (minor) modifications to the
interpreter, to add context sensitivity, can get wrong that will break
the semantics.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-weval-transform&quot;&gt;The weval Transform&lt;&#x2F;h2&gt;
&lt;p&gt;We&#x27;re finally ready to see &quot;the weval transform&quot;, which is the heart
of the first Futamura projection that weval performs. It actually does
the constant-propagation analysis at the same time as the code
transform.&lt;&#x2F;p&gt;
&lt;p&gt;The analysis and transform operate on SSA-based IR, which weval
obtains from the original Wasm function via my
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;waffle&quot;&gt;waffle&lt;&#x2F;a&gt; library, and then compiles
the SSA IR of the resulting specialized function back to Wasm function
bytecode.&lt;&#x2F;p&gt;
&lt;p&gt;The transform can be summarized with the following pseudocode:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Input: - original function in SSA,   Output: - specialized function in SSA,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       - with some parameters                  whose semantics are&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         specified as                          identical to the&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         - &amp;quot;constant&amp;quot; or                       original function,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         - &amp;quot;points to constant mem&amp;quot;,           - given constants for&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       - and with &amp;quot;update_context&amp;quot;               parameters, and&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         intrinsics added.                     - assuming &amp;quot;constant&amp;quot; memory&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                                 remains the same.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;State:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Map from (context, block_orig) to block_specialized&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Map from (context, value_orig) to value_specialized&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Workqueue of (context, block_orig)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Abstract state (Constant, ConstantMem or Unknown)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 for each (context, value)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Dependency map from value_specialized to&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                      set((context, block_orig))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Flow-sensitive state&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (context, abstract values for any global state)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  per basic-block input&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Init:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Push (empty context, entry block) onto workqueue.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Main loop:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- Pull a (context, block) from the workqueue.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- If a specialized block doesn&amp;#39;t exist in the map,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  create one.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- For each value in the original block:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - Fetch the abstract values for all operands&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    (Constant or Unknown).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - Evaluate the operator: if constant input(s) imply&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    a constant output, compute that.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - Copy either the original operator with translated&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    input values, or a &amp;quot;constant value&amp;quot; operator,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    to the output block.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - Update value map and abstract state.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    - If dependency map shows dependency to&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      already-processed block, enqueue on workqueue&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      for reprocessing.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - On visiting an update_context intrinsic, update&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    flow-sensitive state.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- For the block terminator:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - Branch-fold if able based on constant inputs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    to conditionals or switches.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  - For each possible target,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    - update blockparams, meeting into existing&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      abstract state;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    - enqueue on workqueue if blockparam abstract state&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      changed or if new block;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    - rewrite target in branch to specialized block&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      for this context, block pair.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And that&#x27;s it!&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; Note a few things in particular:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The algorithm is never aware that it is operating on &quot;an
interpreter&quot;, much less any specific interpreter. This is good: we
don&#x27;t need to tightly couple the compiler tooling here with the
use-case (a SpiderMonkey interpreter loop), and it means that we can
debug and test each separately.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The algorithm, at this high level at least, has no special cases or
other oddities: it is more or less what one would expect out of a
constant-propagation and branch folding pass that operates in a
fixpoint, with the only really novel part that it keeps a &quot;current
context&quot; as state and parameterizes all lookups in maps on that
context. Once one accepts that &quot;duplication of code&quot; (processing the
same blocks multiple times, in different contexts) is correct, it&#x27;s
reasonable to convince oneself that the whole algorithm is correct.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The algorithm leaves room for other kinds of flow-sensitive state
and intrinsics, if we &lt;em&gt;do&lt;&#x2F;em&gt; want to provide more optimizations to the
user. (More on this below!)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;I am omitting some details around SSA properties -- specifically, the
use of values in basic blocks in the original function that are
defined in other blocks higher in the domtree (which is perfectly
legal in SSA). Because the new specialized function body may have
different dominance relationships between blocks, some of these
use-def links are no longer legal. The naive approach to this problem
(and the one I took at first) is to transform into &quot;maximal SSA&quot;
first, in which all live values are carried as blockparams at every
block; that turns out to be really expensive, so weval computes a
&quot;cut-set&quot; of blocks at which max-SSA is actually necessary based on
the locations of context-update intrinsics (intuitively, approximately
just the interpreter backedge, but we want the definition of this to
be independent of the notion of an interpreter and where it does
context updates).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;using-weval-function-specialization-in-a-snapshot-abstraction&quot;&gt;Using weval: &quot;Function Specialization in a Snapshot&quot; Abstraction&lt;&#x2F;h2&gt;
&lt;p&gt;So how does one interact with this tool, as an interpreter author?&lt;&#x2F;p&gt;
&lt;p&gt;As noted above, for a Futamura projection, the key abstraction is
&quot;function specialization&quot; (or partial evaluation). In the context of
weval, this means that the interpreter requests a specialization of
some &quot;generic&quot; interpreter function, with some constant inputs (at
least the bytecode, presumably!), and gets back a function pointer to
another function in return. That specialized function is semantically
equivalent to the original, except that it has all of the specified
constants (function parameters and memory contents) &quot;baked in&quot; and has
been optimized accordingly.&lt;&#x2F;p&gt;
&lt;p&gt;Said another way: the specialized function body will be the compiled
form of the given bytecode.&lt;&#x2F;p&gt;
&lt;p&gt;Because the specialization works in the context of data in memory (the
bytecode), it has to happen in the context of running program
state. For example, in SpiderMonkey, there is a &lt;code&gt;class JSFunction&lt;&#x2F;code&gt;
that ultimately points to some bytecode, and we want to produce a new
Wasm function that bakes in everything about that
&lt;code&gt;JSFunction&lt;&#x2F;code&gt;. Because this object only exists in the Wasm heap after
the Wasm code has run, parsed some JS source, allocated some space on
the heap, and emitted its internal bytecode, we cannot do the
specialization on the &quot;original Wasm&quot; outside the scope of some
particular execution.&lt;&#x2F;p&gt;
&lt;p&gt;Recall from the &lt;a href=&quot;&#x2F;blog&#x2F;2024&#x2F;08&#x2F;27&#x2F;aot-js&#x2F;&quot;&gt;earlier post&lt;&#x2F;a&gt; that we want
a &quot;phase separation&quot;: we want a fully ahead-of-time process. We
definitely cannot embed a weval transform in a Wasm engine and expose
it at runtime. How do we reconcile this need for phasing with weval&#x27;s
requirement to see the heap snapshot and specialize over it?&lt;&#x2F;p&gt;
&lt;p&gt;The answer comes in the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wizer&quot;&gt;Wizer&lt;&#x2F;a&gt; tool, or more
generally the notion of &lt;em&gt;snapshotting&lt;&#x2F;em&gt;: we modify the top-level logic
of the language runtime so that we can parse source and produce
internal bytecode when invoked with some entry point, and set up
global state so that an invocation from another entry point actually
runs the code. This work has already been done for various runtimes,
including SpiderMonkey, because it is a useful transform to &quot;bundle&quot;
an interpreter with pre-parsed bytecode and in-memory data structures,
even if we don&#x27;t do any compilation.&lt;&#x2F;p&gt;
&lt;p&gt;weval thus builds on top of Wizer: it takes a Wasm snapshot, with all
the runtime&#x27;s data structures, finds the bytecode and interpreter
function, does the specialization, and appends new functions to the
Wasm snapshot.&lt;&#x2F;p&gt;
&lt;p&gt;How does weval find what it needs to specialize? The interpreter,
inside the Wasm module, makes &quot;weval requests&quot; via an API that
(nominally, in simplified terms) looks like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;void&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; weval_request&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-support z-type&quot;&gt;func_t&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt; specialized&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-support z-type&quot;&gt; func_t&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt; generic&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-support z-type&quot;&gt; param_t&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;*&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt; params&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where &lt;code&gt;generic&lt;&#x2F;code&gt; is a function pointer to the generic function,
&lt;code&gt;specialized&lt;&#x2F;code&gt; is a location on the heap to place a function pointer to
the resulting specialized function, and &lt;code&gt;params&lt;&#x2F;code&gt; encodes all of the
information to specialize on (including the pointer to bytecode).&lt;&#x2F;p&gt;
&lt;p&gt;Internally, this &quot;request&quot; is recorded as a data structure with a
well-known format in the Wasm heap, and is linked into a linked list
that is reachable via a well-known global. weval knows where to look
in the Wasm snapshot to find these requests.&lt;&#x2F;p&gt;
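&lt;p&gt;As a rough sketch (with hypothetical field and type names; the real
layout is defined by weval and its vendored &lt;code&gt;weval.h&lt;&#x2F;code&gt;), the
in-memory request record might look like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;&#x2F;* Hypothetical sketch of a weval request record. *&#x2F;
struct weval_req {
    struct weval_req *next;  &#x2F;* next request in the linked list *&#x2F;
    func_t *specialized;     &#x2F;* heap slot where weval writes the specialized function pointer *&#x2F;
    func_t generic;          &#x2F;* the generic (interpreter) function *&#x2F;
    param_t *params;         &#x2F;* constant values to specialize on, including the bytecode pointer *&#x2F;
};

&#x2F;* A well-known global points at the list head so the weval tool can
   find all pending requests in the heap snapshot. *&#x2F;
struct weval_req *weval_req_list_head;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;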
&lt;p&gt;From the point of view of the guest, these requests are fulfilled
asynchronously: the &lt;code&gt;specialized&lt;&#x2F;code&gt; function pointer will not be filled
in right away with a new function, but will be at some point in the
future. The interpreter thus must have a conditional when invoking
each function: if the specialized version exists, call that, otherwise
call the generic interpreter body.&lt;&#x2F;p&gt;
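&lt;p&gt;In code, the call site&#x27;s conditional might look something like the
following sketch (names here are illustrative, not the actual
SpiderMonkey or weval API):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;&#x2F;* Call the specialized body if weval has filled in the pointer;
   otherwise fall back to the generic interpreter. *&#x2F;
value_t invoke(script_t *script, args_t *args) {
    if (script-&amp;gt;specialized != NULL) {
        return script-&amp;gt;specialized(args);  &#x2F;* weval-compiled code *&#x2F;
    }
    return interpreter_body(script, args);   &#x2F;* generic interpreter *&#x2F;
}&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;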
&lt;p&gt;Why asynchronously? Because this is how snapshotting appears from the
guest perspective: code runs during Wizer&#x27;s pre-initialization phase,
the language runtime&#x27;s frontend parses source and creates bytecode,
and weval requests are created for every function that is &quot;born&quot;. It
is only when pre-initialization execution completes, a snapshot of the
heap is taken, and that snapshot is processed by weval, that weval can
append new functions, fill in the function pointers in the heap
snapshot, and write out a new &lt;code&gt;.wasm&lt;&#x2F;code&gt; file. When &lt;em&gt;that&lt;&#x2F;em&gt; Wasm module is
later instantiated, it &quot;wakes up&quot; again from the snapshot and the
specialized functions suddenly exist. Voilà, compiled code.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;correctness-via-semantics-preserving-specialization&quot;&gt;Correctness via Semantics-Preserving Specialization&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s worth emphasizing one aspect of this whole design again. In order
to use weval, an interpreter (i) adds some &lt;code&gt;update_context&lt;&#x2F;code&gt; intrinsics
whenever &lt;code&gt;pc&lt;&#x2F;code&gt; is updated, and (ii) makes weval requests for function
pointers to specialized code. That&#x27;s it.&lt;&#x2F;p&gt;
&lt;p&gt;In particular, nothing in the interpreter has to encode the execution
semantics of the guest-language bytecode in a way that is only seen
during compiled-code execution. Rather, weval&#x27;d compiled code and
interpreted code execute using &lt;em&gt;exactly the same opcode
implementations&lt;&#x2F;em&gt;, by construction. This is a &lt;em&gt;radical simplification&lt;&#x2F;em&gt;
in the testing complexity of the whole source-language compilation
approach: we can test and debug the interpreter (running wherever we
like, including in a native rather than Wasm build), and we can
separately test and debug weval.&lt;&#x2F;p&gt;
&lt;p&gt;How do we test weval? During its development, I added a lockstep
execution mode where the &quot;generic&quot; function and &quot;specialized&quot; function
both run in a snapshot and we compare the results. If the transform is
correct, they should be identical, independent of whatever the
interpreter code is doing. We certainly run end-to-end tests of
weval&#x27;d JS code as well, for added assurance, but this test
factoring was a huge help in the &quot;bring-up&quot; of this approach.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;optimized-codegen-abstracted-interpreter-state&quot;&gt;Optimized Codegen: Abstracted Interpreter State&lt;&#x2F;h2&gt;
&lt;p&gt;One final suite of optimization ideas extends from the question: how
can we make the interpreter&#x27;s management of guest-language state more
efficient?&lt;&#x2F;p&gt;
&lt;p&gt;To unpack this a bit: when compiling any imperative language (at least
languages with reasonable semantics), we can usually identify elements
of the guest-language state, such as local variables, and map that
state directly to &quot;registers&quot; or other fast storage with fixed, static
names. (In Wasm bytecode, we use locals for this.) However, when
interpreting such a language, usually we have implementations for
opcodes that are generic over &lt;em&gt;which&lt;&#x2F;em&gt; variables we are reading and
writing, so we have runtime indirection. To make this concrete, if we
have a statement&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;z = x + y;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;we can compile this to something like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add z, x, y&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;in machine code or Wasm bytecode (with &lt;code&gt;add&lt;&#x2F;code&gt; suitably replaced by
whatever the static or dynamic types require). But if we have a case
for this operator in an interpreter loop, we have to write&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;regs[z] = regs[x] + regs[y];&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and in the compiled interpreter body, this implies many reads and
writes to the &lt;code&gt;regs&lt;&#x2F;code&gt; array in memory.&lt;&#x2F;p&gt;
&lt;p&gt;The weval approach to compilation-by-specializing-interpreters
directly copies the interpreter cases for each opcode. We can
specialize the register&#x2F;variable indices &lt;code&gt;x&lt;&#x2F;code&gt;, &lt;code&gt;y&lt;&#x2F;code&gt;, and &lt;code&gt;z&lt;&#x2F;code&gt; in the
above code, but we cannot (or should not) turn memory loads and stores
to an in-memory &lt;code&gt;regs&lt;&#x2F;code&gt; array into direct values in registers (Wasm
locals).&lt;&#x2F;p&gt;
&lt;p&gt;Why? Not only would this require a fair amount of complexity, it would
often be &lt;em&gt;incorrect&lt;&#x2F;em&gt;, at least in the case where &lt;code&gt;regs&lt;&#x2F;code&gt; is observable
from other parts of the interpreter. For example, if the interpreter
has a moving garbage collector, every part of the interpreter state
might be subject to tracing and pointer rewriting at any &quot;GC
safepoint&quot;, which could be at many different points in the interpreter
body.&lt;&#x2F;p&gt;
&lt;p&gt;Better, and more in the spirit of weval, would be to provide
intrinsics that the interpreter can &lt;em&gt;opt into&lt;&#x2F;em&gt; to indicate that some
value storage &lt;em&gt;can&lt;&#x2F;em&gt; be rewritten into direct dataflow and memory loads
and stores can be optimized out. weval provides such intrinsics for
&quot;locals&quot;, which are indexed by integers that must be constant during
const-prop (i.e., must come directly from the bytecode), and for an
&quot;operand stack&quot;, so that interpreters of stack VM-based bytecode (as
SpiderMonkey&#x27;s JS opcode VM is) can perform &lt;em&gt;virtual&lt;&#x2F;em&gt; pushes and
pops. weval tracks the abstract state of these locals and stack slots,
and &quot;flushes&quot; the state to memory only on a &quot;synchronize&quot; intrinsic,
which the interpreter uses when its state may be externally observable
(or never, if its design allows that). These intrinsics provided a
substantial speedup in SpiderMonkey&#x27;s case.&lt;&#x2F;p&gt;
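&lt;p&gt;For instance, an interpreter case for an add over the operand stack
might use these intrinsics roughly as follows (a sketch with
hypothetical intrinsic names, not the exact &lt;code&gt;weval.h&lt;&#x2F;code&gt; API):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;case OP_ADD: {
    &#x2F;* Virtual pops and push: weval tracks the abstract stack state
       and elides the loads and stores to the in-memory stack. *&#x2F;
    value_t rhs = weval_pop();
    value_t lhs = weval_pop();
    weval_push(lhs + rhs);
    break;
}
case OP_CALL: {
    &#x2F;* The callee (or a GC at this safepoint) may observe interpreter
       state, so flush the virtual locals and stack to memory first. *&#x2F;
    weval_synchronize();
    do_call(vm);
    break;
}&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;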
&lt;h2 id=&quot;other-optimizations-and-details&quot;&gt;Other Optimizations and Details&lt;&#x2F;h2&gt;
&lt;p&gt;I&#x27;ve elided a fair number of other details, of course! Worthy of note
are a few other things:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The weval tool has top-level caching per &quot;specialization
request&quot;. It takes a hash of the input Wasm module and the &quot;argument
string&quot;, which includes the bytecode and other constant values, of
each specialization request. If it has seen the exact input module
and a particular argument string before, it copies the resulting
Wasm function body directly out of the cache, as fast as SQLite and
the user&#x27;s disk can go. This turns out to be really useful for the
&quot;AOT ICs&quot; corpus mentioned in the previous post.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;I added special handling to elide the remnants of LLVM&#x27;s shadow
stack in function bodies where constant-propagation and
specialization removed all uses of data on the shadow stack. This
technically removes &quot;side-effects&quot; (updates to the shadow stack
pointer global) but only if they are unobserved, as determined by an
escape analysis.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Some kinds of patterns result from the partial evaluation that can
be cleaned up further. In particular, when an initial value (such as
a &lt;code&gt;pc&lt;&#x2F;code&gt; or &lt;code&gt;sp&lt;&#x2F;code&gt;) is &lt;code&gt;Unknown&lt;&#x2F;code&gt; (not constant or known at compile
time), but &lt;em&gt;offsets&lt;&#x2F;em&gt; from that value &lt;em&gt;are&lt;&#x2F;em&gt; constant properties of
program points in the bytecode, then we see chains of &lt;code&gt;pc&#x27; = pc + 3&lt;&#x2F;code&gt;, &lt;code&gt;pc&#x27;&#x27; = pc&#x27; + 4&lt;&#x2F;code&gt;, &lt;code&gt;pc&#x27;&#x27;&#x27; = pc&#x27;&#x27; + 1&lt;&#x2F;code&gt;, etc. We can rewrite all of
these in terms of the original non-constant value, in essence a
re-association of &lt;code&gt;(x + k1) + k2&lt;&#x2F;code&gt; to &lt;code&gt;x + (k1 + k2)&lt;&#x2F;code&gt;, which removes
a lot of extraneous dataflow from the fastpath of compiled code.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;switch&lt;&#x2F;code&gt; statements in the source language result in
user-data-dependent &quot;next contexts&quot; and &lt;code&gt;pc&lt;&#x2F;code&gt; values; what is needed
is to instead &quot;value-specialize&quot;, i.e. split into sub-contexts each
of which assumes &lt;code&gt;i = 0&lt;&#x2F;code&gt;, &lt;code&gt;i = 1&lt;&#x2F;code&gt;, ..., &lt;code&gt;i &amp;gt;= N&lt;&#x2F;code&gt;, generate a
Wasm-level switch opcode in the output, and constant-propagate
accordingly. This is how we can turn arbitrary-fanout control flow
in the source bytecode directly into the same control flow in Wasm.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;results-annotation-overhead-and-runtime-speedups&quot;&gt;Results: Annotation Overhead and Runtime Speedups&lt;&#x2F;h2&gt;
&lt;p&gt;weval is a general tool, usable by any interpreter; to evaluate it,
we&#x27;ll need to consider it in the context of particular
interpreters. Because it was developed with the SpiderMonkey JS engine
in mind, and for my work to enable fully ahead-of-time JS
compilation, I&#x27;ll mainly present results in that context.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Annotation Overhead&lt;&#x2F;em&gt;: the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;pull&#x2F;48&quot;&gt;PR that adds weval
support&lt;&#x2F;a&gt; to
SpiderMonkey shows a total lines-of-code delta of &lt;code&gt;+1045 -2&lt;&#x2F;code&gt; -- a
little over a thousand lines of code -- though much of this is the
vendored &lt;code&gt;weval.h&lt;&#x2F;code&gt; (part of weval proper, imported into the tree), and
a C++ RAII wrapper around weval requests (reusable in other
interpreters). The actual changes to the interpreter loop come in at
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;pull&#x2F;48&#x2F;files#diff-4958b3e538c41183f77fdc060e3947e5bbbcd7e8da1905dada8b4c9abe3ebabe&quot;&gt;133 lines of
code&lt;&#x2F;a&gt;
in an alternative set of macro definitions -- and, if we&#x27;re being
fair, the earlier mechanical changes to use these macros to access and
update interpreter state rather than direct code. Then there is a
little bit of plumbing to create weval requests when a function is
created (&lt;code&gt;EnqueueScriptSpecialization&lt;&#x2F;code&gt; and
&lt;code&gt;EnqueueICStubSpecialization&lt;&#x2F;code&gt;), about a hundred lines total including
the definitions of those functions.&lt;&#x2F;p&gt;
&lt;p&gt;As an additional demonstration of ease-of-use, one might consider &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bernsteinbear.com&#x2F;blog&#x2F;weval&#x2F;&quot;&gt;Max
Bernstein&#x27;s post&lt;&#x2F;a&gt;, in which Max
and I wrote a toy interpreter (10 opcodes) and weval&#x27;d it in a few
hours.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Runtime Speedup&lt;&#x2F;em&gt;: In my &lt;a href=&quot;&#x2F;blog&#x2F;2024&#x2F;08&#x2F;27&#x2F;aot-js&#x2F;#results&quot;&gt;earlier
post&lt;&#x2F;a&gt; I presented results showing
overall 2.77x geomean speedup, with up to 4.39x on the highest
benchmark (over 4x on 3 of 13). This reflects significant
optimization work in SpiderMonkey too -- the speedup does not come for
free on day one of a weval-adapted interpreter -- but the results are
absolutely enabled by AOT compilation: the weval&#x27;d PBL
interpreter body, run as an interpreter, provides only 1.27x the
performance of the generic interpreter. This means that weval&#x27;s
compilation via the first Futamura projection, as well as lifting of
interpreter dataflow out of memory to SSA, and other miscellaneous
post-processing optimization passes, are responsible for a 2.19x
speedup (and that part truly is &quot;for free&quot;).&lt;&#x2F;p&gt;
&lt;p&gt;Overall, I&#x27;m pretty happy with these results; and we are working on
shipping the use of weval to AOT-compile JS in the Bytecode Alliance
(see
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;StarlingMonkey&#x2F;pull&#x2F;91&quot;&gt;here&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;related-work&quot;&gt;Related Work&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;graalvm&quot;&gt;GraalVM&lt;&#x2F;h3&gt;
&lt;p&gt;Finally, to place weval in a broader context: in addition to the
foundational work on Futamura projections, weval is absolutely
inspired by GraalVM, the only other significant (in fact far more
significant) compiler based on partial-evaluation-of-interpreters to
my knowledge. weval is in broad strokes doing &quot;the same thing&quot; as
GraalVM, but makes a few (massive) simplifications and a few
generalizations as well:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;weval targets WebAssembly, a general-purpose low-level compiler
target that can support interpreters written in many languages
(e.g., C and C++, covering a majority of the most frequently-used
scripting language interpreters and JavaScript interpreters today),
while GraalVM targets the Java Virtual Machine.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;weval is fully ahead-of-time, while classically at least, GraalVM is
closely tied with the JVM&#x27;s JIT and requires runtime codegen and
(infamously) long warmup times. More recent versions of GraalVM also
support &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.graalvm.org&#x2F;latest&#x2F;reference-manual&#x2F;native-image&#x2F;&quot;&gt;GraalVM Native
Image&lt;&#x2F;a&gt;,
though GraalVM still carries complexity due to its support for,
e.g., runtime de-optimization.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;weval requires the interpreter to add a few intrinsics (details
above) but otherwise &quot;meets interpreters where they are&quot;, allowing
adaptation of existing industrial-grade language implementations
(such as SpiderMonkey!), while GraalVM requires the interpreter to
be written in terms of its own AST classes.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;On the other hand, weval, as an AOT-only transform, does not support
any speculative-optimization or deoptimization mechanisms, or
gradual warmup, vastly simplifying the design as compared to GraalVM
but also limiting performance and flexibility (for now).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;porffor-static-hermes-static-typescript-assemblyscript&quot;&gt;porffor, Static Hermes, Static TypeScript, AssemblyScript&lt;&#x2F;h3&gt;
&lt;p&gt;There have been a variety of ahead-of-time compilers that accept
either JavaScript
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www-sop.inria.fr&#x2F;members&#x2F;Manuel.Serrano&#x2F;publi&#x2F;serrano-dls18.pdf&quot;&gt;Hopc&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;porffor.dev&#x2F;&quot;&gt;porffor&lt;&#x2F;a&gt;) or annotated JS (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;facebook&#x2F;hermes&quot;&gt;Static
Hermes&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.microsoft.com&#x2F;en-us&#x2F;research&#x2F;uploads&#x2F;prod&#x2F;2019&#x2F;09&#x2F;static-typescript-draft2.pdf&quot;&gt;Static
TypeScript&lt;&#x2F;a&gt;),
or a JavaScript&#x2F;TypeScript-like language
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.assemblyscript.org&#x2F;&quot;&gt;AssemblyScript&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;p&gt;Manuel Serrano in his &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www-sop.inria.fr&#x2F;members&#x2F;Manuel.Serrano&#x2F;publi&#x2F;serrano-dls18.pdf&quot;&gt;DLS&#x27;18
paper&lt;&#x2F;a&gt;
describes building an ahead-of-time compiler, Hopc, for JS that takes
a simple approach to specialization without runtime type observations:
it builds &lt;em&gt;one&lt;&#x2F;em&gt; specialized version of each function alongside the
generic version, and uses a set of type-inference rules to pick the
most &lt;em&gt;likely&lt;&#x2F;em&gt; types. This is shown to work quite well in the
paper. The main downside is the limit to how far this inference can
go: for example, SpiderMonkey&#x27;s ICs can specialize property accesses
on object shape, while Hopc does not infer object shapes and in
general such an analysis is hard (similar to global pointer analysis).&lt;&#x2F;p&gt;
&lt;p&gt;The porffor JS compiler (as best I can tell!) emits &quot;direct&quot;
code for JS operators, with some type inference to reduce the number
of cases needed. The speedups on small programs are compelling, and I
think that taking this approach from scratch actually makes a lot of
sense. That said, JS is a large language with a lot of awkward
corner-cases, and with (as of writing) 39% of the ECMA-262 test suite
passing, porffor still has a hill to climb to reach parity with mature
engines such as SpiderMonkey or V8. I sincerely wish them all the best
of luck and would love to see the project succeed!&lt;&#x2F;p&gt;
&lt;p&gt;Static Hermes and Static TypeScript both adopt the idea: what if we
require type annotations for compilation? In that case, we can
dispense with the whole idea of dynamic IC dispatch -- we know which
case(s) will be needed ahead of time and we can inline them
directly. This is definitely the right answer if one has those
annotations. More modern codebases tend to have more discipline in
this regard; unfortunately, the world is filled with &quot;legacy JS&quot; and
so the idea is not universally applicable.&lt;&#x2F;p&gt;
&lt;p&gt;AssemblyScript is another design point in that same spirit, taken
further: though it superficially resembles TypeScript, its semantics
are often a little different, in a way that makes the language simpler
(to the implementer) overall but creates major compatibility hurdles
in any effort to port existing TypeScript. For example, it does not
support function closures (lambdas), iterators or exceptions, all
regularly used features in large JS&#x2F;TS codebases today.&lt;&#x2F;p&gt;
&lt;p&gt;Compared to all of the cases above, the approach we have taken here --
adapting SpiderMonkey and turning a 100%-coverage interpreter tier
into a compiler with weval -- provides &lt;em&gt;full compatibility&lt;&#x2F;em&gt; with
everything that the SpiderMonkey engine supports, i.e., all recent
JavaScript features and APIs, without requiring anything special.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion-principled-derivation-of-a-jit&quot;&gt;Conclusion: Principled Derivation of a JIT&lt;&#x2F;h2&gt;
&lt;p&gt;weval has been an incredibly fun project, and honestly, that it is
working this well has surprised me. GraalVM is certainly an existence
proof for the idea, but it also has engineer-decades (perhaps
centuries?) of work poured into it; it was a calculated risk I took to
follow my intuitive sense that one could do it more simply by focusing
on the AOT-only case and by leveraging the simplicity and full
encapsulation of Wasm as a basis for program transforms. I do believe
that weval was the shortest path I could have taken to produce an
ahead-of-time compiler (that is, no warmup or runtime codegen) for
JavaScript. This is largely for the reason that it let me factor out
the &quot;semantics of the bytecode execution&quot; problem, i.e., writing the
PBL interpreter, from the code generation problem; get the former
right with a native debugger and native test workflow; and then do a
principled transform to get a compiler out of it.&lt;&#x2F;p&gt;
&lt;p&gt;This is the big-picture story I want to tell as well: we &lt;em&gt;can&lt;&#x2F;em&gt; &quot;factor
complexity&quot; by replacing manual implementations of large, complex
systems with automatic derivations in some cases. Likely it won&#x27;t ever
be completely as good as the hand-written system; but it might get
close, if one can find the right abstractions to express all the right
optimizations. Profile-guided inlining of ICs to take AOT-compiled
(weval&#x27;d) JS code to the next performance level is another good
example, and in fact it was inspired by WarpMonkey, which took the same
&quot;principled derivation from one single source of truth&quot; approach to
building a higher-performance tier. There is a lot more here to do.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to Luke Wagner, Nick Fitzgerald, and Max Bernstein for reading
and providing feedback on a draft of this post!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;In reality, native baseline compilers often keep a &quot;virtual
stack&quot; that allows avoiding some push and pop operations --
basically by keeping track of which registers represent the
values on the top of the stack whose pushes have been deferred,
and using them directly rather than popping when possible,
forcing a deferred &quot;synchronization&quot; only when the full stack
contents have to be reified for, say, a call or a GC safepoint.
Both SpiderMonkey&#x27;s baseline compiler and Wasmtime&#x27;s Winch
implement this optimization, which can be seen as a simple form
of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Abstract_interpretation&quot;&gt;abstract
interpretation&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;For more on this topic, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.cs.cmu.edu&#x2F;afs&#x2F;cs&#x2F;academic&#x2F;class&#x2F;15745-s13&#x2F;public&#x2F;lectures&#x2F;L5-Foundations-of-Dataflow-1up.pdf&quot;&gt;these
slides&lt;&#x2F;a&gt;
are a good introduction. The online book &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cs.au.dk&#x2F;~amoeller&#x2F;spa&#x2F;spa.pdf&quot;&gt;Static Program
Analysis&lt;&#x2F;a&gt; by Møller and
Schwartzbach is also an invaluable resource, as well as the
classic &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Compilers:_Principles,_Techniques,_and_Tools&quot;&gt;Dragon
Book&lt;&#x2F;a&gt;
(Aho, Lam, Sethi, Ullman).&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;&quot;Simple&quot;, right?! The real implementation is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;weval&#x2F;blob&#x2F;bb253a398f0fd622f64af69abaa9e39e54f56769&#x2F;src&#x2F;eval.rs&quot;&gt;about 2400 lines
of
code&lt;&#x2F;a&gt;,
which is actually not terrible for something this powerful,
IMHO. I&#x27;m personally shocked how &quot;small&quot; weval turned out to be.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Compilation of JavaScript to Wasm, Part 2: Ahead-of-Time vs. JIT</title>
        <published>2024-08-27T00:00:00+00:00</published>
        <updated>2024-08-27T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2024/08/27/aot-js/"/>
        <id>https://cfallin.org/blog/2024/08/27/aot-js/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2024/08/27/aot-js/">&lt;p&gt;&lt;em&gt;This is a continuation of my &quot;fast JS on Wasm&quot; series; the &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;first
post&lt;&#x2F;a&gt; covered PBL, a portable
interpreter that supports inline caches, this post adds ahead-of-time
compilation, and the final post will discuss the details of that
ahead-of-time compilation. Please read the first post first for useful
context!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;The most popular programming language in the world, by a wide margin
-- thanks to the ubiquity of the web -- is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;JavaScript&quot;&gt;JavaScript&lt;&#x2F;a&gt; (or, if you
prefer to follow international standards, ECMAScript per
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ecma-international.org&#x2F;publications-and-standards&#x2F;standards&#x2F;ecma-262&#x2F;&quot;&gt;ECMA-262&lt;&#x2F;a&gt;). For
a computing platform to be relevant to many modern kinds of
applications, it should run JavaScript.&lt;&#x2F;p&gt;
&lt;p&gt;For the past four years or so, I have been working on
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;webassembly.org&#x2F;&quot;&gt;WebAssembly&lt;&#x2F;a&gt; (Wasm) tooling and platforms,
and in particular, running Wasm &quot;outside the browser&quot; (where it was
born), using it for strong
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sandbox_(computer_security)&quot;&gt;sandboxing&lt;&#x2F;a&gt;
of untrusted server-side code instead.&lt;&#x2F;p&gt;
&lt;p&gt;This blog post will describe my work, over the past 1.5 years, to
build an &lt;em&gt;ahead-of-time compiler&lt;&#x2F;em&gt; from JavaScript to WebAssembly
bytecode, achieving a 3-5x speedup. The work is now technically
largely complete: it has been integrated into the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.bytecodealliance.org&#x2F;&quot;&gt;Bytecode
Alliance&lt;&#x2F;a&gt;&#x27;s version of SpiderMonkey
(to be eventually upstreamed, ideally), then our shared
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;starlingmonkey&quot;&gt;StarlingMonkey&lt;&#x2F;a&gt;
JS-on-Wasm runtime on top of that (available with the &lt;code&gt;--aot&lt;&#x2F;code&gt; flag to
the &lt;code&gt;componentize.sh&lt;&#x2F;code&gt; toplevel build command), then my employer&#x27;s JS
SDK built on top of StarlingMonkey. It passes all &quot;JIT tests&quot; and &quot;JS
tests&quot; in the SpiderMonkey tree, and all Web Platform Tests at the runtime
level. It&#x27;s still in &quot;beta&quot; and considered experimental, but now is a good time
to do a deep-dive into how it works!&lt;&#x2F;p&gt;
&lt;p&gt;This JavaScript AOT compilation approach is built on top of my
&lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;Portable Baseline Interpreter&lt;&#x2F;a&gt;
work in SpiderMonkey, combined with my
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;weval&quot;&gt;weval&lt;&#x2F;a&gt; partial program evaluator to
provide compilation from an interpreter body &quot;for free&quot; (using a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Partial_evaluation#Futamura_projections&quot;&gt;Futamura
projection&lt;&#x2F;a&gt;). I
have recently &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=_T3s6-C38JI&quot;&gt;given a
talk&lt;&#x2F;a&gt; about this work. I
hope to go into a bit more depth in this post and a followup one;
first, how one can compile JavaScript ahead-of-time at all, and to
come later, how weval enables easier construction of the necessary
compiler backends.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;background&quot;&gt;Background&lt;&#x2F;h2&gt;
&lt;p&gt;How do we run JavaScript (JS) on a platform that supports WebAssembly?
At first, this question was needless, because Wasm originated within
existing JS engines as a way to provide lower-level, strongly-typed
code directly to the engine&#x27;s compiler backend: so one could run JS
code alongside any Wasm modules and they could interact at a
function-call level. But on Wasm-first platforms, &quot;outside the
browser&quot; as we say, we have no system-level JS engine; we can only run
Wasm modules that have been uploaded by the user, which interact with
the platform via direct &quot;hostcalls&quot; available as Wasm imports.&lt;&#x2F;p&gt;
&lt;p&gt;To quote my &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;earlier blog post&lt;&#x2F;a&gt; on
this topic, by adopting this restriction, we obtain a number of
advantages:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;Running an entire JavaScript engine &lt;em&gt;inside&lt;&#x2F;em&gt; of a Wasm module may
seem like a strange approach at first, but it serves real
use-cases. There are platforms that accept WebAssembly-sandboxed
code for security reasons, as it ensures complete memory isolation
between requests while remaining very fine-grained (hence with lower
overheads). In such an environment, JavaScript code needs to bring
its own engine, because no platform-native JS engine is
provided. This approach ensures a sandbox &lt;em&gt;without trusting the
JavaScript engine&#x27;s security&lt;&#x2F;em&gt; -- because the JS engine is just
another application on the hardened Wasm platform -- and carries
other benefits too: for example, the JS code can interact with other
languages that compile to Wasm easily, and we can leverage Wasm&#x27;s
determinism and modularity to snapshot execution and then perform
extremely fast cold startup.&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;So we have very fine-grained isolation and security, we eliminate JIT
bugs altogether (the most productive source of CVEs in production
browsers today!), and the modularity enables interesting new system
design points.&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, the state of the art for bundling JS &quot;with its own
engine&quot; today is, still, to combine an &lt;em&gt;interpreter&lt;&#x2F;em&gt; with a bytecode
representation of the JS, without any compilation of the JS to
specialized &quot;native&quot; (or in this case Wasm) code. There are two
reasons. The first and most straightforward one is that Wasm is simply
a new platform; JS engines&#x27; interpreters can be ported relatively
straightforwardly, because they have little dependence on the
underlying instruction set architecture&#x2F;platform, but JIT compilers
very much do.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;compiling-js-why-is-it-hard&quot;&gt;Compiling JS: Why is it Hard?&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s worth reviewing first why this is a hard problem. It&#x27;s possible
to compile quite a few languages to a Wasm target today: C or C++
(with &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;webassembly&#x2F;wasi-sdk&quot;&gt;wasi-sdk&lt;&#x2F;a&gt; or
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;emscripten.org&#x2F;&quot;&gt;Emscripten&lt;&#x2F;a&gt;, for example), Rust (with its
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.rust-lang.org&#x2F;what&#x2F;wasm&quot;&gt;first-class Wasm support&lt;&#x2F;a&gt;),
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;swiftwasm.org&#x2F;&quot;&gt;Swift&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;kotlinlang.org&#x2F;docs&#x2F;wasm-overview.html&quot;&gt;Kotlin&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dev.virtuslab.com&#x2F;p&#x2F;scala-to-webassembly-how-and-why&quot;&gt;Scala&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ghc.gitlab.haskell.org&#x2F;ghc&#x2F;doc&#x2F;users_guide&#x2F;wasm.html&quot;&gt;Haskell&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;ocaml-wasm&#x2F;wasm_of_ocaml&quot;&gt;OCaml&lt;&#x2F;a&gt;, and probably
many more. What makes JavaScript different?&lt;&#x2F;p&gt;
&lt;p&gt;The main difficulty comes from JS&#x27;s &lt;em&gt;dynamic typing&lt;&#x2F;em&gt;: because
variables are not annotated or constrained to a single primitive type
or object class, a simple expression like &lt;code&gt;x + y&lt;&#x2F;code&gt; could mean many
different things. This could be an integer or floating-point numeric
addition, a string concatenation, or an invocation of arbitrary JS
code (via &lt;code&gt;.toString()&lt;&#x2F;code&gt;, &lt;code&gt;.valueOf()&lt;&#x2F;code&gt;, proxy objects, or probably
other tricks too). A naive compiler would generate a large amount of
code for this simple expression that performs type checks and
dispatches to one of these different cases. For both runtime
performance reasons (checking for all cases is quite slow) and
code-size reasons (can we afford hundreds or thousands of Wasm
instructions for each JS operator?), this is impractical. We will need
to somehow adapt the techniques that modern JS engines have invented
to the Wasm platform.&lt;&#x2F;p&gt;
&lt;p&gt;This leads us directly to the engineering marvel
that is the modern JIT (just-in-time) compiler for JavaScript. These
engines, such as &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;spidermonkey.dev&#x2F;&quot;&gt;SpiderMonkey&lt;&#x2F;a&gt; (part of
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;firefox.org&#x2F;&quot;&gt;Firefox&lt;&#x2F;a&gt;), &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;v8.dev&#x2F;&quot;&gt;V8&lt;&#x2F;a&gt; (part of
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.chromium.org&#x2F;Home&#x2F;&quot;&gt;Chromium&lt;&#x2F;a&gt;), and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;WebKit&#x2F;WebKit&#x2F;tree&#x2F;main&#x2F;Source&#x2F;JavaScriptCore&quot;&gt;JavaScriptCore&lt;&#x2F;a&gt;
(part of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;webkit.org&#x2F;&quot;&gt;WebKit&lt;&#x2F;a&gt;), generate &lt;em&gt;native machine
code&lt;&#x2F;em&gt; for JavaScript code. They work around the above difficulty by
generating code only for the cases that &lt;em&gt;actually occur&lt;&#x2F;em&gt;, specializing
based on runtime observations.&lt;&#x2F;p&gt;
&lt;p&gt;These JITs work fantastically well today on native platforms, but
there are technical reasons why a Wasm-based platform is a poor fit
for current JS engines&#x27; designs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;unique-characteristics-of-a-wasm-platform&quot;&gt;Unique Characteristics of a Wasm Platform&lt;&#x2F;h2&gt;
&lt;p&gt;A typical &quot;Wasm-first&quot; platform -- whether that be
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;wasmtime.dev&#x2F;&quot;&gt;Wasmtime&lt;&#x2F;a&gt; running server-side, serving
requests with untrusted handler code, or a sandboxed plugin interface
inside a desktop application, or an embedded system -- has a few key
characteristics that distinguish it from a typical native OS
environment (such as that seen by a Linux x86-64 process).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;no-runtime-codegen&quot;&gt;No Runtime Codegen&lt;&#x2F;h3&gt;
&lt;p&gt;First, these platforms typically &lt;em&gt;lack any dynamic&#x2F;runtime
code-generation mechanism&lt;&#x2F;em&gt;: that is, unlike a native process&#x27; ability
to JIT-compile new machine code and run it, a Wasm module has no
interface by which it can add new code at runtime. In other words,
code-loading functionality lives &lt;em&gt;outside the sandbox&lt;&#x2F;em&gt;, typically handled by
&quot;deployment&quot; or &quot;control plane&quot; machinery on the platform.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;
This allows greater management flexibility for the platform: for
example, it allows deployment of pre-compiled machine code for all
Wasm modules from a central control plane, allowing &quot;instant start&quot;
when a module is instantiated to serve a request. It also brings
significant security benefits: all code that is executing is known
a-priori, which makes security exploits harder to hide.&lt;&#x2F;p&gt;
&lt;p&gt;The major downside of this choice, of course, is that one cannot
implement a JIT in the traditional way: one cannot generate new code
based on observed behavior at runtime, because there is simply no
mechanism to invoke it.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;protected-call-stack-and-no-primitive-control-flow&quot;&gt;Protected Call Stack and No Primitive Control Flow&lt;&#x2F;h3&gt;
&lt;p&gt;Second, Wasm itself has some interesting divergences from conventional
instruction-set architecture (ISA) design. While a typical ISA on a
conventional CPU provides &quot;raw&quot; branch and call instructions that
transfer control to addresses in the main memory address space,
WebAssembly has first-class abstractions for modules and functions
within those modules, and all of its inter-function transfer
instructions (calls and returns) target known function entry points by
function index and maintain a protected (unmodifiable) call-stack. The
main advantage of this design choice is that a &quot;trusted stack&quot; allows
for function-level interop between different, mutually untrusting,
modules in a Wasm VM, potentially written in different languages; and
when Wasm-level features such as exceptions are implemented and used
widely, seamless interop of those features as well. In essence, it is
an enforced ABI. (My colleague Dan Gohman has some interesting
thoughts about this in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=97jw9v2dRaw&amp;amp;t=13m0s&quot;&gt;this talk, at 13 minutes
in&lt;&#x2F;a&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;This is great for language interop and for security, but it rules out
a number of interesting control-flow patterns that language runtimes
use to implement features such as exceptions, generators&#x2F;coroutines,
pluggable &quot;stubs&quot; of optimized code, patchable jump-points, and
dynamic transfer between different optimization levels of the same
code (&quot;on-stack replacement&quot;). In other words, it conflicts with how
language runtimes want to implement &lt;em&gt;their&lt;&#x2F;em&gt; ABIs.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;per-request-isolation-and-lack-of-warmup&quot;&gt;Per-Request Isolation and Lack of Warmup&lt;&#x2F;h3&gt;
&lt;p&gt;Third, and specific to some of the Wasm-first platforms we care about,
Wasm&#x27;s &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;wasmtime-10-performance&quot;&gt;&lt;em&gt;fast
instantiation&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;
allows for some very fine-grained sandboxing approaches. In
particular, when a new Wasm instance takes only a few microseconds to
start from a snapshot, it becomes feasible to treat instances as
disposable, and use them for small units of work. In the platform I
work on, we have an &lt;em&gt;instance-per-request isolation model&lt;&#x2F;em&gt;: each Wasm
instance serves only one HTTP request, and then goes away. This is
fantastic for security and bug mitigation: the blast radius of an
exploit or guest-runtime bug is only a single request, and can never
see the data from other users of the platform or even other requests
by the same user.&lt;&#x2F;p&gt;
&lt;p&gt;This fine-grained isolation is, again, great for security and
robustness, but throws a wrench into any language runtime&#x27;s plan that
begins with &quot;observe the program for a while and...&quot;. In other words,
we cannot implement a scheme for a dynamically-typed language that
requires us to specialize based on observation (of types, or dynamic
dispatch targets, or object schemas, or the like) because by the time
we make those observations for one request, our instance is disposed
and the next request starts &quot;fresh&quot; from the original snapshot.&lt;&#x2F;p&gt;
&lt;p&gt;What are we to do?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;making-js-fast-specialized-runtime-codegen&quot;&gt;Making JS Fast: Specialized Runtime Codegen&lt;&#x2F;h2&gt;
&lt;p&gt;Let&#x27;s first recap how JavaScript engines&#x27; JITs typically work -- at a
high level, &lt;em&gt;how&lt;&#x2F;em&gt; they observe program behavior, and compile the JS
into machine code based on that behavior.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;inline-caches&quot;&gt;Inline Caches&lt;&#x2F;h3&gt;
&lt;p&gt;As I described in more detail in my &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;earlier post about
PBL&lt;&#x2F;a&gt;, the key technique that all
modern JS engines use to accelerate a dynamically-typed language is to
observe execution -- with particular care toward actual types (&lt;code&gt;var x&lt;&#x2F;code&gt;
is always a string or an integer, for example), object shapes (&lt;code&gt;x&lt;&#x2F;code&gt;
always has properties &lt;code&gt;x.a&lt;&#x2F;code&gt;, &lt;code&gt;x.b&lt;&#x2F;code&gt; and &lt;code&gt;x.c&lt;&#x2F;code&gt; and we can store them in
memory in that order), or other specialized cases (&lt;code&gt;x.length&lt;&#x2F;code&gt; always
accesses the length slot of an array object) -- and then generate
specialized code for the actually-observed cases. For example, if we
observe &lt;code&gt;x&lt;&#x2F;code&gt; is always an array, we can compile &lt;code&gt;x.length&lt;&#x2F;code&gt; to a single
native load instruction of the array&#x27;s length in its object header; we
don&#x27;t need to handle cases where &lt;code&gt;x&lt;&#x2F;code&gt; is some other type. This
specialized codegen is necessarily at runtime, hence the name for the
technique, &quot;just-in-time (JIT) compilation&quot;: we generate the
specialized code while the program is running, just before it is
executed.&lt;&#x2F;p&gt;
&lt;p&gt;One could imagine building some ad-hoc framework to collect a lot of
observations (&quot;type feedback&quot;) and then direct the compiler
appropriately, and in fact you can get pretty far building a JIT this
way. However, SpiderMonkey uses a slightly more principled approach:
it uses &lt;em&gt;inline caches&lt;&#x2F;em&gt; for all type-feedback and other runtime
observations, and encodes the observed cases in a specialized compiler
IR, called &lt;em&gt;CacheIR&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The basic idea is: at every point in the program where there could be
some specialized behavior based on what we observe at runtime, we have
an &quot;IC site&quot;. This is a dynamic dispatch point: it invokes some
&lt;em&gt;stub&lt;&#x2F;em&gt;, or sequence of code, that has been &quot;attached&quot; to the IC site,
and we can attach new stubs as we execute the code. We always start
with a &quot;fallback stub&quot; that handles every case generically, but we can
emit new stubs as we learn. There is a good example of IC-based
specialization in my &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;#systematic-fast-paths-inline-caches&quot;&gt;earlier
post&lt;&#x2F;a&gt;,
summarized with this figure:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-ics-web.svg&quot; alt=&quot;Figure: Inline-cache stubs in a JavaScript function&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
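&lt;p&gt;The attachment mechanism above can be sketched in JavaScript itself (names like &lt;code&gt;makeICSite&lt;&#x2F;code&gt; and &lt;code&gt;fallbackStub&lt;&#x2F;code&gt; are illustrative, not SpiderMonkey&#x27;s actual API): an IC site is just a mutable slot holding the currently-attached stub, and the fallback stub &quot;attaches&quot; a specialized stub for the case it actually observes.&lt;&#x2F;p&gt;

```javascript
// Hypothetical sketch of an IC site, modeled in JavaScript itself; the
// names here are illustrative, not SpiderMonkey's real internals.
function makeICSite() {
  const site = {};
  // Specialized stub: valid only when the receiver is an array.
  function arrayLengthStub(x) {
    if (!Array.isArray(x)) return fallbackStub(x); // guard failed: re-dispatch
    return x.length;                               // single fast load
  }
  // Fallback stub: handles every case generically, and "attaches" a
  // specialized stub for the case it actually observes.
  function fallbackStub(x) {
    if (Array.isArray(x)) {
      site.stub = arrayLengthStub; // attach by writing a function pointer
      return x.length;
    }
    return x.length; // generic slow path for strings, objects, etc. (elided)
  }
  site.stub = fallbackStub;
  return site;
}

const site = makeICSite();
site.stub([1, 2, 3]);                 // first hit goes through the fallback...
console.log(site.stub.name);          // → "arrayLengthStub" (now specialized)
console.log(site.stub([1, 2, 3, 4])); // → 4, via the fast path
```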
&lt;h3 id=&quot;compilation-phasing-for-ics&quot;&gt;Compilation Phasing for ICs&lt;&#x2F;h3&gt;
&lt;p&gt;So the question is now: how do we adapt a system that fundamentally
&lt;em&gt;generates new code at runtime&lt;&#x2F;em&gt; -- in this case, in the form of IC
stub sequences, which encode observed special cases (&quot;if conditions X
and Y, do Z&quot;) -- to a Wasm-based platform that restricts the program
to its existing, static code?&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s ask a basic question: are IC bodies really always completely
novel, or do we see the same sequences over and over? Intuitively, one
would expect a few common sequences to dominate most programs. For
example, the inline cache sequence for a common object property access
is a few CacheIR opcodes (simplifying a bit, but not too much, from
the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;e51b2630b44a8836a7ff35a876a2d8b555041d4a&#x2F;js&#x2F;src&#x2F;jit&#x2F;CacheIR.cpp#4537&quot;&gt;actual
code&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Check that the receiver (&lt;code&gt;x&lt;&#x2F;code&gt; in &lt;code&gt;x.a&lt;&#x2F;code&gt;) is an object;&lt;&#x2F;li&gt;
&lt;li&gt;Check that its &quot;shape&quot; (mapping from property names to slots) is the
same;&lt;&#x2F;li&gt;
&lt;li&gt;Load or store the appropriate slot.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
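&lt;p&gt;A minimal sketch of those three steps as a plain function (the &lt;code&gt;__shape&lt;&#x2F;code&gt; and &lt;code&gt;__slots&lt;&#x2F;code&gt; fields are hypothetical stand-ins for the engine&#x27;s internal object layout):&lt;&#x2F;p&gt;

```javascript
// Sketch of the three CacheIR steps above; __shape and __slots are
// hypothetical stand-ins for the engine's internal object layout.
const BAIL = Symbol("bail-to-fallback"); // signal: fall back to the generic stub

function makeGetPropStub(expectedShape, slot) {
  return function (x) {
    if (typeof x !== "object" || x === null) return BAIL; // 1. receiver is an object?
    if (x.__shape !== expectedShape) return BAIL;         // 2. shape still matches?
    return x.__slots[slot];                               // 3. load the slot
  };
}

const shapeAB = { props: ["a", "b"] }; // shared by every {a, b}-shaped object
const getA = makeGetPropStub(shapeAB, 0);
console.log(getA({ __shape: shapeAB, __slots: [10, 20] })); // → 10
console.log(getA({ __shape: {}, __slots: [] }) === BAIL);   // → true (wrong shape)
```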
&lt;p&gt;One would expect nearly any real JavaScript program to use that IC
body at some point to set an object property. Can&#x27;t we include it in
the engine build itself, avoiding the need to compile it at runtime?&lt;&#x2F;p&gt;
&lt;p&gt;We could special-case &quot;built-in ICs&quot;, but there&#x27;s a more elegant
approach: retain the uniform CacheIR representation (we &lt;em&gt;always&lt;&#x2F;em&gt;
generate IR for cases we observe), but then look up the generated
CacheIR bytecode sequence in a table of &lt;em&gt;included precompiled ICs&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;How do we decide which ICs to include, and do we need to write them
out manually? Well, not necessarily: it turns out that computers are
&lt;em&gt;great&lt;&#x2F;em&gt; at automating boring tasks, and we could simply &lt;em&gt;collect all
ICs ever generated&lt;&#x2F;em&gt; during a bunch of JavaScript execution (say, the
entire testsuite for SpiderMonkey) and build them in. Lookup tables
are cheap, and ICs tend to be small in terms of compiled code size, so
there isn&#x27;t much downside to including a few thousand ICs
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;tree&#x2F;ff-127-0-2&#x2F;js&#x2F;src&#x2F;ics&#x2F;&quot;&gt;2367&lt;&#x2F;a&gt;
by latest count).&lt;&#x2F;p&gt;
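&lt;p&gt;A sketch of that lookup, assuming we key the table on the generated CacheIR byte sequence (the opcode encoding below is made up for illustration):&lt;&#x2F;p&gt;

```javascript
// Sketch of the precompiled-IC lookup table, keyed by the generated
// CacheIR byte sequence; the names and encoding here are hypothetical.
const precompiledICs = new Map(); // CacheIR bytes (as a string key) -> built-in stub

function registerIC(cacheIRBytes, compiledStub) {
  precompiledICs.set(cacheIRBytes.join(","), compiledStub);
}

function lookupIC(cacheIRBytes) {
  // At runtime we still *generate* CacheIR for the observed case, but
  // instead of compiling it, we look it up in the compiled-in corpus.
  return precompiledICs.get(cacheIRBytes.join(",")) ?? null;
}

// Suppose opcodes [1, 7, 3] mean "guard object, guard shape, load slot":
registerIC([1, 7, 3], function guardShapeLoadSlot() { /* built-in body */ });
console.log(lookupIC([1, 7, 3]) !== null); // → true: no runtime codegen needed
console.log(lookupIC([9, 9]));             // → null: would error in enforcing mode
```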
&lt;p&gt;This is the approach I took with &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;pull&#x2F;45&quot;&gt;AOT
ICs&lt;&#x2F;a&gt;: I built
infrastructure to compile ICs into the engine, loading them into the
lookup table (pre-existing to deduplicate compiled IC bodies) at
startup, and built an &lt;em&gt;enforcing mode&lt;&#x2F;em&gt; to allow for gathering the
corpus.&lt;&#x2F;p&gt;
&lt;p&gt;The latter part -- collecting the corpus, and testing that the corpus
is complete when running the whole testsuite -- is key, because it
answers the social&#x2F;software-engineering question of how to keep the
corpus up-to-date. The idea is to make updating it as easy as
possible. When the testsuite runs &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;blob&#x2F;dc07fd26d11b4ccfbc82b35c4721f30696fe098d&#x2F;.github&#x2F;workflows&#x2F;main.yml#L15&quot;&gt;in
CI&lt;&#x2F;a&gt;,
we test in &quot;AOT ICs&quot; mode that errors out if an IC is generated that
is not already in the compiled-in corpus. However, this failure also
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;blob&#x2F;dc07fd26d11b4ccfbc82b35c4721f30696fe098d&#x2F;js&#x2F;src&#x2F;jit&#x2F;BaselineCacheIRCompiler.cpp#L2537-L2568&quot;&gt;dumps a new IC
file&lt;&#x2F;a&gt;
and provides a message with instructions: move this file into
&lt;code&gt;js&#x2F;src&#x2F;ics&#x2F;&lt;&#x2F;code&gt; and rebuild, and the observed-but-not-collected IC
becomes part of the corpus. The goal is to catch the case where a
developer adds a new IC path and make the corresponding corpus
addition as painless as possible.&lt;&#x2F;p&gt;
&lt;p&gt;Another point worth noting is that while there is no guarantee that
the engine won&#x27;t generate novel IC bodies during execution of some
program -- for example, during accesses to an object with a very long
prototype chain that results in a long sequence of object-prototype
checks -- by integrating the corpus check with the testsuite
execution, there are &lt;em&gt;aligned incentives&lt;&#x2F;em&gt;. Anyone adding new
IC-accelerated functionality should add a test case that covers it;
and when we execute this testcase, we will check whether the corpus
includes the relevant IC(s) or not, too.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-about-js-bytecode&quot;&gt;What About JS Bytecode?&lt;&#x2F;h2&gt;
&lt;p&gt;We now have a corpus of IC bodies that cover all the relevant cases
for, say, the &lt;code&gt;+&lt;&#x2F;code&gt; operator in JavaScript, but we still haven&#x27;t
addressed the actual &lt;em&gt;compilation of JavaScript source&lt;&#x2F;em&gt;. Fortunately,
this is now the easiest part: we do a mostly 1-to-1 translation of JS
source to code that consists of a series of IC sites, invoking the
current IC-stub pointer for each operator instance in turn. The
dataflow (connectivity between the operators) and control flow
(conditionals and loops) become the compiled code &quot;skeleton&quot; around
these IC sites. So, for example, the JS function&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;javascript&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;function&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; f&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-parameters&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt;x&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt; y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-parameters z-punctuation&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  if&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; (y&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; &amp;gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; x&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; +&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;  }&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt; else&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; x&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;might become something like the following pseudo-C code sketch:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;c&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Value &lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;f&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt;Value x&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-parameter&quot;&gt; Value y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Value t1 &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; ics&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;](&lt;&#x2F;span&gt;&lt;span&gt;y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; `&amp;gt;=` operator&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  bool&lt;&#x2F;span&gt;&lt;span&gt; t2 &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; ics&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;1&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;](&lt;&#x2F;span&gt;&lt;span&gt;t1&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;     &#x2F;&#x2F; Value-to-bool coercion&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  if&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; (&lt;&#x2F;span&gt;&lt;span&gt;t2&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Value t3 &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; ics&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;2&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;](&lt;&#x2F;span&gt;&lt;span&gt;x&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span&gt; y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;&#x2F; `+` operator&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span&gt; t3&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;  }&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword&quot;&gt; else&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Value t4 &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span&gt; ics&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;3&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-punctuation z-section&quot;&gt;](&lt;&#x2F;span&gt;&lt;span&gt;x&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span&gt; y&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;&#x2F; `-` operator&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;    return&lt;&#x2F;span&gt;&lt;span&gt; t4&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Given this 1-to-1 compilation of all JS functions in the JS source,
and a corpus of precompiled ICs, we have a full &lt;em&gt;ahead-of-time
compiler&lt;&#x2F;em&gt; for JavaScript.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;ahead-of-time-aot-compilation-of-ic-based-js&quot;&gt;Ahead-of-Time (AOT) Compilation of IC-Based JS&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s worth highlighting this fact again: we have built an
&lt;em&gt;ahead-of-time compiler&lt;&#x2F;em&gt; for JavaScript. How did this happen? Isn&#x27;t
runtime code generation necessary for efficient compilation of a
dynamically-typed language?&lt;&#x2F;p&gt;
&lt;p&gt;The key insight here is that the dynamism has been &lt;em&gt;pushed to runtime
late-binding&lt;&#x2F;em&gt; in the form of indirect calls to IC bodies. In other
words: behavior may vary widely depending on types; but rather than
building in all the options statically, we have &quot;attachment points&quot; at
each relevant program point, and we can dynamically insert behaviors by
&lt;em&gt;setting a function pointer&lt;&#x2F;em&gt; rather than generating new code. We&#x27;re
patching together fragments of static code by updating dynamic data
instead.&lt;&#x2F;p&gt;
&lt;p&gt;This execution model is known in SpiderMonkey as &lt;em&gt;baseline
compilation&lt;&#x2F;em&gt;. The main goal of the baseline compiler in SpiderMonkey
is to generate code as quickly as possible, which is a distinct goal
from generating code without runtime observations about that code; but
these goals overlap, as both are relevant at lower tiers of the
compiler hierarchy. In any case, a key fact about
baseline compilation is: it does not require any type feedback, and
can be done with &lt;em&gt;only&lt;&#x2F;em&gt; the JS source and nothing else. It admits
fully AOT compilation.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;going-further-bringing-in-runtime-feedback&quot;&gt;Going Further: Bringing In Runtime Feedback?&lt;&#x2F;h3&gt;
&lt;p&gt;Baseline compilation works well enough, but this runtime binding has one
key limitation: it means that the compiler cannot optimize &lt;em&gt;combinations
of ICs together&lt;&#x2F;em&gt;. For example, if we have a series of IC stubs for each
operator in a function, we cannot &quot;blend&quot; these IC bodies into one large
function body and allow optimizations to take hold; we cannot propagate
knowledge across ICs. One &lt;code&gt;+&lt;&#x2F;code&gt; operator may produce an integer, and
within the IC stub for that case, we can make use of the known result
type; but the next &lt;code&gt;+&lt;&#x2F;code&gt; operator to consume that value has to do its
type-checks over again. This is a fundamental fact of the &quot;late
binding&quot;: we&#x27;ve pushed composition of behaviors to runtime via the
dynamic function-pointer &quot;attachment&quot;, and we don&#x27;t have a compiler at
runtime, so there is nothing to be done.&lt;&#x2F;p&gt;
&lt;p&gt;If we want to go further, though, we can learn one more lesson from
SpiderMonkey&#x27;s design: baseline compilation with ICs is a solid
&lt;em&gt;framework&lt;&#x2F;em&gt; for re-admitting this kind of whole-function optimization
with types. Specifically, SpiderMonkey&#x27;s
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2020&#x2F;11&#x2F;warp-improved-js-performance-in-firefox-83&#x2F;&quot;&gt;WarpMonkey&lt;&#x2F;a&gt;
optimizing compiler tier works by allowing baseline code to &quot;warm up&quot;
its ICs, collecting the most relevant &#x2F; most common special cases;
then it &lt;em&gt;inlines&lt;&#x2F;em&gt; those ICs at the call sites. And that&#x27;s it!&lt;&#x2F;p&gt;
&lt;p&gt;Said a different way: putting all type feedback into a &quot;call
specialized code stubs&quot; framework reduces type-feedback compilation to
an inlining problem. Compiler engineers know how to build inliners;
that&#x27;s far easier than an ad-hoc optimizing JIT tier.&lt;&#x2F;p&gt;
&lt;p&gt;This leads to a natural way we could build a further-optimizing tier
on top of our initial AOT JS compiler: build a &quot;profile-guided
inliner&quot;. In fact, this has been done: my colleague &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fitzgen&quot;&gt;Nick
Fitzgerald&lt;&#x2F;a&gt; built a prototype,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fitzgen&#x2F;winliner&quot;&gt;Winliner&lt;&#x2F;a&gt;, that profiles Wasm
indirect calls and inlines the most frequent targets. The idea in the
context of JS is to record the most frequent call target &lt;code&gt;T&lt;&#x2F;code&gt; at each
IC dispatch site, then replace the indirect-call with &lt;code&gt;if target == T { &#x2F;* inlined code *&#x2F; } else { call target }&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
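&lt;p&gt;Sketched in JavaScript rather than Wasm (with a hypothetical hot target &lt;code&gt;addInt32&lt;&#x2F;code&gt; standing in for &lt;code&gt;T&lt;&#x2F;code&gt;), the transform looks like this:&lt;&#x2F;p&gt;

```javascript
// Sketch of the profile-guided-inlining transform, in JavaScript rather
// than Wasm: the indirect call through an IC slot becomes a guarded
// direct call. `addInt32` stands in for the recorded hot target T.

function addInt32(a, b) { return (a + b) | 0; } // the hot IC body, T

// Before: always an indirect call through the IC table.
function icSiteBefore(ics, a, b) {
  return ics[0](a, b);
}

// After: guard on the recorded target and inline its body; any other
// target still takes the original indirect call, preserving semantics.
function icSiteAfter(ics, a, b) {
  if (ics[0] === addInt32) {
    return (a + b) | 0;  // inlined body of T
  }
  return ics[0](a, b);   // cold path: unchanged indirect call
}

const ics = [addInt32];
console.log(icSiteBefore(ics, 2, 3)); // → 5
console.log(icSiteAfter(ics, 2, 3));  // → 5, now without the indirect call
```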
&lt;p&gt;The beauty of this approach is that it is &lt;em&gt;semantics-preserving&lt;&#x2F;em&gt;: the
transform itself does not know anything about JavaScript engines, and
the resulting code will behave exactly the same, so if we have gotten
our baseline-level AOT compiler to generate correct code, this
optimizing tier will, as well, without new bugs. Winliner and
profile-guided inlining let us &quot;level-up&quot; from baseline compilation to
optimized compilation in a way that is as close to &quot;for free&quot; as one
could hope for. The inlining tool and the baseline compiler can be
separately tested and verified; we can carefully reason about each
one, and convince ourselves they are correct; and we don&#x27;t need to
worry about bugs in the combination of the two pieces (where bugs most
often lurk) because by taking the Wasm ISA as the interface, they
compose correctly.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;results&quot;&gt;Results&lt;&#x2F;h2&gt;
&lt;p&gt;That&#x27;s all well and good; what are the results?&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;gecko-dev&#x2F;pull&#x2F;48&quot;&gt;final PR&lt;&#x2F;a&gt;
in which I introduced AOT compilation to our SpiderMonkey branch
quotes these numbers on the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;chromium.github.io&#x2F;octane&#x2F;&quot;&gt;Octane&lt;&#x2F;a&gt; benchmark suite (numbers
are rates, higher is better):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;SpiderMonkey running inside a Wasm module:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- generic interpreter (in production today) vs.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;- ahead-of-time compilation (this post)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;               interpreter        AOT compilation     Speedup&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Richards        166                 729               4.39x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;DeltaBlue       169                 686               4.06x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Crypto          412                1255               3.05x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;RayTrace        525                1315               2.50x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;EarleyBoyer     728                2561               3.52x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;RegExp          271                 461               1.70x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Splay          1262                3258               2.58x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;NavierStokes    656                2255               3.44x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;PdfJS          2182                5991               2.75x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Mandreel        166                 503               3.03x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Gameboy        1357                4659               3.43x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;CodeLoad      19417               17488               0.90x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Box2D           927                3745               4.04x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;----&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Geomean         821                2273               2.77x&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;or about a 2.77x geomean improvement overall, with a maximum of 4.39x
on one benchmark and most benchmarks around 2.5x-3.5x. Not bad! For
comparison, the native baseline compiler in SpiderMonkey obtains
around a 5x geomean -- so we have some room to grow still, but this is
a solid initial release. And of course, as noted above, by inlining
ICs then optimizing further, we should be able to incorporate
profile-guided feedback (as native JITs do) to obtain higher
performance in the future.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;other-approaches&quot;&gt;Other Approaches?&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s worth noting at this point that a few other JS compilers exist
that attempt to do &lt;em&gt;specialized&lt;&#x2F;em&gt; codegen without runtime type
observations or other profiling&#x2F;warmup. Manuel Serrano&#x27;s Hopc
compiler, described in his &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www-sop.inria.fr&#x2F;members&#x2F;Manuel.Serrano&#x2F;publi&#x2F;serrano-dls18.pdf&quot;&gt;DLS&#x27;18
paper&lt;&#x2F;a&gt;,
works by building &lt;em&gt;one&lt;&#x2F;em&gt; specialized version of each function alongside
the generic version, and uses a set of type-inference rules to pick
the most &lt;em&gt;likely&lt;&#x2F;em&gt; types. This is shown to work quite well in the
paper.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;porffor.dev&#x2F;&quot;&gt;porffor&lt;&#x2F;a&gt; compiler is similarly a fully
AOT JS compiler, and appears to use some inference to emit only the
necessary specialized cases where it can. That project is a
work-in-progress, currently supporting 39% of ECMA-262, but seems
promising.&lt;&#x2F;p&gt;
&lt;p&gt;The main downside to an inference-based approach (as opposed to the
dynamic indirection through ICs in our work) is the limit to how far
this inference can go: for example, SpiderMonkey&#x27;s ICs can specialize
property accesses on object shape, while Hopc does not infer object
shapes and in general such an analysis is hard (similar to global
pointer analysis).&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;d be remiss not to note a practical aspect too: a large body of
code embeds SpiderMonkey and is written against its APIs, and the
engine supports all of the latest JS proposals and is actively
maintained. The cost of reaching parity with this level of support in
a from-scratch AOT compiler was one of the main reasons I opted
instead to build on SpiderMonkey, adapting its ICs to work
ahead-of-time and compiling its bytecode.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;up-next-compiler-backends-for-free&quot;&gt;Up Next: Compiler Backends (For Free)&lt;&#x2F;h2&gt;
&lt;p&gt;Astute readers will note that while I stated that we &lt;em&gt;do&lt;&#x2F;em&gt; a
compilation from the JS source to Wasm bytecode, and from IC bodies in
the corpus to Wasm bytecode, I haven&#x27;t said &lt;em&gt;how&lt;&#x2F;em&gt; we do this
compilation. In my &lt;a href=&quot;&#x2F;blog&#x2F;2023&#x2F;10&#x2F;11&#x2F;spidermonkey-pbl&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt;
on Portable Baseline, I described implementing an &lt;em&gt;interpreter&lt;&#x2F;em&gt; for IC
bodies, and an interpreter for JS bytecode that can invoke these IC
bodies. Developing two compiler backends for these two kinds of
bytecode (CacheIR and JS) is quite the mountain to climb, and one
naturally wonders whether all the effort to design, build, and debug
the interpreters could be reused somehow.  If you do indeed wonder
this, then you&#x27;ll love the upcoming &lt;em&gt;part 3&lt;&#x2F;em&gt; of this series, where I
describe how we can derive these compiler backends &lt;em&gt;automatically&lt;&#x2F;em&gt;
from the interpreters, reducing maintenance burden and complexity
significantly. Stay tuned!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to Luke Wagner, Nick Fitzgerald, and Max Bernstein for reading
and providing feedback on a draft of this post!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;On the Web, a Wasm module can &quot;trampoline&quot; out to JavaScript to
load and instantiate another module with new code, sharing the
same heap, then invoke that new module via a shared
function-reference table. This is technically workable but
somewhat slow and cumbersome, and this mechanism does not exist
on Wasm-only platforms. Note that the primary reason is the
design &lt;em&gt;choice&lt;&#x2F;em&gt; to allow code-loading only via a control plane;
nothing &lt;em&gt;technically&lt;&#x2F;em&gt; stops the platform from providing a direct
Wasm hostcall for the same purpose.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Path Generics in Rust: A Sketch Proposal for Simplicity and Generality</title>
        <published>2024-06-12T00:00:00+00:00</published>
        <updated>2024-06-12T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2024/06/12/rust-path-generics/"/>
        <id>https://cfallin.org/blog/2024/06/12/rust-path-generics/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2024/06/12/rust-path-generics/">&lt;p&gt;The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.rust-lang.org&#x2F;&quot;&gt;Rust&lt;&#x2F;a&gt; programming language is
best-known for its memory-related type system features that encode
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;book&#x2F;ch04-00-understanding-ownership.html&quot;&gt;ownership and
borrowing&lt;&#x2F;a&gt;:
these ensure memory safety (no dangling pointers), and also enforce a
kind of &quot;mutual exclusion&quot; discipline that allows for provably safe
parallelism. It&#x27;s fantastic stuff; but it can also be utterly
maddening when one attempts to twist the borrow checker in a direction
it doesn&#x27;t want to go.&lt;&#x2F;p&gt;
&lt;p&gt;In a recent &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;smallcultfollowing.com&#x2F;babysteps&#x2F;blog&#x2F;2024&#x2F;06&#x2F;02&#x2F;the-borrow-checker-within&#x2F;&quot;&gt;blog
post&lt;&#x2F;a&gt;,
Niko Matsakis described an ambitious (but in my opinion very
achievable) vision for perfecting the &quot;borrow checker within&quot;: a
series of well-considered generalizations and features that make
Rust&#x27;s lifetime and borrowing system more ergonomic and a better fit
for a wider variety of usage patterns.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, my goal is to tie an analogy from the &quot;place-based
lifetimes&quot; in that post and an idea I proposed in 2020 called
&quot;deferred borrows&quot; (&lt;a href=&quot;&#x2F;pubs&#x2F;ecoop2020_defborrow.pdf&quot;&gt;paper&lt;&#x2F;a&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;). I&#x27;ll
describe how I see &lt;em&gt;path generics&lt;&#x2F;em&gt; (one might call them &quot;place
generics&quot; in analogy with place-based lifetimes) as a new idea in Rust
that could allow place-based lifetimes, but also when generalized
sufficiently, allow significant new expressivity in terms of &lt;em&gt;branded
types&lt;&#x2F;em&gt; -- that is, one value in one place (say, a handle) irrevocably
tied to another (say, a container). Hopefully all of this will become
clear soon.&lt;&#x2F;p&gt;
&lt;p&gt;Note also that this is very much &lt;em&gt;not&lt;&#x2F;em&gt; my day job: I work &lt;em&gt;in&lt;&#x2F;em&gt; Rust,
not &lt;em&gt;on&lt;&#x2F;em&gt; Rust -- so take these as the semi-amateur scribblings that
they are, for inspiration or discussion material at best; but I have
no plans at the moment to try to push this further, beyond writing up
the ideas as they exist in my head, with the hope they might be
interesting.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;recap-places-as-lifetimes&quot;&gt;Recap: Places as Lifetimes&lt;&#x2F;h2&gt;
&lt;p&gt;In Rust today, memory is managed primarily by an &quot;ownership tree&quot;
idiom built around RAII and linear types; some types -- structs and
standard-library containers -- can own other types in turn. Unique
ownership that is tied to program scopes with RAII is extremely useful
because, by itself, it guarantees memory safety: we always know when
to free memory, and there is no way to access it later because it is
no longer in scope. This scheme also allows for race-free parallelism
because subpieces of the program heap have completely distinct names
and no aliasing.&lt;&#x2F;p&gt;
&lt;p&gt;However, unique tree ownership is quite cumbersome on its own, hence
borrows: temporary loaning of a subtree (by reference) as its own
first-class value. To ensure this is safe, the compiler checks that we
don&#x27;t hold the pointer for too long -- for example, longer than the
lifetime of the original owning path. Rust reifies this concept at the
syntax level with a &lt;em&gt;lifetime&lt;&#x2F;em&gt; and allows naming explicit lifetimes at
the type level when describing borrows. Because ownership ultimately
traces back to RAII and local bindings (perhaps in &lt;code&gt;main()&lt;&#x2F;code&gt;, but
somewhere up the stack&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;), these lifetimes correspond to static
spans of code in some active stack frame.&lt;&#x2F;p&gt;
&lt;p&gt;This is all very abstract; the language has precise definitions, of
course, but the intuition can be difficult to internalize, especially
when lifetimes arise as lifetime parameters. For example, the function
signature&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; f&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;, &amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;b&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; &amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;) {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;defines a function that works in an abstract context where two
lifetimes -- two sets of code spans -- exist either directly in the
caller or further up the stack, and for all times when &lt;code&gt;&#x27;a&lt;&#x2F;code&gt; is valid,
&lt;code&gt;&#x27;b&lt;&#x2F;code&gt; is too (due to the &quot;outlives&quot; constraint introduced with
&lt;code&gt;:&lt;&#x2F;code&gt;). This is a useful context to establish -- for example, it may let
one store a reference to something in &lt;code&gt;&#x27;b&lt;&#x2F;code&gt; in a field of something in
&lt;code&gt;&#x27;a&lt;&#x2F;code&gt; -- but it&#x27;s &lt;em&gt;very hard&lt;&#x2F;em&gt; for one to grasp if one hasn&#x27;t seen this
concept before.&lt;&#x2F;p&gt;
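&lt;p&gt;A minimal sketch of what that established context permits (the names here are illustrative): the &lt;code&gt;&#x27;b: &#x27;a&lt;&#x2F;code&gt; bound is exactly what lets a &lt;code&gt;&#x27;b&lt;&#x2F;code&gt;-lived reference be written through an &lt;code&gt;&#x27;a&lt;&#x2F;code&gt;-lived slot:&lt;&#x2F;p&gt;

```rust
// The outlives bound 'b: 'a allows storing a &'b reference into a
// location that is only borrowed for the shorter lifetime 'a: the
// borrow checker knows the referent remains valid for all of 'a.
fn store<'a, 'b: 'a>(slot: &'a mut &'b u32, value: &'b u32) {
    *slot = value;
}

fn main() {
    let x = 10u32;
    let y = 20u32;
    let mut slot = &x;    // slot currently borrows x
    store(&mut slot, &y); // legal: y outlives the mutable borrow of slot
    assert_eq!(*slot, 20);
}
```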
&lt;p&gt;At the core of the intuitive difficulty is the fact that this is
another &lt;em&gt;kind&lt;&#x2F;em&gt; of program entity for the programmer to mentally
track. In scope-based resource-management paradigms (such as C++ or
Rust RAII) one is already somewhat accustomed to thinking of an
object&#x27;s lifetime, in some scope, conveying ownership. Lifetimes can
feel like some other semantic layer that is either redundantly
describing that structure, or perhaps coarsening&#x2F;summarizing it into
&quot;categories&quot; that are enough to prove safety to the borrow checker
somehow.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The idea conveyed in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;smallcultfollowing.com&#x2F;babysteps&#x2F;blog&#x2F;2024&#x2F;06&#x2F;02&#x2F;the-borrow-checker-within&#x2F;&quot;&gt;Niko Matsakis&#x27;
post&lt;&#x2F;a&gt;
-- noted as already existing under-the-hood in a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;rust-lang&#x2F;polonius&quot;&gt;new borrow
checker&lt;&#x2F;a&gt;, but surfaced as
syntax and type-system semantics in a new proposal -- is a wonderful
simplification: name &lt;em&gt;storage places&lt;&#x2F;em&gt;, which are already roots of the
ownership tree, as lifetimes as well. In essence, a borrow temporarily
transfers ownership; so why not name the location the ownership came
from, to allow checking compatibility of the lifetimes directly?&lt;&#x2F;p&gt;
&lt;p&gt;The blog post gives a simple example where a lifetime parameter is
needed today, but use of a place-name instead is more intuitive:
functions that return borrows to a piece of an argument. Rather than
writing&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; find_element&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-language&quot;&gt; self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; key&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; K&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt; &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a V&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where the &lt;code&gt;&#x27;a&lt;&#x2F;code&gt; exists only to tie &lt;code&gt;self&lt;&#x2F;code&gt; to the return value, we can
instead write&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; find_element&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; key&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; K&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt; &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;self V&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;with the same result. The post then expands further into
self-referential types, where one element of a struct may borrow from
another:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  text&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; String&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  pieces&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Vec&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;text &lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;str&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;which is an excellent improvement not just in ergonomics, as above,
but actual expressivity as well. A variant of this pattern can occur
when &lt;code&gt;text&lt;&#x2F;code&gt; is a local binding and we build a local index into it:
then given a local &lt;code&gt;let text: String = ...;&lt;&#x2F;code&gt; we may later have a type
such as &lt;code&gt;Vec&amp;lt;&amp;amp;&#x27;text str&amp;gt;&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;There is one remaining expressivity gap with the new kind of lifetime:
it must have a binding in-scope to use it as a lifetime. In other
words, one can say&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; process_whole_and_part&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;whole&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; part&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;whole T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;) {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;but if &lt;code&gt;whole&lt;&#x2F;code&gt; is not a parameter in that signature, or equivalently
not a sibling field in a struct with borrow fields, one still needs a
traditional lifetime parameter. As far as I can tell, the proposal
does not propose removing lifetime parameters entirely (nor would one
want to in a backwards-compatible evolution of the language); but
could we close the gap by introducing a little more syntax (thus in
turn creating a simpler and more uniform semantic landscape)?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;replacing-lifetimes-completely-path-parameters&quot;&gt;Replacing Lifetimes Completely: Path Parameters?&lt;&#x2F;h2&gt;
&lt;p&gt;Here&#x27;s the core of my proposal: allow &lt;em&gt;path parameters&lt;&#x2F;em&gt; alongside type
parameters and lifetime parameters for any generic type. This
generalizes the places-as-lifetimes proposal by allowing introductions
of abstract place names, and would look something like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; P&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  borrow&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;P u32&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;... which looks almost exactly like a lifetime parameter. So what has
changed exactly?&lt;&#x2F;p&gt;
&lt;p&gt;The key bit is that when we &lt;em&gt;use&lt;&#x2F;em&gt; this type elsewhere, it is tied to a
specific &lt;em&gt;path&lt;&#x2F;em&gt; (that is, a variable binding) rather than an abstract
lifetime. This is an important difference: it binds two values, the
borrow and the borrow-ee, together more tightly than a lifetime does.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#4&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;For example, where we use &lt;code&gt;S&lt;&#x2F;code&gt;, we might write (with full type
ascriptions here for clarity):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; foo&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; i&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; u32&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt; 0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; s&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;i&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; borrow&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;i&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; };&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;when the generic struct type is instantiated with the path parameter
&lt;code&gt;i&lt;&#x2F;code&gt;, we get a borrow of type &lt;code&gt;&amp;amp;&#x27;i u32&lt;&#x2F;code&gt;, just as we saw above. So far
so good.&lt;&#x2F;p&gt;
&lt;p&gt;We gain some nice clarity-of-intent when we use these path parameters
in context structs and the like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Ctx&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Data&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  part1&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Data T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  part2&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Data U&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; find_parts&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;data&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;])&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Ctx&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;data&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is fairly nice for self-documentation purposes, and is perhaps
more intuitive than a separate concept of lifetimes, but in idiomatic
Rust today we sometimes already see descriptive lifetime names --
&lt;code&gt;&#x27;ctx&lt;&#x2F;code&gt;, &lt;code&gt;&#x27;data&lt;&#x2F;code&gt;, &lt;code&gt;&#x27;input&lt;&#x2F;code&gt;, and the like. What actual new expressive
powers -- powers to describe invariants to the compiler and get better
safety checks -- does this grant us?&lt;&#x2F;p&gt;
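&lt;p&gt;For concreteness, the descriptive-lifetime idiom mentioned above looks like this in today&#x27;s Rust (a sketch with illustrative types):&lt;&#x2F;p&gt;

```rust
// Today's closest equivalent of Ctx<data>: a named lifetime 'data ties
// both fields to the same borrowed input, but to a *category* of
// borrows, not to one specific binding.
struct Ctx<'data> {
    part1: &'data [u8],
    part2: &'data [u8],
}

fn find_parts<'data>(data: &'data [u8]) -> Ctx<'data> {
    let mid = data.len() / 2;
    Ctx { part1: &data[..mid], part2: &data[mid..] }
}

fn main() {
    let data = [1u8, 2, 3, 4];
    let ctx = find_parts(&data);
    assert_eq!(ctx.part1, &[1, 2]);
    assert_eq!(ctx.part2, &[3, 4]);
}
```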
&lt;p&gt;The main &quot;new power&quot; we obtain is a means of &lt;em&gt;branding&lt;&#x2F;em&gt;, as we alluded
to above: &lt;code&gt;Ctx&amp;lt;data&amp;gt;&lt;&#x2F;code&gt; is truly tied to &lt;code&gt;data&lt;&#x2F;code&gt;, not anything with a
compatible lifetime. For memory safety this may not matter as such,
but the ability to tie a handle to a &quot;parent&quot; object is something that
has been discussed
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;internals.rust-lang.org&#x2F;t&#x2F;static-path-dependent-types-and-deferred-borrows&#x2F;14270&#x2F;20&quot;&gt;1&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;rust&#x2F;comments&#x2F;kq6sz3&#x2F;design_pattern_for_compiletime_tying_of_handles&#x2F;&quot;&gt;2&lt;&#x2F;a&gt;)
and seems generally useful as a matter of expressing API
invariants. Certainly whenever implementing an index-based (ECS-like)
system in Rust -- such as for general graphs in a compiler IR -- it
would be nice to express &lt;code&gt;Id&amp;lt;graph&amp;gt;&lt;&#x2F;code&gt; (where &lt;code&gt;graph&lt;&#x2F;code&gt; is a specific
&lt;code&gt;Graph&lt;&#x2F;code&gt; object) rather than just &lt;code&gt;Id&lt;&#x2F;code&gt;. Especially so when multiple
index spaces exist (instruction indices in two different function
bodies or the like).&lt;&#x2F;p&gt;
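&lt;p&gt;As an aside: today this kind of branding can be approximated with an invariant lifetime, in the style of the &quot;generativity&quot; (GhostCell-like) pattern. A minimal sketch, with hypothetical names not taken from any particular crate:&lt;&#x2F;p&gt;

```rust
use std::marker::PhantomData;

// An invariant lifetime 'brand "brands" a Graph and the Ids it hands out,
// so Ids from one graph cannot be used with another.
struct Graph<'brand> {
    edges: Vec<Vec<usize>>,
    _brand: PhantomData<fn(&'brand ()) -> &'brand ()>, // invariant in 'brand
}

#[derive(Clone, Copy)]
struct Id<'brand> {
    index: usize,
    _brand: PhantomData<fn(&'brand ()) -> &'brand ()>,
}

impl<'brand> Graph<'brand> {
    fn add_node(&mut self) -> Id<'brand> {
        self.edges.push(Vec::new());
        Id { index: self.edges.len() - 1, _brand: PhantomData }
    }

    fn degree(&self, id: Id<'brand>) -> usize {
        self.edges[id.index].len()
    }
}

// The higher-ranked closure mints a fresh, unnameable brand per graph.
fn with_graph<R>(f: impl for<'brand> FnOnce(Graph<'brand>) -> R) -> R {
    f(Graph { edges: Vec::new(), _brand: PhantomData })
}
```

&lt;p&gt;Because &lt;code&gt;&#x27;brand&lt;&#x2F;code&gt; is invariant and fresh on each call, an &lt;code&gt;Id&lt;&#x2F;code&gt; from one graph is a type error when passed to another -- a clunky stand-in for writing &lt;code&gt;Id&amp;lt;graph&amp;gt;&lt;&#x2F;code&gt; directly.&lt;&#x2F;p&gt;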
&lt;p&gt;So to recap: we&#x27;ve seen how taking paths rather than lifetimes in
general parameter lists can give us the same &quot;borrow to an external
thing&quot; capability inside an aggregate type, and also lets us bind
&lt;em&gt;which&lt;&#x2F;em&gt; external thing a little more tightly. It (kind of) reduces the
inventory of cognitive concepts by one, as well -- we no longer have
to think about lifetimes, just about local bindings. The genericity
over a path makes it clear and explicit what Rust lifetimes have always
been -- the lifetime of some object held by some stack frame somewhere
else (up the call stack).&lt;&#x2F;p&gt;
&lt;p&gt;Good so far! But some readers might now wonder: how does this
&lt;em&gt;actually&lt;&#x2F;em&gt; work with respect to the borrow checker&#x27;s tracking? The way
that traditional lifetime parameters &quot;hold a borrow open&quot; on an
external source of data is a little subtle, and depends on the code
that constructs a type: for example,&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; make_struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;data&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; [&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;])&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;causes &lt;code&gt;data&lt;&#x2F;code&gt; to be borrowed for as long as &lt;code&gt;S&lt;&#x2F;code&gt; exists because of the
combination of two constraints: well-formedness of &lt;code&gt;S&lt;&#x2F;code&gt; means that any
lifetime mentioned in &lt;code&gt;S&lt;&#x2F;code&gt; (here &lt;code&gt;&#x27;a&lt;&#x2F;code&gt;) outlives &lt;code&gt;S&lt;&#x2F;code&gt;, and then that
lifetime &lt;code&gt;&#x27;a&lt;&#x2F;code&gt; is tied to a borrow of &lt;code&gt;data&lt;&#x2F;code&gt; that is initially created
by the caller. So, in the caller&#x27;s context, whatever local path we
needed to borrow to get &lt;code&gt;data&lt;&#x2F;code&gt; will be blocked (cannot be borrowed
mutably, dropped, written to, or moved out of) while this &lt;code&gt;S&amp;lt;&#x27;a&amp;gt;&lt;&#x2F;code&gt;
exists.&lt;&#x2F;p&gt;
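&lt;p&gt;Written out in full in today&#x27;s Rust (the body here is just an illustrative sketch of what the elided &lt;code&gt;...&lt;&#x2F;code&gt; might do):&lt;&#x2F;p&gt;

```rust
// S holds borrows of the caller's data for as long as it exists.
struct S<'a> {
    part1: &'a [u8],
    part2: &'a [u8],
}

fn make_struct<'a>(data: &'a [u8]) -> S<'a> {
    // Split the input in half; both halves share the lifetime 'a.
    let (part1, part2) = data.split_at(data.len() / 2);
    S { part1, part2 }
}
```

&lt;p&gt;While the returned &lt;code&gt;S&lt;&#x2F;code&gt; is alive, the borrowed data cannot be mutated, moved, or dropped: dropping the owning &lt;code&gt;Vec&lt;&#x2F;code&gt; before the last use of &lt;code&gt;S&lt;&#x2F;code&gt;, for example, is a compile-time error.&lt;&#x2F;p&gt;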
&lt;p&gt;How do we get equivalent behavior with path parameters? With the
equivalent signature&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; make_struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;data&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;])&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;data&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;in order to enforce the same &quot;&lt;code&gt;data&lt;&#x2F;code&gt; borrowed as long as &lt;code&gt;S&lt;&#x2F;code&gt; is alive&quot;
property that we had from the signature and well-formedness above, we
need some analogous well-formedness rules for path parameters.&lt;&#x2F;p&gt;
&lt;p&gt;The simple way out would be to say that a path parameter always
borrows the path it mentions. Then we can use it for lifetimes, and we
have the same semantics as above. But wait: what kind of borrow,
immutable or mutable? In the lifetime-based signature above, we knew
this because the caller created the borrow then explicitly saw that
the borrow&#x27;s lifetime was captured by &lt;code&gt;S&lt;&#x2F;code&gt;; here &lt;code&gt;S&lt;&#x2F;code&gt; captures the
&lt;em&gt;path&lt;&#x2F;em&gt; but isn&#x27;t necessarily tied to the borrow on &lt;code&gt;data&lt;&#x2F;code&gt;. So we need
something more explicit, somewhere, to denote the long-term borrow.&lt;&#x2F;p&gt;
&lt;p&gt;Here&#x27;s an even more interesting question: could it ever be meaningful
and useful to mention, and bind to, a path, &lt;em&gt;without&lt;&#x2F;em&gt; borrowing it?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;path-parameter-modes&quot;&gt;Path Parameter Modes&lt;&#x2F;h2&gt;
&lt;p&gt;To resolve these questions, we add &lt;em&gt;borrow modes&lt;&#x2F;em&gt; to path parameters:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Data&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  part1&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Data&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; [&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  part2&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Data&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; [&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;u8&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;],&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and then add a rule about well-formed types: the borrow-mode of a path
in the parameter list must subsume its mode in any use. For example,
if used to denote the lifetime of an immutable borrow as above, we
must have &lt;code&gt;path &amp;amp;Data&lt;&#x2F;code&gt;; it would be a compile-time error to have only
&lt;code&gt;path Data&lt;&#x2F;code&gt;.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#5&quot;&gt;5&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The meaning of this path parameter mode is exactly as if the &lt;code&gt;S&lt;&#x2F;code&gt; held
the corresponding borrow for its lifetime. Thus we recover the full
semantics of the struct with the traditional lifetime parameter above,
and likewise we can encode the semantics of a mutable borrow.&lt;&#x2F;p&gt;
&lt;p&gt;This also neatly handles the question of whether paths can overlap or
must be disjoint: e.g., for &lt;code&gt;struct S&amp;lt;path P, path Q&amp;gt;&lt;&#x2F;code&gt;, are &lt;code&gt;P&lt;&#x2F;code&gt; and
&lt;code&gt;Q&lt;&#x2F;code&gt; separate bindings at the instantiation site, so we can have two
fields &lt;code&gt;&amp;amp;&#x27;P mut u32&lt;&#x2F;code&gt; and &lt;code&gt;&amp;amp;&#x27;Q mut u32&lt;&#x2F;code&gt;? If we require path parameter
modes to subsume their uses, then we would need &lt;code&gt;struct S&amp;lt;path &amp;amp;mut P, path &amp;amp;mut Q&amp;gt;&lt;&#x2F;code&gt;, and then at the instantiation site the disjointness
would be enforced. Immutably-borrowed paths need not be disjoint (and
consequently we cannot assume disjointness when typechecking within
their scope).&lt;&#x2F;p&gt;
&lt;p&gt;Now a new idea arises: if we can have immutably-borrowed and
mutably-borrowed modes on path parameters, and these are explicit,
what would it mean to have &lt;em&gt;no&lt;&#x2F;em&gt; borrow on a path?&lt;&#x2F;p&gt;
&lt;p&gt;For example, one might declare&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and then create&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; make_handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-language&quot;&gt;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; foo&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;   let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; = ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;   let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; b&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; = ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;   let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; handle_a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;make_handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;   let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; handle_b&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;b&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; b&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;make_handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;();&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and &lt;code&gt;handle_a&lt;&#x2F;code&gt; and &lt;code&gt;handle_b&lt;&#x2F;code&gt; are tied irrevocably to &lt;code&gt;a&lt;&#x2F;code&gt; and &lt;code&gt;b&lt;&#x2F;code&gt;. If
&lt;code&gt;a&lt;&#x2F;code&gt; goes out of scope, then &lt;code&gt;handle_a&lt;&#x2F;code&gt;&#x27;s lifetime must have ended by
that point: that works the same as any lifetime constraint. However,
extremely importantly, we &lt;em&gt;do not borrow &lt;code&gt;a&lt;&#x2F;code&gt;&lt;&#x2F;em&gt; while this handle
exists. &lt;code&gt;a&lt;&#x2F;code&gt; could be passed as a &lt;code&gt;&amp;amp;mut self&lt;&#x2F;code&gt; to various method calls,
we could do arbitrary things to it, and the handle type allows that;
it is &lt;em&gt;only&lt;&#x2F;em&gt; tied to the continued &lt;em&gt;existence&lt;&#x2F;em&gt; of &lt;code&gt;a&lt;&#x2F;code&gt;, the binding.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#6&quot;&gt;6&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;What good is this, then? For one, we can define methods on the handle
that &lt;em&gt;must&lt;&#x2F;em&gt; take the original container or parent object. So we don&#x27;t
hold a borrow open for the duration of the handle, but we do take the
borrow just for a particular access or mutation. This is analogous to
the ubiquitous &quot;index into &lt;code&gt;Vec&lt;&#x2F;code&gt; as reference&quot; pattern in Rust (in
fact, keep reading!) but at the type level:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; do_mutation&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage&quot;&gt;: &amp;amp;mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;(Some might call this a &quot;singleton type&quot;. Please feel free to bikeshed
that syntax!) The idea is that we take a mutable borrow of the path
that we irrevocably tied ourselves to -- but we only take that during
the method call.&lt;&#x2F;p&gt;
&lt;p&gt;But if the typechecker will &lt;em&gt;only&lt;&#x2F;em&gt; let us pass that path to the
method, why require it to be written at all? Hence the next idea,
&lt;em&gt;implicit path-constrained parameters&lt;&#x2F;em&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; do_mutation&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Parent&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage&quot;&gt;: &amp;amp;mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;](&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;(Please &lt;em&gt;really&lt;&#x2F;em&gt; bikeshed that syntax: is a separate argument list
really the best?) This means that we call the method with a signature
that takes only &lt;code&gt;&amp;self&lt;&#x2F;code&gt; -- in some sense, its type really is (a subtype
of) &lt;code&gt;Fn(&amp;Self)&lt;&#x2F;code&gt; -- but it has captured the &lt;em&gt;path&lt;&#x2F;em&gt; and will borrow it
when called.&lt;&#x2F;p&gt;
&lt;p&gt;Then we can do something like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; use_it&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  let mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; T&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; = ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; handle_1&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;lookup_elt&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;1&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;  let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; handle_2&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;lookup_elt&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;2&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  handle_1&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;do_mutation&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;();&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; implicitly borrows `&amp;amp;mut a`, just for the call&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  handle_2&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;do_mutation&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;();&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; implicitly borrows `&amp;amp;mut a`, just for the call&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Pretty convenient, no? (If you&#x27;re worried about the dangling-iterator
problem -- what if a mutation invalidates a handle? -- read on.)&lt;&#x2F;p&gt;
&lt;p&gt;There are a few bikeshedding points noted above, and possible
extensions such as type constraints on paths (say that I want &lt;code&gt;path Parent&lt;&#x2F;code&gt; to be a &lt;code&gt;T&lt;&#x2F;code&gt; specifically), but overall I like this design and
believe it has some potential beyond just ergonomics. Before I get to
that though -- under &quot;deferred borrows&quot; below -- one more detour, to
talk about mutable borrows and invalidation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;precise-invalidation-and-encoding-data-structure-properties&quot;&gt;Precise Invalidation and Encoding Data Structure Properties&lt;&#x2F;h2&gt;
&lt;p&gt;Above, we created &quot;handles&quot; that are tied to a parent object; they
hold no borrow or other form of reservation until use. We irrevocably
tied the handles to the &lt;em&gt;binding&lt;&#x2F;em&gt; &lt;code&gt;a&lt;&#x2F;code&gt;, but what stops an intervening
call&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; old&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span&gt; std&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;::&lt;&#x2F;span&gt;&lt;span&gt;mem&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;::&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;replace&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage&quot;&gt;&amp;amp;mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; new_container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;());&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;from replacing whatever internal state the handles referenced?&lt;&#x2F;p&gt;
&lt;p&gt;Conventional, idiomatic Rust with borrow-based APIs avoids this
scenario by holding the borrow on the container&#x2F;parent object for the
entire lifetime of the handle. This essentially freezes the container
in place so that whatever piece we&#x27;re logically referencing with the
handle continues to exist. But by doing so, we give up all the
convenience (and expressive power!) of handles-without-borrows: it is
often useful to collect &quot;fingers&quot; into a data structure, then perform
mutations through them later. For example, we may have several
&lt;code&gt;handle_a&lt;&#x2F;code&gt;s above (an array of them, or a hashmap, perhaps), and wish
to write to a referenced element through some handle, and may not know
which until we&#x27;ve collected several handles.&lt;&#x2F;p&gt;
&lt;p&gt;But why is that desired usage pattern safe? The only reason is that we
know that our later mutations will mutate &lt;em&gt;elements&lt;&#x2F;em&gt;, but not the
shape of the container itself. In essence, idiomatic Rust with
handles-that-hold-borrows blurs this distinction because that is the
best the type system can do. This is where the index-as-reference
pattern can save us again: if we use indices into a &lt;code&gt;Vec&lt;&#x2F;code&gt;, we don&#x27;t
actually borrow until we directly access an element. This works too,
but leads to other sorts of awkwardness. For example, in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;&quot;&gt;regalloc2&lt;&#x2F;a&gt;, which
makes pervasive use of the index-as-reference pattern, I often have to
&quot;re-derive&quot; a true borrow from an index because I perform some other
mutation (to some other disjoint state that I know shouldn&#x27;t matter)
in the meantime.&lt;&#x2F;p&gt;
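&lt;p&gt;For readers who haven&#x27;t run into it, the index-as-reference pattern and the &quot;re-derive&quot; step look roughly like this in plain Rust (hypothetical toy types, not actual regalloc2 code):&lt;&#x2F;p&gt;

```rust
// A "handle" is just a plain index; it holds no borrow on the container.
#[derive(Clone, Copy)]
struct NodeIdx(usize);

struct Graph {
    values: Vec<u32>,
}

impl Graph {
    fn add(&mut self, v: u32) -> NodeIdx {
        self.values.push(v);
        NodeIdx(self.values.len() - 1)
    }
}

fn example() -> u32 {
    let mut g = Graph { values: Vec::new() };
    // Collect several "fingers" into the structure up front...
    let a = g.add(10);
    let b = g.add(20);
    // ...freely mutate in between, since the indices hold no borrows...
    g.add(30);
    // ...then re-derive true borrows from the indices only at each access.
    let addend = g.values[b.0]; // short immutable borrow, released immediately
    g.values[a.0] += addend;    // short mutable borrow, released immediately
    g.values[a.0]
}
```

&lt;p&gt;The cost, as noted above, is that nothing ties a &lt;code&gt;NodeIdx&lt;&#x2F;code&gt; to a particular &lt;code&gt;Graph&lt;&#x2F;code&gt;, and every access re-borrows by hand.&lt;&#x2F;p&gt;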
&lt;p&gt;In principle we should be able to encode in the type system that some
handle relies on the &quot;shape&quot; of the container or parent object
remaining the same, but nothing else. For this, we propose an idiom
that we call &quot;virtual fields&quot; -- encoded as zero-sized unit fields,
perhaps with better syntax later -- that we can hold borrows on,
combined with the use of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;smallcultfollowing.com&#x2F;babysteps&#x2F;&#x2F;blog&#x2F;2021&#x2F;11&#x2F;05&#x2F;view-types&#x2F;&quot;&gt;view
types&lt;&#x2F;a&gt;
(also recapped in Niko&#x27;s &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;smallcultfollowing.com&#x2F;babysteps&#x2F;blog&#x2F;2024&#x2F;06&#x2F;02&#x2F;the-borrow-checker-within&#x2F;&quot;&gt;latest
post&lt;&#x2F;a&gt;)
in path parameter specs and&#x2F;or method bodies.&lt;&#x2F;p&gt;
&lt;p&gt;(To be absolutely clear, the proposal in this subsection is an
&lt;em&gt;idiom&lt;&#x2F;em&gt;, and depends only on the &quot;view types&quot; extension proposed in
those posts.)&lt;&#x2F;p&gt;
&lt;p&gt;For example, to sketch what this might look like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  pub&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; shape&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; (),&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; ^^ &amp;quot;virtual field&amp;quot;: a unit-typed field on which we use borrows via&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; view types to encode which methods modify which properties.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  contents&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;: ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;shape&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; *really* needs some bikeshedding ^^&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment&quot;&gt;  &#x2F;&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; meaning is: irrevocably bind to a given container; hold an immutable&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; borrow on its `shape` field always; but don&amp;#39;t otherwise hold a borrow&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; on it&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  elt&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage&quot;&gt;: *const&lt;&#x2F;span&gt;&lt;span&gt; Element&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; internally unsafe pointer can be held because shape is constant&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; handle_mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; idx&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; usize&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-language&quot;&gt;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; ^^ on return, `self.shape` is borrowed immutably, but nothing else is&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;path&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;shape&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Handle&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; deref&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;c&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;^&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;shape&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;](&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt; &amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Element&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; ^^ when *this* returns, we actually borrow Container, but that borrow only&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; lasts as long as we hold the true borrow; the handle is &amp;quot;just as good&amp;quot;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; (except that it doesn&amp;#39;t freeze the actual element contents)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment&quot;&gt;  &#x2F;&#x2F;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; The `&amp;amp;^shape` syntax means &amp;quot;immutably borrow everything in `Container`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; except the `shape` field&amp;quot;; this signature is promising that we don&amp;#39;t&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; mutate the container&amp;#39;s shape.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;  fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; deref_mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;c&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;^&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;shape&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Container&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;](&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-variable z-language&quot;&gt;&amp;amp;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage&quot;&gt; -&amp;gt; &amp;amp;mut&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Element&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; ...&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that this is quickly encroaching on territory well-trodden by
&lt;code&gt;Pin&lt;&#x2F;code&gt; and pin-projection and self-referential types; this post is
certainly not the first to ever meditate on the difference between
&quot;immutable shape&quot; and deep immutability in Rust. Interesting
hypothesis: perhaps someday &lt;code&gt;Pin&lt;&#x2F;code&gt; could be encoded with careful use of
view-types and virtual fields denoting properties. Even more
interesting hypothesis: containers with invariants not currently
described in the type system today -- such as the fact that moving a
&lt;code&gt;String&lt;&#x2F;code&gt; does not move the pointed-to content, likewise for a &lt;code&gt;Vec&lt;&#x2F;code&gt; --
could loosen their signatures and encode this as well (with careful
thought toward backward-compatibility). This might be a way to encode
the &quot;self-referential struct&quot; example:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;struct&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; S&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  data&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; String&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;  parts&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Vec&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;&amp;amp;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;#39;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;self&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span&gt;data &lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;str&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where the &lt;code&gt;parts&lt;&#x2F;code&gt; hold &lt;code&gt;data&lt;&#x2F;code&gt;&#x27;s contents borrowed, but not &lt;code&gt;data&lt;&#x2F;code&gt;
itself, so the &lt;code&gt;S&lt;&#x2F;code&gt; is still movable.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ll readily admit I&#x27;ve gotten increasingly handwavy as this section
continues; but I believe there is &lt;em&gt;something&lt;&#x2F;em&gt; here and with enough
careful specification it could form a cohesive, backward-compatible
language overlay that adds expressive power to Rust APIs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;new-pattern-deferred-borrows&quot;&gt;New Pattern: Deferred Borrows&lt;&#x2F;h2&gt;
&lt;p&gt;Finally we&#x27;ve reached (one) end application of all this language
machinery: implementing container types with &lt;em&gt;deferred borrows&lt;&#x2F;em&gt;, as I
had proposed in my &lt;a href=&quot;&#x2F;pubs&#x2F;ecoop2020_defborrow.pdf&quot;&gt;2020 paper&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The key idea is to define a kind of auto-deref trait, like
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;doc.rust-lang.org&#x2F;std&#x2F;ops&#x2F;trait.Deref.html&quot;&gt;&lt;code&gt;Deref&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;, but
with a captured path and implicit parameter such that it only
(necessarily) borrows that implicit &quot;context&quot; when converting to a
true borrow. Then we could implement this trait for handles and have
them work as any other smart-pointer type today (such as &lt;code&gt;Rc&lt;&#x2F;code&gt; or
&lt;code&gt;Box&lt;&#x2F;code&gt;) except that they would not unnecessarily hold open a borrow on
the parent container; only its &quot;shape&quot; or whatever invariants are
needed to keep elements alive.&lt;&#x2F;p&gt;
&lt;p&gt;In the paper I proposed several variants of each container and a kind
of API-level typestate to encode what handles do hold borrows on,
hence how efficient they can be. For example, a &lt;code&gt;Vec&lt;&#x2F;code&gt; might become a
&lt;code&gt;FrozenVec&lt;&#x2F;code&gt;, where the length is fixed; then handles can be true
pointers, and the deferred borrows are as efficient as real
borrows. Or it could be an &lt;code&gt;AppendOnlyVec&lt;&#x2F;code&gt;, with indices, incurring an
indirection through the storage base pointer (which may change as
growth causes reallocations) on each conversion to a true borrow but
&lt;em&gt;not&lt;&#x2F;em&gt; a dynamic bounds-check. With the above &quot;virtual fields&quot; idea and
view types, I think we could get away without the typestate-like API,
instead handing out different kinds of handles (&lt;code&gt;PtrHandle&lt;&#x2F;code&gt; vs
&lt;code&gt;IndexHandle&lt;&#x2F;code&gt; vs ...) directly. Perhaps other variants could exist as
well; read the paper for more. (My intent here is not to summarize the
entire paper, but rather to show that its proposals become possible
once one has handles tied to paths and able to implicitly borrow
them.)&lt;&#x2F;p&gt;
&lt;h2 id=&quot;related-approaches&quot;&gt;Related Approaches&lt;&#x2F;h2&gt;
&lt;p&gt;One might reasonably ask: is there a way to &quot;trick&quot; the lifetime
system into giving us branded types today, where one value is
irrevocably tied to another?&lt;&#x2F;p&gt;
&lt;p&gt;Inherently what one needs is &lt;em&gt;generativity&lt;&#x2F;em&gt;, a kind of type-system
feature where each execution or instance of an operation yields a
different type.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#7&quot;&gt;7&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I am aware of at least one instance of a &quot;generativity trick&quot; with
Rust lifetimes, described by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;faultlore.com&#x2F;blah&#x2F;&quot;&gt;Gankra&lt;&#x2F;a&gt;&#x27;s
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;raw.githubusercontent.com&#x2F;Gankra&#x2F;thesis&#x2F;master&#x2F;thesis.pdf&quot;&gt;thesis&lt;&#x2F;a&gt;
in section 6.3, where closures and forced-invariance of lifetimes are
used to create separate lifetimes for separate arrays (in the example)
such that the handle type (indices in the example) can be uniquely
associated with only one array. This is undoubtedly an extremely
impressive hack; nevertheless, better would be for the language to
provide a first-class, readable way for the user to write their intent
(&lt;code&gt;Handle&amp;lt;myvec&amp;gt;&lt;&#x2F;code&gt;) without workaround.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;This blog post has outlined an idea for &quot;path generics&quot;, where path
parameters live alongside (and perhaps someday mostly replace?)
lifetime parameters as a more intuitive means of describing the
lifetime origin of some borrowed data. Along with that, we&#x27;ve seen how
they add expressive power in &quot;branding&quot; one type to be tied
irrevocably to some parent value (path), allowing for typesafe &quot;handle
patterns&quot;, and how a bit of configurability in their interaction with
borrow checking can lead to interesting and useful new APIs that were
strictly inexpressible before.&lt;&#x2F;p&gt;
&lt;p&gt;What do I hope to come of all this? As I noted in a footnote&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;
above, in hindsight I&#x27;ve tried to push ideas in the &quot;deferred borrows
family&quot; in sort of halfhearted ways before; mainly, I have a day-job
that is &quot;in Rust&quot;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#8&quot;&gt;8&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; but is not &quot;the Rust language&quot; and I don&#x27;t feel
I have the energy to push forward a full language proposal on the side
or really go much beyond this post. (See above also re: me not being a
properly trained PL person.) But I &lt;em&gt;do&lt;&#x2F;em&gt; think there may be something
here, to the extent I want to braindump a bit and see what folks
think, especially now that there is some explicit thinking and
discussion about &quot;places as lifetimes&quot;, view types, and other
borrow-system extensions.&lt;&#x2F;p&gt;
&lt;p&gt;So: maybe someone else will also see some value here (and see past the
undoubtedly rough bikesheddable surface design details) and explore
further. Maybe not. Either way the ideas are written down and out
there. Feedback welcome!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;In hindsight, this paper was very much not the right way to get
the idea out. I published this paper just after entering
industry from academia, still not quite sure what I wanted to
do; I didn&#x27;t feel I had the energy to engage with Rust and
actually drive an idea forward, but I did want to write it up
&lt;em&gt;somewhere&lt;&#x2F;em&gt;, so I attempted that rare beast, a single-author
paper written in one&#x27;s free time (with all the corresponding
limits on completeness). This blog post is, in one way of seeing
it, another attempt at &quot;putting it out there&quot;: I almost
certainly don&#x27;t have the bandwidth to spec or implement this but
I&#x27;m curious what folks think and it feels more relevant now that
it is an &quot;incremental generalization&quot; on top of another idea
just proposed.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;For pedagogical reasons let&#x27;s ignore global variables and
&lt;code&gt;&#x27;static&lt;&#x2F;code&gt; as well for now. The former is unsafe for a reason,
and the latter is the &quot;boring pathological case&quot; where data is
valid because it literally lives forever. In principle any
interesting lifetime in Rust is tied to a span of code executing
in some stack frame at some level.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;In day-to-day use of Rust, explicit lifetimes are fairly rare
outside of writing custom containers and perhaps &quot;context
structs&quot; that bundle a set of borrows; but they &lt;em&gt;do&lt;&#x2F;em&gt; occur, and
sometimes efficiency (non-copying) concerns and layering force
nested lifetimes as well. The worst I&#x27;ve personally had to write
was a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;af59c4d568d487b7efbb49d7d31a861e7c3933a6&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;opts.rs#L76-L79&quot;&gt;three-level
nesting&lt;&#x2F;a&gt;
necessary to encode a long-lived data structure, an analysis
context over it, and an iteration context within that analysis
context. It would have been great to have some better
abstraction here! Even I was a bit frustrated, and I am solidly
on Team &quot;Borrow Checker Saves Me Time and Improves My Code In
The Long Run&quot;!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;4&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;Path variables used as lifetimes are semantically &lt;em&gt;different&lt;&#x2F;em&gt;,
and in the way described more powerful, but they also do not
fully subsume separate lifetime variables, as far as I can tell:
one cannot have a path that is the &quot;meet&quot; of two different
paths, e.g. a borrow in a struct that comes from either &lt;code&gt;P&lt;&#x2F;code&gt; &lt;em&gt;or&lt;&#x2F;em&gt;
&lt;code&gt;Q&lt;&#x2F;code&gt; (both of which live long enough). For this reason one may
still want to use conventional lifetimes, or perhaps path
parameters could be extended with a notion of &quot;union paths&quot;
(TBD!).&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;5&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;5&lt;&#x2F;sup&gt;
&lt;p&gt;There are likely some more details in exactly how path
parameters interact with both well-formedness and subtyping that
I haven&#x27;t worked out, mostly because I am not a type theorist (I
only play one in blog posts, and then only for the fun
parts). For example, I suspect one would want a kind of &quot;plus
constraint&quot; analogous to &lt;code&gt;T + &#x27;a&lt;&#x2F;code&gt; for paths too, to say
something like &quot;&lt;code&gt;T&lt;&#x2F;code&gt; and it&#x27;s allowed to borrow &lt;code&gt;P&lt;&#x2F;code&gt;&quot;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;6&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;6&lt;&#x2F;sup&gt;
&lt;p&gt;One might worry that this is insufficient for a really robust
handle type: what if I take a mutable borrow of &lt;code&gt;a&lt;&#x2F;code&gt; and do a
&lt;code&gt;std::mem::replace&lt;&#x2F;code&gt; on it? This proposal&#x27;s answer is that it is
the responsibility of the API author to define signatures such
that this is not possible; but with &quot;virtual fields&quot; (keep
reading) there is a way to enforce this.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;7&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;7&lt;&#x2F;sup&gt;
&lt;p&gt;Note this is a bit subtle: we could mean each operation at a
different static program point -- this is the kind of
generativity that, say, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ocaml.org&#x2F;manual&#x2F;5.2&#x2F;generativefunctors.html&quot;&gt;OCaml generative functor
application&lt;&#x2F;a&gt;
provides -- or we could mean that each &lt;em&gt;dynamic&lt;&#x2F;em&gt; instance of,
say, an object allocation produces a conceptually new type. We
want something like the latter for branding: it shouldn&#x27;t be
possible to allocate a list of objects in a loop, put them all
in a vector of same-typed values and mix up their handles.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;8&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;8&lt;&#x2F;sup&gt;
&lt;p&gt;Cranelift, Wasmtime, and other compilers and runtimes stuff, so
very intensively &quot;in Rust&quot;, but definitely the language itself
is out of scope!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Fast(er) JavaScript on WebAssembly: Portable Baseline Interpreter and Future Plans</title>
        <published>2023-10-11T00:00:00+00:00</published>
        <updated>2023-10-11T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2023/10/11/spidermonkey-pbl/"/>
        <id>https://cfallin.org/blog/2023/10/11/spidermonkey-pbl/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2023/10/11/spidermonkey-pbl/">&lt;p&gt;For the past year, I have been hard at work trying to improve the
performance of the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;spidermonkey.dev&#x2F;&quot;&gt;SpiderMonkey&lt;&#x2F;a&gt;
JavaScript engine when compiled as a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;webassembly.org&#x2F;&quot;&gt;WebAssembly&lt;&#x2F;a&gt; module. For server-side
applications that use WebAssembly (and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;wasi.dev&#x2F;&quot;&gt;WASI&lt;&#x2F;a&gt;, its
&quot;system&quot; layer) as a software distribution and sandboxing technology
with &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;webassembly-the-updated-roadmap-for-developers&quot;&gt;significant exciting
potential&lt;&#x2F;a&gt;,
this is an important enabling technology: it allows existing software
written in JavaScript to be run within the sandboxed environment and
to interact with other Wasm modules.&lt;&#x2F;p&gt;
&lt;p&gt;Running an entire JavaScript engine &lt;em&gt;inside&lt;&#x2F;em&gt; of a Wasm module may seem
like a strange approach at first, but it serves real use-cases. There
are platforms that accept WebAssembly-sandboxed code for security
reasons, as it ensures complete memory isolation between requests
while remaining very fine-grained (hence with lower overheads). In
such an environment, JavaScript code needs to bring its own engine,
because no platform-native JS engine is provided. This approach ensures
a sandbox &lt;em&gt;without trusting the JavaScript engine&#x27;s security&lt;&#x2F;em&gt; --
because the JS engine is just another application on the hardened Wasm
platform -- and carries other benefits too: for example, the JS code
can easily interact with other languages that compile to Wasm, and we
can leverage Wasm&#x27;s determinism and modularity to snapshot execution
and then perform extremely fast cold startup. We have been using this
strategy to great success for a while now: we did the initial port of
SpiderMonkey to WASI in 2020, and we wrote &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;making-javascript-run-fast-on-webassembly&quot;&gt;two years
ago&lt;&#x2F;a&gt;
about how we can leverage Wasm&#x27;s clean modularity and determinism to
use snapshots for fast startup.&lt;&#x2F;p&gt;
&lt;p&gt;This post is an update to that effort. At the end of that prior post,
we hinted at efforts to bring more of the JavaScript performance
engineering magic that browsers have done to the JS-in-Wasm
environment. Today we&#x27;ll see how we&#x27;ve successfully adapted &lt;em&gt;inline
caches&lt;&#x2F;em&gt;, achieving significant speedups (~2x in some cases) without
compromising the security of the interpreter-based strategy. At the
end of this post, I&#x27;ll hint at how we plan to use this as a foundation
for ahead-of-time compilation as well. Exciting times ahead!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;Note: the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;docs.google.com&#x2F;document&#x2F;d&#x2F;1XZZnc5xxfOVnemxRrTbZqvCg0PH4B9GyGgXnBhULROA&quot;&gt;design
document&lt;&#x2F;a&gt;
is also available, and SpiderMonkey patches can be found on the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bugzilla.mozilla.org&#x2F;show_bug.cgi?id=1855321&quot;&gt;upstreaming
bug&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;current-state-interpreters-only-beyond-this-point&quot;&gt;Current State: Interpreters Only Beyond This Point&lt;&#x2F;h2&gt;
&lt;p&gt;A distinguishing feature of some platforms is that they &lt;em&gt;do not allow
runtime code generation&lt;&#x2F;em&gt;. For example, a WebAssembly module may
contain functions; these functions are compiled at some point before
the functions are run; but the functions cannot do anything to create
&lt;em&gt;new&lt;&#x2F;em&gt; functions in the same code space, at least without going through
some lower-level and nonstandard system interface.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, JavaScript engines over the past several decades
have &lt;em&gt;embraced&lt;&#x2F;em&gt; runtime code generation. The basic reason for this is
that there are a lot of facts about a JS program that one cannot know
(or not easily, without human intelligence and reasoning and a view of
the whole program) until the program executes. For example: in the
simple one-line function &lt;code&gt;function(x) { return x + x; }&lt;&#x2F;code&gt;, what is
&lt;code&gt;x&lt;&#x2F;code&gt;&#x27;s type? It could be an integer, a floating-point number, a string,
or an object that can be converted to a string, or probably many other
things.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; If one were to try to generate machine code for that
function ahead-of-time, it would look very different from that of,
say, a C function with the same &lt;code&gt;return x + x;&lt;&#x2F;code&gt; body but an
integer-typed &lt;code&gt;x&lt;&#x2F;code&gt;. It would have to contain type-checks, branches, and
implementations for all the different possible cases. With that
dynamic dispatch overhead, and the bloated and difficult-to-optimize
function body that supports all combinations of types, we are unlikely
to see much speedup &lt;em&gt;unless&lt;&#x2F;em&gt; we can know something about the types. A
similar problem arises in other &quot;dynamic&quot; aspects of the language:
when we say &lt;code&gt;obj.x&lt;&#x2F;code&gt;, where in the memory for the object &lt;code&gt;obj&lt;&#x2F;code&gt; is the
field &lt;code&gt;x&lt;&#x2F;code&gt;? When we call a function &lt;code&gt;f(1, 2, 3)&lt;&#x2F;code&gt;, is &lt;code&gt;f&lt;&#x2F;code&gt; another
JavaScript function, a native function in the runtime, or something
even more special that we handle in a different way?&lt;&#x2F;p&gt;
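&lt;p&gt;As a concrete illustration, here is a minimal C++ sketch (invented
types and names, nothing from SpiderMonkey itself) of what a generic
&lt;code&gt;+&lt;&#x2F;code&gt; over tagged values has to do on every call:&lt;&#x2F;p&gt;

```cpp
// Illustrative sketch only: a toy tagged-value type and a generic "+"
// that must branch on operand types at runtime.
enum class Tag { Int32, Double, String };

struct Value {
  Tag tag;
  long long i;    // valid when tag == Tag::Int32
  double d;       // valid when tag == Tag::Double
  const char* s;  // valid when tag == Tag::String (unused here)
};

// A generic add: every call pays for the tag checks and dispatch,
// which is exactly what a specialized fast-path lets us skip.
Value genericAdd(Value a, Value b) {
  if (a.tag == Tag::Int32) {
    if (b.tag == Tag::Int32) {
      return Value{Tag::Int32, a.i + b.i, 0.0, nullptr};
    }
  }
  if (a.tag == Tag::Double) {
    if (b.tag == Tag::Double) {
      return Value{Tag::Double, 0, a.d + b.d, nullptr};
    }
  }
  // ... string concatenation, int/double coercion, valueOf calls, etc.
  return Value{Tag::Double, 0, 0.0, nullptr};  // placeholder fallback
}
```

&lt;p&gt;A compiled version specialized to, say, two &lt;code&gt;int32&lt;&#x2F;code&gt;
operands can skip straight to the integer add.&lt;&#x2F;p&gt;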
&lt;p&gt;The modern JavaScript engine&#x27;s answer to performance, then, is to
collect a lot of information as a program executes and then
dynamically generate machine code that &lt;em&gt;is&lt;&#x2F;em&gt; specialized to what the
program is actually doing, but can fall back to the generic
implementation if conditions change. Because we can&#x27;t generate this
code until the program is already running, we need the platform to
support the ability to add more executable code as we are running:
this is &quot;runtime codegen&quot;, or as it is often known, &quot;JIT
(just-in-time) compilation&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;So we have a conundrum: we run JavaScript inside a Wasm module, we
want performance, but the usual way to get that performance is to
JIT-compile specialized machine-code versions of the JS code after
observing it, and we can&#x27;t do that from within a Wasm module. What are
we to do?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;systematic-fast-paths-inline-caches&quot;&gt;Systematic Fast-Paths: Inline Caches&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s worth understanding &lt;em&gt;how&lt;&#x2F;em&gt; JITs specialize the execution of the
program based on runtime observations. A simple approach might be to
build &quot;ad-hoc&quot; kinds of observations into the interpreter, and use
those as needed in a type-specific &quot;specializing&quot; compiler. For
example, we could record types seen at a &lt;code&gt;+&lt;&#x2F;code&gt; operator at a given point
in the program, and then generate only the cases we&#x27;ve observed when
we compile that operator (perhaps with &quot;fallback code&quot; to call into a
generic implementation if our assumption becomes wrong). However, this
ad-hoc approach does not scale well: every semantic piece (operators,
function calls, object accesses) of the language implementation would
have to become a profiler, an analysis pass, and a profile-guided
compiler.&lt;&#x2F;p&gt;
&lt;p&gt;Instead, JITs specialize with a general &lt;em&gt;dispatch mechanism&lt;&#x2F;em&gt; known as
&quot;inline caches&quot; (ICs), and ICs build straight-line sequences of
&quot;fast-paths&quot; in a uniform &lt;em&gt;program representation&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The usual approach is to define certain points in the original program
at which we have an operator with widely-varying behavior, and place
an inline cache site in the compiled code. The idea of an inline cache
site is that it performs an indirect dispatch to some other &quot;stub&quot;:
these are &quot;pluggable implementations&quot; that replace a generic operator,
like &lt;code&gt;+&lt;&#x2F;code&gt;, with some specific case, like &quot;if both inputs are
&lt;code&gt;int32&lt;&#x2F;code&gt;-typed, do an integer add&quot;. For example, we might compile the
following function body into the polymorphic (type-generic) code on
the left, then generate the specialized fast-paths and attach them as
stubs on the right:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-ics-web.svg&quot; alt=&quot;Figure: Inline-cache stubs in a JavaScript function&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The IC site starts with a link to a generic implementation -- just
like the naive interpreter -- that works for any case. However, after
it executes, it also &lt;em&gt;generates a fast-path for that case&lt;&#x2F;em&gt; and
&quot;attaches&quot; the new stub to the IC site. The stubs form a &quot;chain&quot;, or
singly-linked list, with the generic case at the end. Some examples of
fast-paths that we see in practice are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For an access to a property on an object, we can generate a
fast-path that checks whether the object has a known &quot;shape&quot; --
defined as the set of existing properties and their layout in memory
-- and directly accesses the appropriate memory offset on a
match. This avoids an expensive lookup by property name.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;For any of the polymorphic built-in operators, like &lt;code&gt;+&lt;&#x2F;code&gt;, we can
generate a fast-path that checks types and does the appropriate
primitive action (integer addition, floating-point addition, or
string concatenation, say).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;For a call to a built-in or &quot;well-known&quot; function, we can generate a
fast-path that avoids a function call altogether. For example, if
the user calls &lt;code&gt;String.length&lt;&#x2F;code&gt;, and this has not been overridden
globally (we need to check!) and the input is a string, then the IC
can load the string length directly from the known length-field
location in the object. This replaces a call into the JS runtime&#x27;s
native string-implementation code with just a few IC instructions.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Each stub has a simple format: it checks some conditions, then either
does its action (if it is a matching fast-path) or jumps to the next
stub in the chain.&lt;&#x2F;p&gt;
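&lt;p&gt;In C++-flavored pseudocode (hypothetical types; the real stubs are
purpose-built machine code, not function pointers), the chain walk
looks something like:&lt;&#x2F;p&gt;

```cpp
// Illustrative sketch of an IC site's stub chain. Each stub checks
// its guards; on failure, dispatch falls through to the next stub,
// with a generic implementation (which always succeeds) at the end.
struct AddResult {
  bool handled;     // did this stub's guards pass?
  long long value;  // the result, if handled
};

typedef AddResult (*StubFn)(long long lhs, long long rhs);

struct Stub {
  StubFn fn;
  Stub* next;  // next stub in the chain; null after the generic case
};

long long icDispatch(Stub* chain, long long lhs, long long rhs) {
  for (Stub* s = chain; s != nullptr; s = s->next) {
    AddResult r = s->fn(lhs, rhs);
    if (r.handled) {
      return r.value;
    }
  }
  return 0;  // unreachable if the chain ends in a generic stub
}
```

&lt;p&gt;Attaching a new fast-path is then just prepending a node to this
list, ahead of the generic fallback.&lt;&#x2F;p&gt;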
&lt;p&gt;This collection of stubs, once &quot;warmed up&quot; by program execution, is
useful in at least two ways. First, it represents a knowledge-base of
the program&#x27;s actual behavior. The execution has been &quot;tuned&quot; to have
fast-paths inserted for cases that are actually observed, and will
become faster as a result. That is quite powerful indeed!&lt;&#x2F;p&gt;
&lt;p&gt;Second, an even more interesting opportunity arises (first introduced
in the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2020&#x2F;11&#x2F;warp-improved-js-performance-in-firefox-83&#x2F;&quot;&gt;WarpMonkey&lt;&#x2F;a&gt;
effort from the SpiderMonkey team, to their knowledge a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;2023.splashcon.org&#x2F;details&#x2F;mplr-2023-papers&#x2F;4&#x2F;CacheIR-The-Benefits-of-a-Structured-Representation-for-Inline-Caches&quot;&gt;novel
contribution&lt;&#x2F;a&gt;):
once we have the IC chains, we can use the combination of two parts --
the original program bytecode, and the pluggable stub fast-paths -- to
compile fully specialized code by &lt;em&gt;translating both to one IR and
inlining&lt;&#x2F;em&gt;. This is how we achieve specialized-variant compilation in a
systematic way: we just write out the necessary fast-paths as we need
them, and then we later incorporate them.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; The following figure
illustrates this process:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-ic-inlining-web.svg&quot; alt=&quot;Figure: optimized JS compilation by inlining ICs&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;In the SpiderMonkey engine, there are three JIT tiers that make use of ICs:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2019&#x2F;08&#x2F;the-baseline-interpreter-a-faster-js-interpreter-in-firefox-70&#x2F;&quot;&gt;baseline
interpreter&lt;&#x2F;a&gt;&quot;
interprets the JS function body&#x27;s opcodes, but accelerates
individual operations with ICs. The interpreter-based approach means
we have fast startup (because we don&#x27;t need to compile the function
body), while ICs give significant speedups on many operations.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The &quot;baseline compiler&quot; translates the JS function body into machine
code on a 1-for-1 basis (each JS opcode becomes a small snippet of
machine code), and dispatches to the same ICs that the baseline
interpreter does. The main speedup over the baseline interpreter is
that we no longer have the &quot;JS opcode dispatch overhead&quot; (the cost
of fetching JS opcodes and jumping to the right interpreter case),
though we do still have the &quot;IC dispatch overhead&quot; (the cost of
jumping to the right fast-path).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;The optimizing compiler, known as
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2020&#x2F;11&#x2F;warp-improved-js-performance-in-firefox-83&#x2F;&quot;&gt;WarpMonkey&lt;&#x2F;a&gt;,
inlines ICs and JS bytecode to perform specialized compilation.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We can summarize the advantages and tradeoffs of these tiers as
follows:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Data Required&lt;&#x2F;th&gt;&lt;th&gt;JS opcode dispatch&lt;&#x2F;th&gt;&lt;th&gt;ICs&lt;&#x2F;th&gt;&lt;th&gt;Optimization scope&lt;&#x2F;th&gt;&lt;th&gt;CacheIR dispatch&lt;&#x2F;th&gt;&lt;th&gt;Codegen at runtime&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Generic (C++) interpreter&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode&lt;&#x2F;td&gt;&lt;td&gt;Interpreter (C++)&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;td&gt;N&#x2F;A&lt;&#x2F;td&gt;&lt;td&gt;No&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Baseline interpreter&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures&lt;&#x2F;td&gt;&lt;td&gt;Interpreter (generated at startup)&lt;&#x2F;td&gt;&lt;td&gt;Dynamic dispatch&lt;&#x2F;td&gt;&lt;td&gt;Special cases within one opcode via IC&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;Yes (IC bodies, interp body)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Baseline compiler&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures&lt;&#x2F;td&gt;&lt;td&gt;Compiled function body (1:1 with bytecode)&lt;&#x2F;td&gt;&lt;td&gt;Dynamic dispatch&lt;&#x2F;td&gt;&lt;td&gt;Special cases within one opcode via IC&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;Yes (IC bodies, function bodies)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Optimizing compiler (WarpMonkey)&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + warmed-up IC chains&lt;&#x2F;td&gt;&lt;td&gt;Compiled function body (optimized)&lt;&#x2F;td&gt;&lt;td&gt;Inlined&lt;&#x2F;td&gt;&lt;td&gt;Across opcodes &#x2F; whole function&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;Yes (optimized function body)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;h2 id=&quot;can-we-use-ics-in-a-wasi-program&quot;&gt;Can we Use ICs in a WASI Program?&lt;&#x2F;h2&gt;
&lt;p&gt;Given that we have a means to speed up execution beyond that of a
generic interpreter, namely, inline caches (ICs), and given that
SpiderMonkey supports ICs, surely we can simply make use of this
feature in a build of SpiderMonkey for WASI (i.e., when running inside
of a Wasm module)?&lt;&#x2F;p&gt;
&lt;p&gt;Not so fast! There are two basic problems:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;As designed, the IC stubs can only be run as &lt;em&gt;compiled code&lt;&#x2F;em&gt;. Even
the &quot;baseline interpreter&quot; above will invoke a pointer to an IC stub
of machine code compiled with a dedicated IC-stub compiler.&lt;&#x2F;p&gt;
&lt;p&gt;This works well for SpiderMonkey on a native platform -- the fastest
way to implement a fast-path is to produce a purpose-built sequence
of a handful of machine instructions -- but is not compatible with
WebAssembly&#x27;s inability to add new code at runtime that we noted
above. This is because SpiderMonkey only knows what the fast-paths
should be after it starts executing, which is too late to add code
to the Wasm module.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Less fundamental, but still a roadblock: the &quot;baseline interpreter&quot;
in SpiderMonkey is &lt;em&gt;also&lt;&#x2F;em&gt; JIT-compiled, albeit once at JS engine
startup rather than as code is executing. This is more of an
implementation&#x2F;engineering tradeoff, wherein the SpiderMonkey
authors &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hacks.mozilla.org&#x2F;2019&#x2F;08&#x2F;the-baseline-interpreter-a-faster-js-interpreter-in-firefox-70&#x2F;&quot;&gt;realized they could reuse the baseline compiler
backend&lt;&#x2F;a&gt;
to cheaply produce a new tier (a brilliant idea!), but again is not
compatible with the WASI environment.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;You might already be thinking: the above two points are not laws of
nature -- nothing says that we can&#x27;t &lt;em&gt;interpret&lt;&#x2F;em&gt; whatever code we
would have JIT-compiled and executed in native SpiderMonkey. And you
would be right: in fact, that&#x27;s the starting point for the Portable
Baseline Interpreter (PBL)!&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#4&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-baseline-without-jit-portable-baseline&quot;&gt;A Baseline without JIT: Portable Baseline&lt;&#x2F;h2&gt;
&lt;p&gt;Here we can now introduce the &lt;em&gt;Portable Baseline Interpreter&lt;&#x2F;em&gt;, or PBL
for short. PBL is a new &lt;em&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;firefox-source-docs.mozilla.org&#x2F;js&#x2F;index.html#javascript-jits&quot;&gt;execution
tier&lt;&#x2F;a&gt;&lt;&#x2F;em&gt;
that replaces the native &quot;baseline interpreter&quot; described above. Its
key distinguishing feature is that it does not require any runtime
code generation (JIT-compilation). Thus, it is suitable for use in a
Wasm&#x2F;WASI program, or in any other environment where runtime codegen
is prohibited.&lt;&#x2F;p&gt;
&lt;p&gt;The key design principle of PBL is to stick as &lt;em&gt;closely as possible&lt;&#x2F;em&gt;
to the other baseline tiers. In SpiderMonkey, significant shared
machinery exists for the (existing) baseline interpreter and baseline
compiler: there is a defined stack layout and execution state, there
is code that understands how to garbage-collect, introspect, and
unwind this state, and there are mechanisms to track the inline caches
associated with baseline execution. PBL&#x27;s goal at a technical level is
to perform exactly the work that the (native) baseline interpreter
would do, except in portable C++ code rather than runtime-generated
code.&lt;&#x2F;p&gt;
&lt;p&gt;To achieve this goal, the two major tasks were:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Implementing a new interpreter loop over JS opcodes. We cannot use
the generic interpreter tier&#x27;s &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;630d2c45fc127a44756e3cca8cef51c35654a4b4&#x2F;js&#x2F;src&#x2F;vm&#x2F;Interpreter.cpp#2179&quot;&gt;main
loop&lt;&#x2F;a&gt;
(what SpiderMonkey calls the &quot;C++ interpreter&quot;), because the actions
for each opcode in that implementation are &quot;generic&quot; -- they do not
use ICs to specialize on types or other kinds of fast-paths -- and
so are not suitable for our purposes. Likewise, we cannot use the
baseline interpreter&#x27;s main loop because it is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;630d2c45fc127a44756e3cca8cef51c35654a4b4&#x2F;js&#x2F;src&#x2F;jit&#x2F;Ion.cpp#143&quot;&gt;generated at
startup&lt;&#x2F;a&gt;
using the JIT backend, and so is not suitable for use in a context
where we can only run portable C++.&lt;&#x2F;p&gt;
&lt;p&gt;Instead, we need to implement a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;gecko-dev&#x2F;blob&#x2F;981fbf34ea6ee1400136cf94ed04e1105adf8799&#x2F;js&#x2F;src&#x2F;vm&#x2F;PortableBaselineInterpret.cpp#L2533&quot;&gt;new interpreter
loop&lt;&#x2F;a&gt;
whose actions for each opcode invoke ICs where appropriate --
exactly the actions that the baseline interpreter does, but written
in portable code. This is superficially &quot;simple&quot;, but turns out to
require careful attention to many subtle details, because
handwritten JIT-compiled code can control some aspects of execution
much more precisely than C++ ordinarily can. (More on this below!)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Implementing an interpreter for
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;630d2c45fc127a44756e3cca8cef51c35654a4b4&#x2F;js&#x2F;src&#x2F;jit&#x2F;CacheIR.h#27&quot;&gt;CacheIR&lt;&#x2F;a&gt;,
the intermediate representation in which the &quot;fast-path code&quot; for IC
stubs is represented. CacheIR opcodes encode the &quot;guards&quot;, or
preconditions necessary for a fast-path to apply, and the actions to
perform. Many CacheIR opcodes are specialized to particular
data structures or runtime state -- it is a heavily custom IR -- but
this tight fit to SpiderMonkey&#x27;s design is exactly what gives it its
ability to concisely encode many fast-paths.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;In principle, developing an interpreter for an IR that already has two
compilers (to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;630d2c45fc127a44756e3cca8cef51c35654a4b4&#x2F;js&#x2F;src&#x2F;jit&#x2F;BaselineCacheIRCompiler.cpp&quot;&gt;machine
code&lt;&#x2F;a&gt;
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;630d2c45fc127a44756e3cca8cef51c35654a4b4&#x2F;js&#x2F;src&#x2F;jit&#x2F;WarpCacheIRTranspiler.cpp&quot;&gt;optimizing compiler IR
(MIR)&lt;&#x2F;a&gt;)
should be relatively straightforward: we transliterate the actions
that the compiled code is performing into a direct C++
implementation. In a system as complex as a JavaScript engine, though,
nothing is ever quite &quot;simple&quot;. Challenges encountered in implementing
the CacheIR interpreter fall into two general categories: aspects of
execution that cannot be directly replicated in C++ code and so need
to be &quot;emulated&quot; in some way; and achieving practical performance by
keeping the &quot;virtual machine&quot; model lightweight and playing some other
tricks too. We&#x27;ll give a few examples of each kind of challenge below.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;challenge-baseline-stack-layout&quot;&gt;Challenge: Baseline Stack Layout&lt;&#x2F;h3&gt;
&lt;p&gt;The first challenge is &lt;em&gt;emulating the stack&lt;&#x2F;em&gt; as
the JIT-compiled code would have managed it. SpiderMonkey&#x27;s baseline
tiers build a series of stack frames with a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;37d9d6f0a77147292a87ab2d7f5906a62644f455&#x2F;js&#x2F;src&#x2F;jit&#x2F;JitFrames.h#37&quot;&gt;carefully-defined
format&lt;&#x2F;a&gt;
that can be traversed for various purposes: finding (and updating)
garbage-collection roots, handling exceptions and unwinding execution,
producing stack backtraces, providing information to the debugger API
and allowing the debugger to update state, and so on.&lt;&#x2F;p&gt;
&lt;p&gt;The format of a single stack frame consists of a &lt;code&gt;JitFrame&lt;&#x2F;code&gt; that looks
a lot like a &quot;normal&quot; function prologue&#x27;s frame -- return address,
previous frame pointer -- but also includes a &quot;callee token&quot; that the
caller pushes, describing the called function at the JS level, and a
&quot;receiver&quot; (the &lt;code&gt;this&lt;&#x2F;code&gt; value in JavaScript). The &lt;code&gt;BaselineFrame&lt;&#x2F;code&gt; below
that records the JS bytecode virtual machine state in a known format,
so it can be introspected: current bytecode PC, current IC slot, and
so on. Below that, the JS bytecode VM&#x27;s operand&#x2F;value stack is
maintained on the real machine stack. And, just before calling any
other function, a &quot;footer&quot; descriptor is pushed: this denotes the kind
of frame that just finished, so it can be handled appropriately.&lt;&#x2F;p&gt;
&lt;p&gt;This format has a very important property: it has &lt;em&gt;no gaps&lt;&#x2F;em&gt;. It is not
simply a linked list of fixed-size descriptor or header structures. If
it were, we could potentially place &lt;code&gt;BaselineFrame&lt;&#x2F;code&gt; &#x2F; &lt;code&gt;JitFrame&lt;&#x2F;code&gt;
instances on the C++ stack in PBL, and link them together with the
previous-FP fields as normal. But this won&#x27;t work, because every
machine word of the baseline-format stack is accounted for.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-baseline-stack-web.svg&quot; alt=&quot;Figure: baseline stack&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This works fine for JIT-compiled code, because we control the code
that is emitted and can maintain whatever stack-format we define.  But
because the C++ compiler owns and manages the machine stack layout
when we are running in C++ code, PBL is not able to maintain the
actual machine stack in this format.&lt;&#x2F;p&gt;
&lt;p&gt;Thus, we instead define an &lt;em&gt;auxiliary stack&lt;&#x2F;em&gt;, build a series of real
baseline frames on it, and maintain this in parallel to the executing
C++ code&#x27;s actual machine stack. When we enter a new frame at the C++
level, we push a new frame on the auxiliary stack; when we return, we
pop a frame.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#5&quot;&gt;5&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; This auxiliary stack is what the garbage collector,
exception unwinder, debugger, and other components introspect: we
store pointers to its frames in the JIT state, and so on. As far as
the rest of the engine is concerned, it is the real stack. The only
major difference is that all return addresses are &lt;code&gt;nullptr&lt;&#x2F;code&gt;s: we don&#x27;t
need them, because we still manage control flow at the C++ level.&lt;&#x2F;p&gt;
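&lt;p&gt;A minimal sketch of the auxiliary-stack bookkeeping, assuming an
invented word-array layout far simpler than the real frame format:&lt;&#x2F;p&gt;

```cpp
// Illustrative sketch: a separately allocated region holds
// baseline-format frames, pushed and popped in lockstep with C++
// interpreter calls and returns. Note the null (0) return-address
// word: control flow stays at the C++ level.
struct AuxStack {
  long long words[256];
  int sp;  // index of the next free slot, growing downward from 256
};

void pushWord(AuxStack* stk, long long w) {
  stk->sp = stk->sp - 1;
  stk->words[stk->sp] = w;
}

// Entering a frame: push a dummy return address and the previous
// frame pointer, as the frame format expects; return the new FP
// (an index into the word array, in this toy model).
int pushFrame(AuxStack* stk, int prevFP) {
  pushWord(stk, 0);       // return address: always null under PBL
  pushWord(stk, prevFP);  // caller's frame pointer
  return stk->sp;
}

// Leaving a frame: restore the caller's FP and discard the frame.
int popFrame(AuxStack* stk, int fp) {
  int prevFP = (int)stk->words[fp];
  stk->sp = fp + 2;       // pop the two header words
  return prevFP;
}
```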
&lt;h3 id=&quot;challenge-unwinding&quot;&gt;Challenge: Unwinding&lt;&#x2F;h3&gt;
&lt;p&gt;A second issue that arises from the differences between a native
machine model and that of PBL is &lt;em&gt;unwinding&lt;&#x2F;em&gt;. In JIT code, where we
have complete control over emitted instructions, the call stack is
just a convention and we are free to skip over frames and jump to any
code location we please. The exception unwinder uses this to great
effect: when an exception is thrown, the runtime walks the stack and
looks for any appropriate handler. This might be several call-frames
up the stack. When one is found, it sets the stack pointer and frame
pointer to within that frame -- effectively popping all deeper frames
in one action -- and jumps directly to the appropriate program counter
in that handler&#x27;s frame. Unfortunately, this is not possible to do
directly in portable C++ code.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#6&quot;&gt;6&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Instead, starting from the invariant that one C++ frame in the PBL
interpreter function &quot;owns&quot; one (or more as an optimization -- see
below) baseline frames, we implement a &lt;em&gt;linear-time&lt;&#x2F;em&gt; unwinding scheme:
each C++ interpreter invocation remembers its &quot;entry frame&quot;; when
unwinding, after an exception or otherwise, we compare the new
frame-pointer value to this entry frame; if &quot;above&quot; (higher in
address, for a downward-growing stack), we return from the interpreter
function with a special code indicating an unwind is happening. The
caller instance of the PBL interpreter function then performs the same
logic, propagating upward until we reach the correct C++ frame. The
following figure contrasts the native and PBL strategies:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-unwind-web.svg&quot; alt=&quot;Figure: native baseline and PBL unwinding&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Thus, we do not have the same asymptotic &lt;code&gt;O(1)&lt;&#x2F;code&gt; unwind-efficiency
guarantee that native baseline execution does, but we remain
completely portable, able to execute anywhere that standard C++ runs.&lt;&#x2F;p&gt;
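&lt;p&gt;The per-invocation decision is simple; a hedged sketch (invented
names, reduced to a frame-pointer comparison) of the check each
interpreter invocation makes when an unwind propagates:&lt;&#x2F;p&gt;

```cpp
// Illustrative sketch of PBL-style linear-time unwinding: each C++
// interpreter invocation remembers the frame pointer of its entry
// frame; on unwind, it keeps returning a sentinel until the
// invocation owning the target frame is reached.
enum class Exec { Ok, Unwind };

struct UnwindState {
  long long targetFP;  // frame pointer of the handler's frame
};

Exec maybePropagate(long long entryFP, UnwindState st) {
  // With a downward-growing stack, a target frame at a higher address
  // than our entry frame belongs to an outer (caller) invocation.
  if (st.targetFP > entryFP) {
    return Exec::Unwind;  // handler is in a caller; return to it
  }
  return Exec::Ok;        // handler is in a frame we own; resume here
}
```

&lt;p&gt;Each invocation that sees &lt;code&gt;Unwind&lt;&#x2F;code&gt; simply returns,
repeating the check one level up until the owning invocation resumes
execution at the handler.&lt;&#x2F;p&gt;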
&lt;h3 id=&quot;challenge-vm-exits&quot;&gt;Challenge: VM exits&lt;&#x2F;h3&gt;
&lt;p&gt;A third issue that often arose was that of &lt;em&gt;emulating VM exits&lt;&#x2F;em&gt;. On
the native baseline platform, when JIT code is executing, the stack is
&quot;under construction&quot;, in a sense: the innermost frame is not complete
(there is no footer descriptor word) and is not reachable from the VM
data structures. JIT code can call back into the runtime only via a
carefully-controlled &quot;VM exit&quot; mechanism, which pushes a special kind
of &quot;exit frame&quot;, records the end of the series of contiguous JIT
frames (the &quot;JIT activation&quot;), and then invokes C++ code:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;# JIT code:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  push arg2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  push arg1                # trampoline&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  call VMHelper  -----&amp;gt;    push exit frame&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                           cx-&amp;gt;exitFP = fp&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                           call VMHelperImpl    -----&amp;gt;  walkStack(cx-&amp;gt;exitFP)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                                        doStuff(cx)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                           pop exit frame       &amp;lt;-----  ret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  ...            &amp;lt;-----    ret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;which results in a stack layout that looks like:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-vmexit-stack-web.svg&quot; alt=&quot;Figure: the baseline stack after a proper VM exit&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;While executing within the C++ PBL interpreter function, it is very
tempting to simply call into the rest of the runtime as required. This
results in a stack that looks like the below, and unfortunately breaks
in all sorts of exciting and subtle ways: it may appear to work, but
frames are missing and GC roots are not updated after a moving GC; or
if the dangling exit FP is not null, an entirely bogus set of stack
frames may be traced. Either way, various impossible-to-find bugs
arise.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2023-10-03-vmexit-stack-incomplete-web.svg&quot; alt=&quot;Figure: the baseline stack after calling into the runtime without a proper VM exit&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;PBL thus requires extreme discipline in separating &quot;JIT-code mode&quot; (or
its emulation, in a portable C++ interpreter) and &quot;runtime mode&quot;. To
make this distinction clearer, I designed a type-enforced mechanism
that leverages an important idiom in SpiderMonkey: every function that
might perform a GC or otherwise introspect overall VM state will take
a &lt;code&gt;JSContext&lt;&#x2F;code&gt; parameter. In the PBL interpreter function, we hide the
&lt;code&gt;JSContext&lt;&#x2F;code&gt; (rename the local and set it to &lt;code&gt;nullptr&lt;&#x2F;code&gt; normally). We
then have a helper RAII class that pushes an exit frame and does
everything that a &quot;VM exit&quot; trampoline would do, then behaves as a
&lt;em&gt;restricted-scope local&lt;&#x2F;em&gt; that implicitly converts to the true
&lt;code&gt;JSContext&lt;&#x2F;code&gt;. This looks like the below:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;cpp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;  CASE&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Value arg0 &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; POP&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section z-punctuation&quot;&gt;().&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;asValue&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;  &#x2F;&#x2F; POP() is a macro that uses the `sp` local.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    Value arg1 &lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;=&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; POP&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section z-punctuation&quot;&gt;().&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;asValue&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; here, `sp` is the top-of-stack for our in-progress frame in our&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; in-progress activation. We are &amp;quot;in JIT code&amp;quot; from the engine&amp;#39;s&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; perspective, even though this is still C++.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;    {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;      &#x2F;&#x2F; This macro completes the activation and creates a `cx` local&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;      &#x2F;&#x2F; that gives us the JSContext* for use.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;      PUSH_EXIT_FRAME&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;()&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;span&gt; &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;      if&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt; (&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;!&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;DoEngineThings&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;(&lt;&#x2F;span&gt;&lt;span&gt;cx&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span&gt; arg0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span&gt; arg1&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;)) {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;        goto&lt;&#x2F;span&gt;&lt;span&gt; error&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-terminator&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;      }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;    }&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt; &#x2F;&#x2F; pops the exit frame.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-section&quot;&gt;  }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This idiom works fairly well in practice and statically prevents us
from making most kinds of stack-frame-related mistakes.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;optimization-avoiding-function-call-overhead&quot;&gt;Optimization: Avoiding Function-Call Overhead&lt;&#x2F;h3&gt;
&lt;p&gt;At this point, we have introduced techniques to enable PBL to run
correctly; we now have a functioning JavaScript interpreter that can
invoke ICs. (Take a breath and celebrate!) Unfortunately, I arrived at
this point and found that performance still lagged behind that of the
generic interpreter. How could this be, when ICs directly encode
fast-paths and allow us to short-circuit expensive runtime calls?&lt;&#x2F;p&gt;
&lt;p&gt;The first realization came after profiling both a native build of PBL,
and especially a Wasm build: C++ function calls can be
&lt;em&gt;expensive&lt;&#x2F;em&gt;. The basic PBL design consisted of a JS interpreter that
invoked the IC interpreter for every opcode with an IC -- a majority
of them, in most programs (all numeric operators, property accesses,
function calls, and so on!). Function calls are thus extremely
frequent, and their high cost stems from a few basic reasons:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;When the interpreter function is large and has a lot of context
(live variables), register pressure is high; when the called
function is similar, we effectively have a full &quot;context switch&quot;
(spilling all live register values and refilling the registers with
the callee&#x27;s variables) on every call&#x2F;return.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Splitting logic across multiple functions precludes optimizations
that span the logic of both functions. For example, the IC
interpreter &quot;reified control flow as data&quot; by returning an enum
value that the JS interpreter then switched on. Combining the two
functions would allow us to embed the switch-bodies directly where
the return code is set.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;On many Wasm implementations, including
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;wasmtime.dev&#x2F;&quot;&gt;Wasmtime&lt;&#x2F;a&gt; (my VM of choice and the main
optimization target for our WASI port), function prologues have some
extra cost: the generated code needs to check for stack overflow,
and may need to check for interruption or preemption. This is a part
of the cost of sandboxing that can only be avoided by staying within
a single function frame.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Thus, it is very important to avoid function-call overhead whenever
possible. I optimized this in two ways. First, the IC interpreter is
aggressively inlined into the JS interpreter. This produces one
super-interpreter that can run both kinds of bytecode -- JS bytecodes
and CacheIR -- without extra frame setup at every IC site.&lt;&#x2F;p&gt;
&lt;p&gt;Second, and more important in practice, &lt;em&gt;multiple JS frames&lt;&#x2F;em&gt; are
handled by one C++ frame (interpreter invocation). In a technique
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;37d9d6f0a77147292a87ab2d7f5906a62644f455&#x2F;js&#x2F;src&#x2F;vm&#x2F;Interpreter.cpp#3464-3488&quot;&gt;borrowed from SpiderMonkey&#x27;s generic
interpreter&lt;&#x2F;a&gt;,
when certain conditions are met, we
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;gecko-dev&#x2F;blob&#x2F;c60c6d313c9fbbd50fd64e9b46d87bbf01e3dcc9&#x2F;js&#x2F;src&#x2F;vm&#x2F;PortableBaselineInterpret.cpp#L4104-L4196&quot;&gt;handle&lt;&#x2F;a&gt;
a JS call opcode by pushing a JS frame and dispatching directly to the
callee&#x27;s first opcode without any C++-level call. (This may be an
obvious implementation to anyone who has written an interpreter-based
virtual machine before, but disentangling C++ frames and JS frames is
actually not trivial at all, given the prologue&#x2F;epilogue logic --
hence the required conditions!) This interacts in subtle ways with the
unwinding described above: it means that the mapping from JS to C++
frames is 1-to-many, and thus requires some care. (As a silver lining,
however, the logic for a &quot;return&quot; is substantially similar to that for
&quot;unwind&quot;: we can use the same conditions to know when to return at the
C++ level.)&lt;&#x2F;p&gt;
&lt;h3 id=&quot;optimization-hybrid-ics&quot;&gt;Optimization: Hybrid ICs&lt;&#x2F;h3&gt;
&lt;p&gt;Having implemented all of the above techniques, I was still finding
PBL to have somewhat disappointing performance numbers. Fortunately,
one final insight came: perhaps the tradeoffs related to which
operations are profitable to fast-path &lt;em&gt;change&lt;&#x2F;em&gt;, when the cost of the
fast-path mechanism (an IC) itself changes?&lt;&#x2F;p&gt;
&lt;p&gt;For example: in native baseline execution, every arithmetic operator
uses ICs to dispatch to type-specific behavior. The &lt;code&gt;+&lt;&#x2F;code&gt; operator, our
favorite example, has possible fast-paths for integers, floating-point
numbers, strings, and more. This is profitable in &quot;native baseline&quot;
because the cost of an IC is extremely low: the JIT controls register
allocation so it can effectively do global allocation across the
function body and IC stubs by using special-purpose registers and a
custom calling convention, and it can avoid generating any
prologue&#x2F;epilogue in the IC stubs themselves. As a result, ICs can
literally be a handful of instructions: call, check type tag in
registers 0 and 1, integer add, return. PBL, in contrast, is both
emulating virtual-machine state (rather than using an optimized IC
calling convention), and paying the interpreter-dispatch cost for
every IC opcode.&lt;&#x2F;p&gt;
&lt;p&gt;So I ran a simple experiment: in a native PBL build, I added &lt;code&gt;rdtsc&lt;&#x2F;code&gt;
(CPU timestamp counter)-based timing measurements around execution of each
JS opcode both in the generic interpreter and in PBL&#x27;s interpreter
loop, and binned the results by opcode type. The results were
fascinating: property accesses (e.g., &lt;code&gt;GetProp&lt;&#x2F;code&gt;) were significantly
faster with ICs, for example, but many simpler operators, like &lt;code&gt;Add&lt;&#x2F;code&gt;,
were twice as slow.&lt;&#x2F;p&gt;
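The experiment can be sketched roughly as follows. This is a hypothetical, portable stand-in (a steady clock instead of the raw `rdtsc` instruction, and a helper function rather than the macros an interpreter loop would actually use), just to show the binning structure:

```cpp
#include <chrono>
#include <cstdint>
#include <map>
#include <string>

// Stand-in for `rdtsc`: on x86 one would read the timestamp counter
// directly; a steady clock keeps this sketch portable.
static uint64_t ReadTicks() {
  return static_cast<uint64_t>(
      std::chrono::steady_clock::now().time_since_epoch().count());
}

// Per-opcode accumulators: total ticks and execution count, so mean cost
// per opcode can be compared between two interpreter configurations.
struct OpcodeStats { uint64_t ticks = 0; uint64_t count = 0; };
static std::map<std::string, OpcodeStats> g_bins;

// Wrap the execution of one opcode; an interpreter loop would invoke
// this (or an equivalent macro) around each case body.
template <typename F>
void TimeOpcode(const std::string& opcode, F&& body) {
  uint64_t start = ReadTicks();
  body();
  uint64_t end = ReadTicks();
  g_bins[opcode].ticks += end - start;
  g_bins[opcode].count += 1;
}
```

Binning by opcode type, as above, is what exposes that some opcodes win with ICs while others lose.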
&lt;p&gt;Given this data, I developed the &quot;hybrid ICs&quot; approach, namely:
use ICs only where they help! For the &lt;code&gt;Add&lt;&#x2F;code&gt; operator, the PBL
interpreter now has &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;gecko-dev&#x2F;blob&#x2F;c60c6d313c9fbbd50fd64e9b46d87bbf01e3dcc9&#x2F;js&#x2F;src&#x2F;vm&#x2F;PortableBaselineInterpret.cpp#L2806-L2841&quot;&gt;specific cases for integer and floating-point
addition&lt;&#x2F;a&gt;,
and otherwise invokes the generic interpreter&#x27;s behavior (&lt;code&gt;AddOperation&lt;&#x2F;code&gt;);
it never invokes the IC chain, but rather skips over it entirely. This
behavior is configurable -- with faster IC mechanisms in the future,
we may be able to use ICs for these opcodes again, so the code for
both strategies remains.&lt;&#x2F;p&gt;
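A rough sketch of the hybrid strategy for `Add` (using a hypothetical simplified tagged value; the real `Value` representation and `AddOperation` are considerably more involved):

```cpp
#include <cassert>
#include <cstdint>
#include <variant>

// Hypothetical simplified stand-in for a tagged JS value.
using Value = std::variant<int32_t, double>;

// Stand-in for the engine's generic AddOperation (which also handles
// strings, objects, valueOf, and so on in the real engine).
static Value GenericAdd(Value a, Value b) {
  double x = std::holds_alternative<int32_t>(a) ? std::get<int32_t>(a)
                                                : std::get<double>(a);
  double y = std::holds_alternative<int32_t>(b) ? std::get<int32_t>(b)
                                                : std::get<double>(b);
  return Value{x + y};
}

// Hybrid strategy: inline the common int32/double fast paths directly in
// the interpreter's Add case, skipping the IC chain entirely, and fall
// back to the generic behavior for everything else.
Value HybridAdd(Value a, Value b) {
  if (std::holds_alternative<int32_t>(a) && std::holds_alternative<int32_t>(b)) {
    int64_t r = int64_t{std::get<int32_t>(a)} + int64_t{std::get<int32_t>(b)};
    if (r >= INT32_MIN && r <= INT32_MAX) {
      return Value{static_cast<int32_t>(r)};  // no overflow: stay an int32
    }
    return Value{static_cast<double>(r)};     // overflow: promote to double
  }
  if (std::holds_alternative<double>(a) && std::holds_alternative<double>(b)) {
    return Value{std::get<double>(a) + std::get<double>(b)};
  }
  return GenericAdd(a, b);  // everything else: generic slow path
}
```

The point is that the common int32 and double cases cost only a couple of tag checks inline, while everything else goes through the generic path, and the IC chain is never consulted.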
&lt;p&gt;The results were striking: PBL was finally showing significant
speedups on almost all benchmarks. The final &quot;hybrid IC&quot; set includes:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Property accesses.&lt;&#x2F;em&gt; These are extremely common in most JavaScript
code, and can benefit from fast-path behavior whenever objects
usually have the same &quot;shape&quot;, or set of properties, at a given
point. This is because the engine can encode a fast-path that
directly accesses a particular memory offset in the object in
memory, without looking up a property by name.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;em&gt;Calls.&lt;&#x2F;em&gt; This is somewhat less intuitive: for an ordinary call to
another JavaScript function, there is not much an IC can do -- we
just need to update interpreter state to the callee and start
dispatching. But for calls to built-in functions, as described
above, the benefits can be huge: string and array operations, for
example, transform from an expensive call into the runtime (through
several layers of generic JS function-call logic) into just a few
direct field accesses or other operations on a known object type.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Every other JS opcode is executed with generic logic.&lt;&#x2F;p&gt;
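To illustrate why shape-guarded property accesses remain profitable even in an interpreter, here is a minimal sketch (a hypothetical simplified object model, not SpiderMonkey's): the fast path is one pointer comparison plus an indexed load, versus a by-name map lookup on the generic path.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical simplified object model: a Shape maps property names to
// slot indices and is shared by all objects with the same layout.
struct Shape { std::map<std::string, size_t> slots; };

struct Object {
  const Shape* shape;
  std::vector<int64_t> slot_values;
};

// Generic path: look the property up by name through the shape.
int64_t GetPropGeneric(const Object& obj, const std::string& name) {
  return obj.slot_values[obj.shape->slots.at(name)];
}

// A "stub" caches a previously seen shape and the slot offset for the
// property; the fast path does no name lookup at all.
struct GetPropStub { const Shape* guard_shape; size_t slot; };

int64_t GetPropWithIC(const Object& obj, const std::string& name,
                      GetPropStub& stub) {
  if (obj.shape == stub.guard_shape) {
    return obj.slot_values[stub.slot];  // fast path: shape guard passed
  }
  // Miss: fall back to the generic lookup and (re)fill the stub.
  stub = GetPropStub{obj.shape, obj.shape->slots.at(name)};
  return GetPropGeneric(obj, name);
}
```

When objects at a given site usually share a shape, nearly every access takes the guarded fast path.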
&lt;h2 id=&quot;results&quot;&gt;Results&lt;&#x2F;h2&gt;
&lt;p&gt;Enough description -- how well does it perform?&lt;&#x2F;p&gt;
&lt;p&gt;The best test of any language runtime or platform is a &quot;real-world&quot;
use-case, and PBL has been fortunate to see some early adoption, where
two real applications saw wall-clock CPU time reductions of 42% and
17%, respectively, when executing on a Wasm&#x2F;WASI platform. That is
quite significant and exciting, and is motivating adoption and
further development of PBL.&lt;&#x2F;p&gt;
&lt;p&gt;While developing PBL, I did most of my benchmarking with
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;chromium.github.io&#x2F;octane&#x2F;&quot;&gt;Octane&lt;&#x2F;a&gt;, which is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;v8.dev&#x2F;blog&#x2F;retiring-octane&quot;&gt;deprecated&lt;&#x2F;a&gt; but still useful
when hacking on the core of a JS engine (one just needs to give the
appropriate caveats that benchmark speedups will have an uncertain
correlation to real-world speedups). On Octane, PBL currently sees a
1.26x speedup (that is, throughput is 26% higher; or, equivalently,
there is a runtime reduction of &lt;code&gt;1 - 1&#x2F;1.26&lt;&#x2F;code&gt;, or 21%). That is quite
something as well, for a new engine tier that remains completely
portable as a pure interpreter!&lt;&#x2F;p&gt;
&lt;p&gt;Because of these exciting results, and our future plans below, we have
worked with the SpiderMonkey team themselves to plan &lt;em&gt;upstreaming&lt;&#x2F;em&gt; --
incorporating PBL into the main SpiderMonkey tree. This will ease
maintenance because it will allow PBL to be updated and evolved (i.e.,
kept compiling and running) as SpiderMonkey itself does, will allow us
to use SpiderMonkey without a heavy patch-stack on top, and will make
PBL available for others to use as well. We believe it could be useful
beyond the Wasm&#x2F;WASI world: for example, high-security contexts that
disallow JIT could benefit as well. The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bugzilla.mozilla.org&#x2F;show_bug.cgi?id=1855321&quot;&gt;upstreaming
code-review&lt;&#x2F;a&gt; is
in-progress and we look forward to completing it!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;future-compiled-code&quot;&gt;Future: Compiled Code&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;em&gt;Note: this section describes my own thoughts and plans, but goes
beyond what is currently being upstreamed into SpiderMonkey, and is
not necessarily endorsed yet by upstream. My plan and hope is to
develop the ideas to maturity and, if results hold up, propose
additional upstreaming -- but that is further out.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;p&gt;PBL has an attractive simplicity as a pure interpreter, and has
surprised us with speedups even under that restriction. However, the
larger question, for me at least, has always been: how can we
&lt;em&gt;compile&lt;&#x2F;em&gt; JS ahead-of-time in a performant way?&lt;&#x2F;p&gt;
&lt;p&gt;Recall that the main restriction of the WebAssembly platform is not
that we can&#x27;t generate code at all; it&#x27;s just that all code, no matter
the producer (the traditional Wasm compiler toolchain or our own JS
tools), needs to be generated before any execution occurs.&lt;&#x2F;p&gt;
&lt;p&gt;SpiderMonkey&#x27;s native baseline tiers hint at a way forward here. PBL
as described above is roughly equivalent to the baseline interpreter
(modulo the &lt;em&gt;way&lt;&#x2F;em&gt; that ICs are executed). Can we (i) produce compiled
code for ICs, and (ii) do the equivalent of the baseline &lt;em&gt;compiler&lt;&#x2F;em&gt;,
generating a specialized Wasm function for every JS function body?&lt;&#x2F;p&gt;
&lt;p&gt;In principle, this should be possible without information from
execution, because the type-specific specialization is handled by the
&lt;em&gt;runtime dispatch&lt;&#x2F;em&gt; inherent in the IC chains. In other words,
types are late-binding, so we retain late-binding control flow to
match.&lt;&#x2F;p&gt;
&lt;p&gt;This still requires us to know what &lt;em&gt;possible&lt;&#x2F;em&gt; ICs we might need, but
here we can play a trick: we can collect many IC bodies ahead of time,
and generate straight-line compiled Wasm functions for these IC
bodies. This is more-or-less the trick we described in our post &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;making-javascript-run-fast-on-webassembly&quot;&gt;two
years
ago&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;But all of this still implies the development of a Wasm &lt;em&gt;compiler&lt;&#x2F;em&gt;
backend. How does PBL help us at all? Isn&#x27;t it a dead-end, if we are
eventually able to compile JS source (which we typically have
available ahead-of-time -- performance-critical &lt;code&gt;eval()&lt;&#x2F;code&gt; usage is
rare) straight to specialized Wasm, with late-bound ICs?&lt;&#x2F;p&gt;
&lt;p&gt;The answer to that lies in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Partial_evaluation&quot;&gt;partial
evaluation&lt;&#x2F;a&gt;. Over
the past year I have developed a tool called
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;weval&quot;&gt;weval&lt;&#x2F;a&gt; that takes an interpreter in
a Wasm module, with a few minor intrinsic-call annotations (to specify
what to specialize, and to denote that memory storing bytecode is
&quot;constant&quot; and can be assumed not to self-modify dynamically), and
generates a Wasm module with specialized functions appended. This
gives us a compiler &quot;for free&quot; once we have an interpreter, and PBL
has been designed to be that interpreter.&lt;&#x2F;p&gt;
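As a conceptual illustration of the partial-evaluation step (a toy interpreter and a hand-written residual function; weval's actual interface and output differ), specializing an interpreter on a constant bytecode sequence removes the dispatch loop entirely:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class Op : uint8_t { Add, Mul, Halt };  // hypothetical tiny ISA

// Generic interpreter: dispatches on each opcode at runtime.
int32_t Interpret(const std::vector<Op>& code, int32_t acc, int32_t operand) {
  for (Op op : code) {
    switch (op) {
      case Op::Add: acc += operand; break;
      case Op::Mul: acc *= operand; break;
      case Op::Halt: return acc;
    }
  }
  return acc;
}

// What a specializer conceptually produces for the constant program
// {Add, Mul, Halt}: the loop and switch are gone; only the residual
// straight-line computation remains. (weval derives such functions
// automatically from the interpreter; this is a hand-written stand-in.)
int32_t Specialized_AddMul(int32_t acc, int32_t operand) {
  acc += operand;  // residue of Op::Add
  acc *= operand;  // residue of Op::Mul
  return acc;      // residue of Op::Halt
}
```

The specialized function computes exactly what the interpreter computes on that bytecode, which is what makes the transform semantics-preserving and independently checkable.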
&lt;p&gt;In particular, the JS opcode and IC opcode interpreters in PBL were
designed carefully to work efficiently with weval, and in a next step
to the project (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;gecko-dev&#x2F;tree&#x2F;pbl-weval&quot;&gt;development
branch&lt;&#x2F;a&gt;), I have
the whole thing working. Whereas pure-interpreter PBL got a 1.26x
speedup on Octane, PBL with weval gets a 1.58x speedup, up to 2.4x or
so, with a bunch of low-hanging fruit remaining that will hopefully
push that number further.&lt;&#x2F;p&gt;
&lt;p&gt;This combination isn&#x27;t quite ready for production use yet, but I
continue to polish it, and we hope sometime early next year it will be
ready, taking us to &quot;conceptual parity&quot; (if not engineering
fine-tuning parity!) with SpiderMonkey&#x27;s native baseline compiler. We
have some more thoughts on going beyond that -- inlining ICs like
WarpMonkey does, hoisting guards, and all the rest -- but more on that
in due time.&lt;&#x2F;p&gt;
&lt;p&gt;Given all of that, one could compare PBL and PBL+weval to
SpiderMonkey&#x27;s existing tiers. Recall our table above:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Data Required&lt;&#x2F;th&gt;&lt;th&gt;JS opcode dispatch&lt;&#x2F;th&gt;&lt;th&gt;ICs&lt;&#x2F;th&gt;&lt;th&gt;Optimization scope&lt;&#x2F;th&gt;&lt;th&gt;CacheIR dispatch&lt;&#x2F;th&gt;&lt;th&gt;Codegen at runtime&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Generic (C++) interpreter&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode&lt;&#x2F;td&gt;&lt;td&gt;Interpreter (C++)&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;td&gt;None&lt;&#x2F;td&gt;&lt;td&gt;N&#x2F;A&lt;&#x2F;td&gt;&lt;td&gt;No&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Baseline interpreter&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures&lt;&#x2F;td&gt;&lt;td&gt;Interpreter (generated at startup)&lt;&#x2F;td&gt;&lt;td&gt;Dynamic dispatch&lt;&#x2F;td&gt;&lt;td&gt;Special cases within one opcode via IC&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;Yes (IC bodies, interp body)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Baseline compiler&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures&lt;&#x2F;td&gt;&lt;td&gt;Compiled function body (1:1 with bytecode)&lt;&#x2F;td&gt;&lt;td&gt;Dynamic dispatch&lt;&#x2F;td&gt;&lt;td&gt;Special cases within one opcode via IC&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;Yes (IC bodies, function bodies)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Optimizing compiler (WarpMonkey)&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + warmed-up IC chains&lt;&#x2F;td&gt;&lt;td&gt;Compiled function body (optimized)&lt;&#x2F;td&gt;&lt;td&gt;Inlined&lt;&#x2F;td&gt;&lt;td&gt;Across opcodes &#x2F; whole function&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;Yes (optimized function body)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;To which we could add the row:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Data Required&lt;&#x2F;th&gt;&lt;th&gt;JS opcode dispatch&lt;&#x2F;th&gt;&lt;th&gt;ICs&lt;&#x2F;th&gt;&lt;th&gt;Optimization scope&lt;&#x2F;th&gt;&lt;th&gt;CacheIR dispatch&lt;&#x2F;th&gt;&lt;th&gt;Codegen at runtime&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;PBL (interpreter)&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures&lt;&#x2F;td&gt;&lt;td&gt;Interpreter (C++)&lt;&#x2F;td&gt;&lt;td&gt;Dynamic dispatch&lt;&#x2F;td&gt;&lt;td&gt;Special cases within one opcode via IC&lt;&#x2F;td&gt;&lt;td&gt;Interpreter (C++)&lt;&#x2F;td&gt;&lt;td&gt;No&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;And then, with weval and pre-collected ICs (but no profiling of the JS code!), we could have:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Data Required&lt;&#x2F;th&gt;&lt;th&gt;JS opcode dispatch&lt;&#x2F;th&gt;&lt;th&gt;ICs&lt;&#x2F;th&gt;&lt;th&gt;Optimization scope&lt;&#x2F;th&gt;&lt;th&gt;CacheIR dispatch&lt;&#x2F;th&gt;&lt;th&gt;Codegen at runtime&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;PBL (wevaled)&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures&lt;&#x2F;td&gt;&lt;td&gt;Compiled function body (1:1 with bytecode)&lt;&#x2F;td&gt;&lt;td&gt;Dynamic dispatch&lt;&#x2F;td&gt;&lt;td&gt;Special cases within one opcode via IC&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;No (!!)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;which one will note is identical to the baseline-compiler row above,
except that no runtime codegen is required.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, if we have reliable profiling information, such as from a
profiling run at build time, we could use this profile (just as one
does in a standard C&#x2F;C++ &quot;PGO&quot; or &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Profile-guided_optimization&quot;&gt;profile-guided
optimization&lt;&#x2F;a&gt;&quot;
build) to &lt;em&gt;inline&lt;&#x2F;em&gt; the ICs. Note that this could be done in a way that
is &lt;em&gt;completely agnostic to the underlying interpreter&lt;&#x2F;em&gt;, because IC
invocations are just indirect calls: that is, it is also a
semantics-preserving, independently-verifiable transform. Having done
that, we would then have:&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Name&lt;&#x2F;th&gt;&lt;th&gt;Data Required&lt;&#x2F;th&gt;&lt;th&gt;JS opcode dispatch&lt;&#x2F;th&gt;&lt;th&gt;ICs&lt;&#x2F;th&gt;&lt;th&gt;Optimization scope&lt;&#x2F;th&gt;&lt;th&gt;CacheIR dispatch&lt;&#x2F;th&gt;&lt;th&gt;Codegen at runtime&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;PBL (wevaled + inlined ICs)&lt;&#x2F;td&gt;&lt;td&gt;JS bytecode + IC data structures + warmed-up IC chains&lt;&#x2F;td&gt;&lt;td&gt;Compiled function body (optimized)&lt;&#x2F;td&gt;&lt;td&gt;Inlined&lt;&#x2F;td&gt;&lt;td&gt;Across opcodes &#x2F; whole function&lt;&#x2F;td&gt;&lt;td&gt;Compiled&lt;&#x2F;td&gt;&lt;td&gt;No (!!)&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;which approximates WarpMonkey. Note that this will require significant
additional engineering -- SpiderMonkey&#x27;s native JITs, after all,
embody engineer-centuries of effort (much of which we leverage by
reusing its well-tuned CacheIR sequences, but much of which we can&#x27;t) --
but is a clear path to allow for optimized JS without runtime code
generation.&lt;&#x2F;p&gt;
&lt;p&gt;The thing that excites me most about this direction is that it is, in
some sense, &quot;deriving a JIT from scratch&quot;: we are writing down the
semantics of the opcodes, and we&#x27;re explicitly extracting fast-paths,
but we&#x27;re using semantics-preserving tools to go beyond that. (Weval&#x27;s
semantics are that it provides a function pointer to a specialized
function that behaves identically to the original.) That allows us to
decouple the correctness aspects of our work from performance, mostly,
and makes life far simpler -- no more insidious JIT bugs, or
divergence between the interpreter and compiler tiers.  More to come!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Many, many thanks to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;lukewagner&quot;&gt;Luke Wagner&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;fitzgeraldnick.com&#x2F;&quot;&gt;Nick
Fitzgerald&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;elliottt&quot;&gt;Trevor
Elliott&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;jamey.thesharps.us&#x2F;&quot;&gt;Jamey
Sharp&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;tia.mat.br&#x2F;&quot;&gt;L Pereira&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;michelledaviest.github.io&#x2F;&quot;&gt;Michelle Thalakottur&lt;&#x2F;a&gt;, and others with
whom I&#x27;ve discussed these ideas over the past several years. Thanks to Luke,
Nick, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;linclark&quot;&gt;Lin Clark&lt;&#x2F;a&gt;, and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.mgaudet.ca&#x2F;&quot;&gt;Matt
Gaudet&lt;&#x2F;a&gt; for feedback on this post. Thanks also to
Trevor Elliott and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;JakeChampion&quot;&gt;Jake Champion&lt;&#x2F;a&gt; for help in
getting PBL integrated with other infrastructure, Jamey Sharp for ramping up
efforts to fill out PBL&#x27;s CacheIR opcode support, and the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;spidermonkey.dev&#x2F;&quot;&gt;Mozilla SpiderMonkey
team&lt;&#x2F;a&gt; for graciously hearing our ideas and agreeing
to upstream this work.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;WebAssembly running in a browser could implement runtime code
generation and loading by calling out to external JavaScript. In
essence, it would generate a new Wasm module in memory, then
call JS to instantiate that module into an instance that shares
the current linear memory, and call into it. However, this is
fundamentally a feature of the Web platform and not built-in to
Wasm; and many Wasm platforms, especially those designed with
security among untrusted co-tenants in mind, do not allow this
and strictly enforce ahead-of-time compilation of fixed code
instead.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;Honestly, even after &lt;em&gt;writing a new interpreter tier&lt;&#x2F;em&gt; for
SpiderMonkey, I couldn&#x27;t tell you the answer to this. (I run the
bytecode, I don&#x27;t lower to it!) The language&#x27;s semantics are
something to marvel at, in the edge cases, and this is all the
more reason to centralize on a few well-tested, well-engineered
shared implementations.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;Note that there are some details here omitted for
simplicity. Most importantly, once inlining ICs, we need to
&lt;em&gt;hoist the guards&lt;&#x2F;em&gt;, or the conditions that exist at the
beginning of every IC, so that they are shared in common (or
removed entirely, if we can prove they will always
succeed). Consider a function that operates on floating-point
numbers: every IC will be some form of &quot;check if inputs are
floats, do operation, tag result as float&quot; but instead we could
check that the function arguments are floats &lt;em&gt;once&lt;&#x2F;em&gt;, then
propagate from &quot;produce float&quot; in one IC to &quot;check if float&quot; in
the next. ICs &lt;em&gt;enable&lt;&#x2F;em&gt; this by expressing the necessary
preconditions (checks) and postconditions (produced values and
their types) for each operator, and inlining is necessary as
well because it places everything in one code-path so it is in
scope to be cross-optimized; but guard-hoisting and -elimination
are JIT-compiler-specific optimizations.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;4&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;One can see this idea as a variant of the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;publications.sba-research.org&#x2F;publications&#x2F;dls10.pdf&quot;&gt;interpreter
quickening&lt;&#x2F;a&gt;
idea, in a way: the CacheIR sequences are a shorter or more
efficient implementation of particular behavior that we rewrite
the interpreted program to use (via the pluggable IC sites) as
we learn more about its execution.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;5&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;5&lt;&#x2F;sup&gt;
&lt;p&gt;The correspondence isn&#x27;t actually 1-to-1, unfortunately (that
would have been much simpler!): instead, we sometimes push an
additional frame for VM exits, and we also handle some calls
&quot;inline&quot;, pushing the frame and going right to the top of the
dispatch loop again. The actual invariant is that every
auxiliary stack frame is &quot;owned&quot; by one C++ function invocation,
but there may be several such frames. It is thus a 1-to-many
relationship.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;6&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;6&lt;&#x2F;sup&gt;
&lt;p&gt;Strictly speaking, we could have used &lt;code&gt;setjmp()&lt;&#x2F;code&gt;&#x2F;&lt;code&gt;longjmp()&lt;&#x2F;code&gt; to
implement similar constant-time unwinding. However, this
interacts poorly with C++ destructors, and is also problematic
-- that is, does not exist -- on WebAssembly with
WASI. Eventually the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;WebAssembly&#x2F;exception-handling&#x2F;blob&#x2F;main&#x2F;proposals&#x2F;exception-handling&#x2F;Exceptions.md&quot;&gt;exception handling
proposal&lt;&#x2F;a&gt;
for Wasm may be directly usable for this purpose, but it is not
finalized yet.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Cranelift&#x27;s Instruction Selector DSL, ISLE: Term-Rewriting Made Practical</title>
        <published>2023-01-20T00:00:00+00:00</published>
        <updated>2023-01-20T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2023/01-20/cranelift-isle/"/>
        <id>https://cfallin.org/blog/2023/01-20/cranelift-isle/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2023/01-20/cranelift-isle/">&lt;p&gt;Today I&#x27;m going to be writing about &lt;strong&gt;ISLE&lt;&#x2F;strong&gt;, or the &quot;instruction
selection&#x2F;lowering expressions&quot; domain-specific language
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Domain_specific_language&quot;&gt;DSL&lt;&#x2F;a&gt;), which
over the past year we have designed, improved, and fully adopted in
the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&#x2F;&quot;&gt;Cranelift&lt;&#x2F;a&gt;
compiler
project. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&#x2F;isle&quot;&gt;ISLE&lt;&#x2F;a&gt;
is now used to express both our instruction-lowering patterns for each
of four target architectures and our machine-independent optimizing
rewrites. It allows us to develop these parts of the compiler in an
extremely productive way: we can write the key idea -- that one opcode
or instruction should map to another -- in a concise way, while
maintaining type-safety with an expressive type system, and allowing
us to use the declarative patterns for many different purposes.&lt;&#x2F;p&gt;
&lt;p&gt;The goal of this blog post is to illustrate the requirements and the
design-space that led to ISLE&#x27;s key ideas, and especially its
departures from other &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Rewriting#Term_rewriting_systems&quot;&gt;term-rewriting
systems&lt;&#x2F;a&gt;&quot;
and blending of ideas from backtracking languages like
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Prolog&quot;&gt;Prolog&lt;&#x2F;a&gt;. In particular, we&#x27;ll
talk about how ISLE has a strong type system, with terms of distinct
types (as opposed to one &quot;value&quot; type); how it matches on abstract
&quot;extractors&quot; provided by an embedding environment, which effectively
define a virtual &quot;input term&quot; without ever reifying it; and how it was
explicitly designed to have a simple
&quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Foreign_function_interface&quot;&gt;FFI&lt;&#x2F;a&gt;&quot; with
Rust for easy-to-understand, no-magic interactions with the rest of
the compiler. All of these properties allowed us to chart an
incremental path from a fully handwritten compiler backend to one with
all lowering logic in the DSL (in 27k lines of DSL code), with a
relatively low defect rate over the year-long migration and with
significant correctness improvements along the way.&lt;&#x2F;p&gt;
&lt;p&gt;The ISLE project was also a really interesting moment in my career
personally: it was &lt;em&gt;very much&lt;&#x2F;em&gt; a &quot;research project&quot;, in that it
required synthesis of existing approaches and careful thought about
the domain and requirements, and invention of slightly new takes on
old ideas in order to make the whole thing practical. At the same
time, there was quite a lot of work put into the &lt;em&gt;incrementalist&lt;&#x2F;em&gt; and
&lt;em&gt;pragmatic&lt;&#x2F;em&gt; aspect of the design (something I also &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;blog&#x2F;2022&#x2F;06&#x2F;09&#x2F;cranelift-regalloc2&#x2F;#compatibility-and-migration-path&quot;&gt;talk about in an
earlier post on
regalloc2&lt;&#x2F;a&gt;),
and a lot of care and feeding to a 12-month migration effort, curation
of our understanding of the language and how to use it, and nurturing
of ongoing ideas for improvement. I feel pretty fortunate to have (i)
been given the space to wrestle the problem space down to its
essential kernel and find a working design, and (ii) a cohort of
really great coworkers who ran with it and made it real. (Especially
thanks to my brilliant colleague &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;fitzgeraldnick.com&#x2F;&quot;&gt;Nick Fitzgerald
(@fitzgen)&lt;&#x2F;a&gt; who completed the Cranelift
integration of ISLE and who pioneered the idea of rewrite DSLs in
Cranelift with his
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;dba74024aa412f284871375db292c1bf9079d769&#x2F;cranelift&#x2F;peepmatic&quot;&gt;Peepmatic&lt;&#x2F;a&gt;
project, which inspired many parts of ISLE and primed the project for
this effort.) I wrote more about the benefits we&#x27;ve seen so far from
ISLE, and anticipate to see,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;articles&#x2F;cranelift-progress-2022#isle-dsl-for-backends-and-mid-end&quot;&gt;here&lt;&#x2F;a&gt;;
suffice it to say that we&#x27;re happy we&#x27;ve gone this way and we look
forward to additional results that this work has enabled.&lt;&#x2F;p&gt;
&lt;p&gt;This post covers material and repeats some arguments I made early in
ISLE&#x27;s
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;rfcs&#x2F;blob&#x2F;cranelift-isel-pre-rfc&#x2F;accepted&#x2F;cranelift-isel-pre-rfc.md&quot;&gt;pre-RFC&lt;&#x2F;a&gt;
and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;cranelift-isel-isle-peepmatic.md&quot;&gt;RFC&lt;&#x2F;a&gt;;
I recommend reading those documents as well for further background, if
desired. The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;isle&#x2F;docs&#x2F;language-reference.md&quot;&gt;ISLE language
reference&lt;&#x2F;a&gt;
is the canonical definition of the DSL.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s first go through some background on Cranelift, on compiler DSLs
in general, and motivate the case for a DSL in Cranelift; then we&#x27;ll
get into the details of ISLE proper.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;context-new-cranelift-backends-and-handwritten-code&quot;&gt;Context: New Cranelift Backends and Handwritten Code&lt;&#x2F;h2&gt;
&lt;p&gt;In the Cranelift project, starting in 2020, we developed a new
framework for the machine backends -- the part of the compiler that
takes the final optimized machine-independent &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Intermediate_representation&quot;&gt;IR (intermediate
representation)&lt;&#x2F;a&gt;
and converts it to instructions for the target instruction set,
allocates registers, lowers control flow, and emits machine code.  (I
described more details in an earlier post series:
&lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&#x2F;&quot;&gt;first&lt;&#x2F;a&gt;,
&lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;01&#x2F;22&#x2F;cranelift-isel-2&#x2F;&quot;&gt;second&lt;&#x2F;a&gt;,
&lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;03&#x2F;15&#x2F;cranelift-isel-3&#x2F;&quot;&gt;third&lt;&#x2F;a&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;Prior to introducing ISLE, we built three backends in this framework:
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;aarch64&quot;&gt;AArch64&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;x64&quot;&gt;x86-64&lt;&#x2F;a&gt;,
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;s390x&quot;&gt;s390x (IBM
Z)&lt;&#x2F;a&gt;. The
general experience was quite positive -- the simplicity that we aimed
to achieve by focusing on a &quot;straightforward handwritten lowering
pass&quot; design allowed us to quickly implement quite complete support
for WebAssembly (core and SIMD) and support other users of Cranelift
(like
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bjorn3&#x2F;rustc_codegen_cranelift&quot;&gt;cg_clif&lt;&#x2F;a&gt;). In
March 2021, we &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;2718&quot;&gt;made the new x86-64 backend the
default&lt;&#x2F;a&gt;, and
at the end of September 2021, we were able to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;3009&quot;&gt;remove the old x86
backend&lt;&#x2F;a&gt; with
its legacy, complex, and generally slower framework.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-need-for-a-dsl&quot;&gt;The Need for a DSL&lt;&#x2F;h2&gt;
&lt;p&gt;However, as with many engineering problems, there is a tradeoff point
in the design space. The simplicity of the &quot;just write the &lt;code&gt;match&lt;&#x2F;code&gt; on
the opcode and emit the instructions&quot;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;cranelift-isel-isle-peepmatic.md&quot;&gt;approach&lt;&#x2F;a&gt;
eventually became a downside: we had an
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;cranelift-isel-isle-peepmatic.md&quot;&gt;increasingly-deeply-nested&lt;&#x2F;a&gt;
lowering function with many conditions matching on types,
sub-expressions, and special cases, and keeping track of it all became
more and more difficult. Ultimately, we found we were running into
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;cfallin&#x2F;rfcs&#x2F;blob&#x2F;cranelift-isel-pre-rfc&#x2F;accepted&#x2F;cranelift-isel-pre-rfc.md#downsides-as-complexity-increases&quot;&gt;three major
problems&lt;&#x2F;a&gt;:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;It became very &lt;em&gt;tedious&lt;&#x2F;em&gt; to write what was essentially a longhand
form of a list of lowering patterns. If adding a special lowering
for a combination of IR operators requires understanding control in
handwritten code, and writing out the checks for special conditions
or searches for other combining operators by hand, then we are much
less likely to improve the compiler backends: the incentives
instead point toward keeping the code as simple and as minimalistic
as possible and discouraging change.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;It became difficult to refactor the code at all: the compiler
&quot;lowering API&quot; was ossifying as more and more handwritten backend
code came to depend on its subtle details, and refactors became
very hard or impossible. This was especially apparent with the
regalloc2 work.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;It became more and more difficult to maintain correctness, and
ensure that the backends were generating the code we
expected. While writing code against the lowering API, one had to
keep in mind the subtle correctness invariants for when one can
&quot;combine&quot; instructions, when one can &quot;sink&quot; an operation, and so
on; and the rules for how to use registers and temporaries
properly. Even when generating some kind of correct code, it was
easy to miss a corner of the state space and, say, omit a lowering
for a particular combination of input types, or skip an intended
optimized lowering and use a general one instead, in an accidental
and hard-to-reason-about way due to complex control flow. As we&#x27;ll
see below, a DSL allows us to solve both of these problems with (i)
principled strongly-typed abstractions in the DSL, and (ii)
&quot;overlap checks&quot;, respectively.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;It became clear that generating lowering patterns from some
meta-description would lead to overall clearer and more maintainable
compiler source, and would give us more flexibility if we wanted to
change any details of or optimize the translation, as well. Hence, our
realization that we probably needed a &lt;em&gt;domain-specific language&lt;&#x2F;em&gt; (DSL)
to generate this part of Cranelift.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;dsls-in-compilers-and-term-rewriting&quot;&gt;DSLs in Compilers and Term-Rewriting&lt;&#x2F;h2&gt;
&lt;p&gt;There is a long history of DSLs to specify compilers -- the
&quot;metacompiler&quot; or &quot;compiler-compiler&quot; concept -- going back to at
least the 1970s. The general idea of a DSL-based instruction selection
stage is to declaratively describe a list of &lt;em&gt;patterns&lt;&#x2F;em&gt; --
combinations of operators in the program -- and for each pattern, when
it matches, a series of instructions that can implement that
pattern. This makes it easier to reason about what the compiler is
doing, to modify and improve it, and to apply systematic optimizations
across the backend by changing how the DSL is used to generate the
compiler backend itself.&lt;&#x2F;p&gt;
&lt;p&gt;The &quot;pattern rewriting&quot; approach to a compiler backend can be seen as
a kind of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Rewriting#Term_rewriting_systems&quot;&gt;term-rewriting
system&lt;&#x2F;a&gt;:
that is, a formal framework in which rules operate on a data
representation (in this case, the program to be
compiled). Term-rewriting is an old idea: the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lambda_calculus&quot;&gt;lambda
calculus&lt;&#x2F;a&gt;, one of the
original mathematical models of computation, operates via
term-rewriting. It is general enough to be useful yet also concise and
expressive enough to be very productive when the goal is to transform
structured data.&lt;&#x2F;p&gt;
&lt;p&gt;One reason why term-rewriting is such a good fit for a compiler in
particular is that &quot;terms&quot;, or nodes in an AST, represent values in
the program being compiled; given this, a rewrite rule is an
expression of an &lt;em&gt;equivalence&lt;&#x2F;em&gt;. An &quot;integer addition&quot; operator in a
compiler IR is &lt;em&gt;equivalent&lt;&#x2F;em&gt; to (or produces an equivalent result to)
the integer addition instruction in a given CPU&#x27;s instruction set; so
we can replace one with the other. One might write this rule as
something like &lt;code&gt;(add x y) =&amp;gt; (x86_add x y)&lt;&#x2F;code&gt;, for example. Likewise,
many compiler optimizations can be expressed as rules, in a way that
is familiar to any student of algebra: for example, &lt;code&gt;x + 0 == x&lt;&#x2F;code&gt;, or
in an AST notation, &lt;code&gt;(add x 0) =&amp;gt; x&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
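&lt;p&gt;As a concrete (if toy) illustration -- this is a minimal sketch, not Cranelift&#x27;s actual IR -- the &lt;code&gt;(add x 0) =&amp;gt; x&lt;&#x2F;code&gt; rule can be written as a match on a small AST type:&lt;&#x2F;p&gt;

```rust
// A toy AST and a single rewrite rule applied at the root.
// Illustrative sketch only, not Cranelift's IR.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Const(i64),
    Add(Box<Expr>, Box<Expr>),
}

// (add x 0) => x: return the left operand when the right is zero.
fn simplify_add_zero(e: Expr) -> Expr {
    match e {
        Expr::Add(x, y) if *y == Expr::Const(0) => *x,
        other => other,
    }
}

fn main() {
    let e = Expr::Add(Box::new(Expr::Const(42)), Box::new(Expr::Const(0)));
    assert_eq!(simplify_add_zero(e), Expr::Const(42));
}
```

&lt;p&gt;The rule is an equivalence, so applying it preserves the value being computed while (here) shrinking the term.&lt;&#x2F;p&gt;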
&lt;p&gt;Examples of such pattern-rules abound in production compilers. For
example, in the Go compiler, a set of
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;golang&#x2F;go&#x2F;blob&#x2F;e870de9936a7efa42ac1915ff4ffb16017dbc819&#x2F;src&#x2F;cmd&#x2F;compile&#x2F;internal&#x2F;ssa&#x2F;_gen&#x2F;AMD64.rules&quot;&gt;rules&lt;&#x2F;a&gt;
define how the IR&#x27;s operators are converted into x86-64 instructions:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Lowering arithmetic&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(Add(64|32|16|8) ...) =&amp;gt; (ADD(Q|L|L|L) ...)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; combine add&#x2F;shift into LEAQ&#x2F;LEAL&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(ADD(L|Q) x (SHL(L|Q)const [3] y)) =&amp;gt; (LEA(L|Q)8 x y)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;&#x2F;&#x2F; Merge load and op&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;((ADD|SUB|AND|OR|XOR)Q x l:(MOVQload [off] {sym} ptr mem)) &amp;amp;&amp;amp;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    canMergeLoadClobber(v, l, x) &amp;amp;&amp;amp;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    clobber(l) =&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;((ADD|SUB|AND|OR|XOR)Qload x [off] {sym} ptr mem)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Similar kinds of descriptions exist in the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm-mirror&#x2F;llvm&#x2F;blob&#x2F;2c4ca6832fa6b306ee6a7010bfb80a3f2596f824&#x2F;lib&#x2F;Target&#x2F;X86&#x2F;X86InstrArithmetic.td&quot;&gt;LLVM x86
backend&lt;&#x2F;a&gt;
and the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;gcc-mirror&#x2F;gcc&#x2F;blob&#x2F;31ec203247413f150d5244198efd586fc6d2ef5e&#x2F;gcc&#x2F;config&#x2F;i386&#x2F;i386.md&quot;&gt;GCC x86
backend&lt;&#x2F;a&gt;,
using the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;llvm.org&#x2F;docs&#x2F;TableGen&#x2F;&quot;&gt;TableGen&lt;&#x2F;a&gt; language (LLVM)
and the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;gcc.gnu.org&#x2F;onlinedocs&#x2F;gcc-4.3.2&#x2F;gccint&#x2F;Machine-Desc.html#Machine-Desc&quot;&gt;Machine Description
DSL&lt;&#x2F;a&gt;
(gcc), respectively.&lt;&#x2F;p&gt;
&lt;p&gt;There is a large and well-explored design-space for auto-generated
compiler backends from such rules. A classical design is the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;BURS&quot;&gt;BURS&lt;&#x2F;a&gt; (bottom-up rewrite system)
technique. I won&#x27;t attempt a deeper introduction here; further
descriptions can be found in e.g. the Dragon Book&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; or Muchnick&#x27;s
textbook&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;. For this post, it suffices to know that these systems
find a &quot;covering&quot; of tree patterns such as the above over an input
expression tree.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;term-rewriting-for-lowering-maybe&quot;&gt;Term Rewriting for Lowering... Maybe?&lt;&#x2F;h2&gt;
&lt;p&gt;Given the above precedent -- several mainstream compilers adopting a
pattern-matching-based scheme, with clear benefits -- it would seem
that our path ahead is well-defined. Why, then, is there so much of
this post remaining? What more could be said?&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that three main problems arise when considering how to
adopt a typical term-rewriting scheme. First, there is a basic
question: do we actually reify the input tree as a tree? For example,
if we have a pattern&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(add x y) =&amp;gt; (x86_add x y)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;meaning that an &lt;code&gt;add&lt;&#x2F;code&gt; operator on two operands is lowered to an x86
&lt;code&gt;ADD&lt;&#x2F;code&gt; instruction, does that imply that our IR literally contains a
node for the &lt;code&gt;add&lt;&#x2F;code&gt;?&lt;&#x2F;p&gt;
&lt;p&gt;That may seem like a silly question to raise, as CLIF, Cranelift&#x27;s IR,
does indeed have an operator for &lt;code&gt;add&lt;&#x2F;code&gt; (actually &lt;code&gt;iadd&lt;&#x2F;code&gt; for &quot;integer
add&quot;) that takes two arguments.&lt;&#x2F;p&gt;
&lt;p&gt;But directly matching on a &quot;tree of operators&quot; implies several
properties of the IR that have deep impact. One of these properties is
that the &quot;value&quot; or &quot;result&quot; of the operator is its unique
identifier. In CLIF, this isn&#x27;t the case: each instruction has its own
identifier (&lt;code&gt;Inst&lt;&#x2F;code&gt;) and each &lt;code&gt;Inst&lt;&#x2F;code&gt; can have any number of &lt;code&gt;Value&lt;&#x2F;code&gt;
results. Another is that it implies that the tree that the matching
process &lt;em&gt;should&lt;&#x2F;em&gt; see is exactly what is in the IR. However, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;ff995d910b64ea7937ccfd982dd431b1487a1ec8&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;lower.rs#L1216-L1236&quot;&gt;quite a
few
considerations&lt;&#x2F;a&gt;
are involved when a backend wants to &quot;merge&quot; the handling of operators
by looking deeper into the tree. None of these impedance mismatches
are fatal to the approach, but they do imply extra work to build the
tree in memory &quot;as matched&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;Second, there is the problem of how to incorporate &lt;em&gt;additional
information&lt;&#x2F;em&gt; beyond the raw tree of operators. One could see this as a
question of &quot;side-tables&quot; or &quot;auxiliary information&quot;, or of supporting
various &quot;queries&quot; on the input tree.&lt;&#x2F;p&gt;
&lt;p&gt;For example, we might want to represent the ability to encode an
integer immediate in certain ISA-specific forms as a term. AArch64 has
several such forms: a &quot;regular&quot; 12-bit immediate, and a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dinfuehr.github.io&#x2F;blog&#x2F;encoding-of-immediate-values-on-aarch64&#x2F;&quot;&gt;rather clever
&quot;logical
immediate&quot;&lt;&#x2F;a&gt;
format designed to efficiently encode common kinds of bitmasks in only
13 bits. We might represent these with terms, and have rules that
translate &lt;code&gt;(iconst K)&lt;&#x2F;code&gt; into &lt;code&gt;(aarch64_imm12 bits)&lt;&#x2F;code&gt; or
&lt;code&gt;(aarch64_logicalimm bits)&lt;&#x2F;code&gt; and subsequent rules that match on these
terms to encode immediate-using instruction forms. The problem then
comes: how do we know which of these intermediate rewrites to do
before we attempt to match any instruction forms? Do we do both, and
represent both forms?&lt;&#x2F;p&gt;
&lt;p&gt;The net effect of this requirement is that the matching pattern for a
rewrite rule starts to look less like a tree of terms and more like a
sequence of custom queries. The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;golang&#x2F;go&#x2F;blob&#x2F;e870de9936a7efa42ac1915ff4ffb16017dbc819&#x2F;src&#x2F;cmd&#x2F;compile&#x2F;internal&#x2F;ssa&#x2F;_gen&#x2F;AMD64.rules&quot;&gt;Go compiler&#x27;s
rules&lt;&#x2F;a&gt;
add predicates to rewrite rules to handle these cases, but this is
awkward and makes the language harder to reason about. It would be
better if we could represent the ISA concepts at the term-rewrite
level as well.&lt;&#x2F;p&gt;
&lt;p&gt;Third, there is the question of how to interact with the rest of the
compiler as we make these queries on the input representation. In the
most straightforward implementation, a rewrite system has knowledge of
the &quot;tree nodes&quot; that terms in the pattern match and that terms in the
rewrite expression produce. But building the glue between the rewrite
system and the IR data structures may be nontrivial, especially if
custom queries (as above) are also involved.&lt;&#x2F;p&gt;
&lt;p&gt;All of this raises the question: is there a better way to think about
the execution semantics of the rewrite rules? What if the DSL were not
involved in ASTs or rewrites in a direct way at all?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;sequential-execution-semantics-and-external-extractors-constructors&quot;&gt;Sequential Execution Semantics and &quot;External Extractors&#x2F;Constructors&quot;&lt;&#x2F;h2&gt;
&lt;p&gt;At this point, I hit upon ISLE&#x27;s first key idea: what if all
interactions with the rest of the compiler were via &quot;virtual&quot; terms,
implemented by Rust functions? In other words, rather than build a
system that matches on literal AST data structures and rewrites or
produces new output ASTs, all of the pattern matching would &quot;bottom
out&quot; in a sort of FFI that would invoke the rest of the compiler, in
handwritten Rust. The DSL itself knows nothing about the rest of the
compiler, or &quot;ASTs&quot;, or any other IR-specific concept. (This is ISLE&#x27;s
main secret: it is not actually an instruction-selection DSL, but
rather (or at least aspirationally) a more general language.)&lt;&#x2F;p&gt;
&lt;h3 id=&quot;sequential-semantics-for-matching&quot;&gt;Sequential Semantics for Matching&lt;&#x2F;h3&gt;
&lt;p&gt;One could imagine a rule like &lt;code&gt;(iadd (imul a b) c) =&amp;gt; (aarch64_madd a b c)&lt;&#x2F;code&gt; to &quot;compile&quot; to a series of &quot;match operations&quot; like the
following invented operations for some matching engine:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$t0, $t1 := match_op $root, iadd&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$t2, $t3 := match_op $t0, imul&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;$t4 := create_op aarch64_madd, $t2, $t3, $t1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return $t4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In this &quot;matching VM&quot;, we execute &lt;code&gt;match_op&lt;&#x2F;code&gt; operations by trying to
unpack a tree node into its children (arguments), given the expected
operator. Any step in this sequence of match operators might &quot;fail&quot;,
which causes us to try the next rewrite rule instead. If we can match
the &lt;code&gt;iadd&lt;&#x2F;code&gt; from the input tree root, and the &lt;code&gt;imul&lt;&#x2F;code&gt; from its first
argument, then the compiled form of this rule builds the
&lt;code&gt;aarch64_madd&lt;&#x2F;code&gt; (&quot;multiply-add&quot;) term.&lt;&#x2F;p&gt;
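&lt;p&gt;In ordinary Rust, one could sketch this compiled form with fallible destructuring functions and the &lt;code&gt;?&lt;&#x2F;code&gt; operator standing in for match failure. This is a hypothetical sketch -- the node type and helper names are invented, not ISLE&#x27;s actual generated code:&lt;&#x2F;p&gt;

```rust
// A toy node type; matching "bottoms out" in fallible destructurers.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    IAdd(Box<Node>, Box<Node>),
    IMul(Box<Node>, Box<Node>),
    Madd(Box<Node>, Box<Node>, Box<Node>),
    Var(&'static str),
}

// match_op $x, iadd: unpack the node into its children, or fail.
fn match_iadd(n: &Node) -> Option<(&Node, &Node)> {
    if let Node::IAdd(a, b) = n { Some((&**a, &**b)) } else { None }
}

fn match_imul(n: &Node) -> Option<(&Node, &Node)> {
    if let Node::IMul(a, b) = n { Some((&**a, &**b)) } else { None }
}

// (iadd (imul a b) c) => (aarch64_madd a b c); `?` propagates failure
// so the caller can fall through to the next rule.
fn rule_madd(root: &Node) -> Option<Node> {
    let (t0, t1) = match_iadd(root)?; // $t0, $t1 := match_op $root, iadd
    let (t2, t3) = match_imul(t0)?;   // $t2, $t3 := match_op $t0, imul
    Some(Node::Madd(                  // create_op aarch64_madd
        Box::new(t2.clone()),
        Box::new(t3.clone()),
        Box::new(t1.clone()),
    ))
}

fn main() {
    let input = Node::IAdd(
        Box::new(Node::IMul(Box::new(Node::Var("a")), Box::new(Node::Var("b")))),
        Box::new(Node::Var("c")),
    );
    assert!(rule_madd(&input).is_some());
    // A plain add does not match; the engine would try the next rule.
    let plain = Node::IAdd(Box::new(Node::Var("a")), Box::new(Node::Var("b")));
    assert!(rule_madd(&plain).is_none());
}
```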
&lt;h3 id=&quot;programmable-matching&quot;&gt;Programmable Matching&lt;&#x2F;h3&gt;
&lt;p&gt;Rather than a fixed set of operators like &lt;code&gt;match_op&lt;&#x2F;code&gt;, what if we
allowed for environment-defined operators? What if operators like the
&lt;code&gt;aarch64_logicalimm&lt;&#x2F;code&gt; above were &quot;match operators&quot; as well, such that
they &quot;matched&quot; if the given &lt;code&gt;u64&lt;&#x2F;code&gt; could be encoded in the desired form
and failed to match otherwise?&lt;&#x2F;p&gt;
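&lt;p&gt;For instance, a fallible match operator for AArch64&#x27;s 12-bit arithmetic immediate (which the ISA allows to be optionally left-shifted by 12) might look like the following -- a hedged sketch with an invented name, not Cranelift&#x27;s actual helper:&lt;&#x2F;p&gt;

```rust
// An environment-defined match operator: succeeds (returning the
// encoded form) only if `value` fits AArch64's add/sub immediate
// format: a 12-bit value, optionally shifted left by 12.
fn match_imm12(value: u64) -> Option<(u16, bool)> {
    if value < (1 << 12) {
        Some((value as u16, false)) // unshifted imm12
    } else if (value & 0xfff) == 0 && (value >> 12) < (1 << 12) {
        Some(((value >> 12) as u16, true)) // imm12, LSL #12
    } else {
        None // no match: the engine falls through to the next rule
    }
}

fn main() {
    assert_eq!(match_imm12(42), Some((42, false)));
    assert_eq!(match_imm12(0x5000), Some((5, true)));
    assert_eq!(match_imm12(0x1001), None);
}
```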
&lt;p&gt;This is the essence of the &quot;external extractor&quot; idea (and the dual to
it, the &quot;external constructor&quot;) in ISLE. Once we allow user-defined
operators for the left-hand side (&quot;matching pattern&quot;) of a rule, we
actually no longer need any built-in notion of AST matching at all;
this becomes just another thing we can define in the &quot;standard
library&quot; or &quot;prelude&quot; of our DSL!&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The basic idea is to introduce the ability to &lt;em&gt;define&lt;&#x2F;em&gt; a term like
&lt;code&gt;iadd&lt;&#x2F;code&gt; or &lt;code&gt;imul&lt;&#x2F;code&gt; and associate an external Rust function with it. When
appearing on the left-hand side of a rewrite rule, terms match
&quot;outside in&quot;: that is, &lt;code&gt;(iadd (imul a b) c)&lt;&#x2F;code&gt; takes the root of the
input, tries to use &lt;code&gt;iadd&lt;&#x2F;code&gt; to destructure it to two arguments, and
tries to use &lt;code&gt;imul&lt;&#x2F;code&gt; to destructure it further. (This outside-in,
reversed order is the opposite of what one might expect if this were a
tree of function calls, because we are &lt;em&gt;destructuring&lt;&#x2F;em&gt; (extracting)
rather than &lt;em&gt;constructing&lt;&#x2F;em&gt;. We&#x27;ll explore the analogy to functions
more below.)&lt;&#x2F;p&gt;
&lt;p&gt;So, skipping straight to the real ISLE syntax now, we can define the
term like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(decl (iadd Value Value) Value)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and then associate a Rust function with it like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(extern extractor iadd my_iadd_impl)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and this implies the existence of a Rust function that the generated
matching code will invoke:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; my_iadd_impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage z-variable z-language&quot;&gt;&amp;amp;mut self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; input&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Option&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&amp;gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Likewise, we can define a term to be used on the right-hand side and
associate an implementation like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(decl (aarch64_madd Value Value Value) Value)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(extern constructor aarch64_madd my_madd_impl)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;with the Rust function&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;fn&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; my_madd_impl&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator z-storage z-variable z-language&quot;&gt;&amp;amp;mut self&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; a&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; b&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; c&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;)&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; -&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Value&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and then use it like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(rule (iadd (imul a b) c)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      (aarch64_madd a b c))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and the generated code will invoke the &lt;em&gt;external extractor&lt;&#x2F;em&gt;
&lt;code&gt;my_iadd_impl&lt;&#x2F;code&gt;; if it returns &lt;code&gt;Some&lt;&#x2F;code&gt; (a match), it will invoke whatever
external extractor is associated with &lt;code&gt;imul&lt;&#x2F;code&gt;; and if that also returns
&lt;code&gt;Some&lt;&#x2F;code&gt;, it will invoke &lt;code&gt;aarch64_madd&lt;&#x2F;code&gt; to &quot;construct&quot; the result. These
Rust functions can do &lt;em&gt;whatever they like&lt;&#x2F;em&gt;: we have abstracted away
the need to actually query a reified AST and mutate it.&lt;&#x2F;p&gt;
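To make this control flow concrete, here is a sketch of the kind of straight-line matching code the ISLE compiler could generate for this rule, run against a toy term store. All of the types and helper names here are illustrative stand-ins, not Cranelift's actual definitions.

```rust
// Toy term store; a Value is an index into `nodes`. Names are illustrative.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Value(usize);

enum Node {
    Leaf,
    IAdd(Value, Value),
    IMul(Value, Value),
    Madd(Value, Value, Value),
}

struct Ctx {
    nodes: Vec<Node>,
}

impl Ctx {
    // External extractor for `iadd`: Some((a, b)) iff `v` is an iadd node.
    fn my_iadd_impl(&mut self, v: Value) -> Option<(Value, Value)> {
        match &self.nodes[v.0] {
            Node::IAdd(a, b) => Some((*a, *b)),
            _ => None,
        }
    }

    // External extractor for `imul`.
    fn my_imul_impl(&mut self, v: Value) -> Option<(Value, Value)> {
        match &self.nodes[v.0] {
            Node::IMul(a, b) => Some((*a, *b)),
            _ => None,
        }
    }

    // External constructor for `aarch64_madd`.
    fn my_madd_impl(&mut self, a: Value, b: Value, c: Value) -> Value {
        self.nodes.push(Node::Madd(a, b, c));
        Value(self.nodes.len() - 1)
    }

    // What the compiled rule looks like: a chain of extractor calls that
    // falls through to `None` on the first failed match.
    fn lower(&mut self, root: Value) -> Option<Value> {
        if let Some((lhs, c)) = self.my_iadd_impl(root) {
            if let Some((a, b)) = self.my_imul_impl(lhs) {
                return Some(self.my_madd_impl(a, b, c));
            }
        }
        None
    }
}
```

Note that the generated code never walks a generic AST: it just calls the extractor functions in sequence and bails out on the first `None`.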
&lt;h3 id=&quot;extractors-and-the-execution-driven-view&quot;&gt;Extractors and the Execution-Driven View&lt;&#x2F;h3&gt;
&lt;p&gt;Another important consequence of this design is that we can define
&lt;em&gt;arbitrary&lt;&#x2F;em&gt; extractors and constructors, and they can have &lt;em&gt;arbitrary&lt;&#x2F;em&gt;
types. (ISLE is strongly-typed, with sum types that lower 1:1 to Rust
enums.) This neatly addresses the &quot;metadata or side-table&quot; question
above: we don&#x27;t need to generate auxiliary nodes in a real AST to
represent information about a value, and we don&#x27;t need to know when to
compute them; such computations can be driven by demand on the
matching side, and don&#x27;t need to reify any actual node.&lt;&#x2F;p&gt;
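As a small illustration of this demand-driven matching, an extractor can compute a derived fact about a value at match time, without any auxiliary AST node ever existing. The context and extractor name below are hypothetical, not Cranelift's:

```rust
use std::collections::HashMap;

// Hypothetical context: known constant values for some SSA values.
struct Ctx {
    constants: HashMap<u32, u64>,
}

impl Ctx {
    // A hypothetical extractor: matches only values known to be constant
    // powers of two, producing the shift amount. The "is a power of two"
    // fact is computed on demand; no metadata node is ever materialized.
    fn shift_amt_from_value(&mut self, v: u32) -> Option<u32> {
        let k = *self.constants.get(&v)?;
        if k.is_power_of_two() {
            Some(k.trailing_zeros())
        } else {
            None
        }
    }
}
```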
&lt;p&gt;Let&#x27;s take a step back and understand what we have done here. We have
taken a rule that expresses a tree rewrite -- like &lt;code&gt;(iadd (imul a b) c) =&amp;gt; (madd a b c)&lt;&#x2F;code&gt; -- and allowed for the terms in the left-hand side
and right-hand side to compile to Rust function calls, with
well-defined semantics. The DSL itself deals only with
pattern-matching; ASTs and compiler IRs are wholly outside of its
understanding. Nevertheless, if we ascribe formal semantics to &lt;code&gt;iadd&lt;&#x2F;code&gt;,
&lt;code&gt;imul&lt;&#x2F;code&gt; and &lt;code&gt;madd&lt;&#x2F;code&gt;, we can still reason about the rewrite at the
term-rewriting system level, and this is essential to formal
verification efforts for our compiler backends (see below!). We have
thus allowed for integration with existing, handwritten Rust code
while raising the abstraction level to allow for more declarative
reasoning.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#4&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s worth dwelling for a moment on the shift from an explicit AST
data structure, traversed and built by a rule-matching engine, to
calls to external extractors and constructors. As a consequence of
this shift, the term nodes corresponding to extractor matches need not
ever actually exist in memory. The ISLE rewrite flow thus works
something like the &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;wiki.c2.com&#x2F;?VisitorPattern&quot;&gt;visitor
pattern&lt;&#x2F;a&gt;&quot;: it introduces a level
of abstraction that decouples the consumption and production of data
from its representation, allowing more flexibility.&lt;&#x2F;p&gt;
&lt;p&gt;The execution-driven view of term rewriting gives us our rewrite
procedure for free, as well: rather than an engine that takes some
specification of patterns and rewrites, we compile rules to sequential
code that invokes extractors and constructors. &quot;Rewriting&quot; a top-level
term is equivalent to invoking a Rust function. If we define a term,
&lt;code&gt;lower&lt;&#x2F;code&gt;, corresponding to instruction lowering, then &lt;code&gt;(lower (iadd ...))&lt;&#x2F;code&gt; is a term that will be rewritten to whatever ISA-specific terms
the rules specify. This rewriting is done by a &lt;em&gt;Rust function&lt;&#x2F;em&gt; that
&lt;em&gt;implements&lt;&#x2F;em&gt; &lt;code&gt;lower&lt;&#x2F;code&gt;; we invoke it with the term&#x27;s arguments, and the
body will match on extractors as needed, then invoke constructors to
build the rewritten expression.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;explicit-types-and-implicit-conversions-in-isle&quot;&gt;Explicit Types (and Implicit Conversions) in ISLE&lt;&#x2F;h2&gt;
&lt;p&gt;The second key differentiator in ISLE as compared to most other
term-rewriting systems is its &lt;em&gt;strong type system&lt;&#x2F;em&gt;. It might not be
too surprising that a DSL that compiles to Rust would incorporate a
type system that mirrors Rust&#x27;s type system to some degree, e.g. with
sum types (enum variants). But this is actually a bit of a departure
from conventional compiler-backend rule systems, and allows
significant expressivity and safe-encapsulation wins, as we&#x27;ll see
below.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;why-types&quot;&gt;Why Types?&lt;&#x2F;h3&gt;
&lt;p&gt;A conventional rewrite system operates on an AST whose nodes all
represent values in the program. In other words, every term has the
same type (at the DSL level): we can replace &lt;code&gt;iadd&lt;&#x2F;code&gt; with &lt;code&gt;x86_add&lt;&#x2F;code&gt;
because both have type &lt;code&gt;Value&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;While this works fine for simple substitutions, it quickly breaks down
when various ISA complexities are considered. For example, how do we
model an addressing mode? We might wish to have a node &lt;code&gt;x86_add&lt;&#x2F;code&gt; that
accepts a &quot;register or memory&quot; operand, as x86 allows; and if a memory
operand, the memory address can have one of several different forms
(&quot;addressing modes&quot;): &lt;code&gt;[reg]&lt;&#x2F;code&gt;, &lt;code&gt;[reg + offset]&lt;&#x2F;code&gt;, &lt;code&gt;[reg + reg + offset]&lt;&#x2F;code&gt;, &lt;code&gt;[reg + scale*reg + offset]&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We could impose some ad-hoc structure on the AST in order to model
this: for example, an &lt;code&gt;x86_add&lt;&#x2F;code&gt; with an &lt;code&gt;x86_load&lt;&#x2F;code&gt; in its second
argument (or alternately, a separate opcode &lt;code&gt;x86_add_from_memory&lt;&#x2F;code&gt;)
could represent this case. Then we would need to have rules for the
address expression: if another &lt;code&gt;x86_add&lt;&#x2F;code&gt; node (but only with register
arguments!), we could absorb that into the instruction&#x27;s addressing
mode.&lt;&#x2F;p&gt;
&lt;p&gt;Ad-hoc structure like this is fragile, though, especially when
transformed by optimization passes. Moreover, as a general guideline,
the more we can put program invariants into the type system (at any
level), the more likely we are to be able to maintain the structure
across refactors or unexpected interactions.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;types-in-isle&quot;&gt;Types in ISLE&lt;&#x2F;h3&gt;
&lt;p&gt;ISLE thus allows terms to have types that resemble function types. One
can define a term&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(decl lower_amode (Value) AMode)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;that takes one argument, a &lt;code&gt;Value&lt;&#x2F;code&gt;, and produces an &lt;code&gt;AMode&lt;&#x2F;code&gt;. This type
can then be defined as&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-storage&quot;&gt;type&lt;&#x2F;span&gt;&lt;span&gt; AMode (enum (Reg (reg Reg))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  (RegReg (ra Reg)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                          (rb Reg))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  (RegOffset (base Reg)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                             (offset u32))))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and so on. The &lt;code&gt;AMode&lt;&#x2F;code&gt; type compiles to an enum in Rust with the specified
enum variants, making interop with Rust code (in external extractors
and constructors) via these rich types straightforward. In our machine
backends, machine instructions are defined as enums and constructed
directly in the ISLE.&lt;&#x2F;p&gt;
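For concreteness, the `AMode` declaration above corresponds to a Rust enum along these lines (with `Reg` simplified to a plain index here; the real register type is richer):

```rust
// Stand-in for the real register type.
type Reg = u8;

// Roughly what the ISLE `(type AMode (enum ...))` declaration lowers to.
#[derive(Debug, PartialEq)]
enum AMode {
    Reg { reg: Reg },
    RegReg { ra: Reg, rb: Reg },
    RegOffset { base: Reg, offset: u32 },
}
```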
&lt;h3 id=&quot;typed-terms-as-functions&quot;&gt;Typed Terms as Functions&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we have seen how to give a &quot;signature&quot; to a term, it&#x27;s worth
discussing how one can see terms -- both extractors and constructors
-- as functions, albeit in opposite directions. In particular, with a term&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(decl F (A B C) R)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;that has arguments (or AST child nodes) of types &lt;code&gt;A, B, C&lt;&#x2F;code&gt;, with a
type &lt;code&gt;R&lt;&#x2F;code&gt; for the term itself, one can see:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;F&lt;&#x2F;code&gt; as a function &lt;em&gt;from&lt;&#x2F;em&gt; &lt;code&gt;A, B, C&lt;&#x2F;code&gt; &lt;em&gt;to&lt;&#x2F;em&gt; &lt;code&gt;R&lt;&#x2F;code&gt;, when used as a
constructor; or&lt;&#x2F;li&gt;
&lt;li&gt;&lt;code&gt;F&lt;&#x2F;code&gt; as a function &lt;em&gt;from&lt;&#x2F;em&gt; &lt;code&gt;R&lt;&#x2F;code&gt; &lt;em&gt;to&lt;&#x2F;em&gt; &lt;code&gt;A, B, C&lt;&#x2F;code&gt;, when used as an
extractor.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;In other words, given a tree of terms in a pattern (left-hand side),
where terms are interpreted as extractors, we can see each term as a
function invocation from the &quot;outer&quot; type (the thing being
destructured) to the &quot;inner&quot; types (the pieces that are the result of
the destructuring). Conversely, given a tree of terms in an expression
(right-hand side), we can see each term as a function from the &quot;inner&quot;
types (the arguments of the new thing being constructed) to the
&quot;outer&quot; type (the return value). This is another way of seeing the
&quot;execution-driven&quot; view of ISLE semantics described above.&lt;&#x2F;p&gt;
&lt;p&gt;Note that the extractor form of &lt;code&gt;F&lt;&#x2F;code&gt; above is ordinarily a &lt;em&gt;partial&lt;&#x2F;em&gt;
function -- that is, it is allowed to have &lt;em&gt;no&lt;&#x2F;em&gt; mapping for a particular
value. This is how we formally think about the &quot;doesn&#x27;t match&quot; case
when searching for a particular kind of node in an AST, or any other
matcher on the left-hand side of a pattern. In contrast, the constructor
is normally &lt;em&gt;total&lt;&#x2F;em&gt; -- cannot fail -- unless explicitly declared to be
partial. (Partial constructors are useful for &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;isle-extended-patterns.md&quot;&gt;&lt;code&gt;if-let&lt;&#x2F;code&gt;
clauses&lt;&#x2F;a&gt;.)&lt;&#x2F;p&gt;
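With placeholder types, the two readings of `(decl F (A B C) R)` can be written out as Rust signatures: the constructor is a total function, and the extractor is a partial one.

```rust
// Placeholder argument types for the term (decl F (A B C) R).
type A = u32;
type B = bool;
type C = char;

// Values of type R may or may not have been built by F, so extraction can fail.
#[derive(Debug, PartialEq)]
enum R {
    F(A, B, C),
    Other,
}

// Constructor reading: a total function from (A, B, C) to R.
fn construct_f(a: A, b: B, c: C) -> R {
    R::F(a, b, c)
}

// Extractor reading: a partial function from R back to (A, B, C).
fn extract_f(r: &R) -> Option<(A, B, C)> {
    match r {
        R::F(a, b, c) => Some((*a, *b, *c)),
        R::Other => None,
    }
}
```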
&lt;h3 id=&quot;types-for-invariants&quot;&gt;Types for Invariants&lt;&#x2F;h3&gt;
&lt;p&gt;Support for arbitrary types lets us much more richly capture
invariants as well, by encapsulating values (on the input side) or
machine instructions (on the output side) so that they can only be
used or combined in legal ways. For example:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Many instruction-set architectures have a &quot;flags&quot; register that is
set by certain operations with bits that correspond to conditions
(result was zero, result was negative, etc.) and used by conditional
branches and conditional moves. This is &quot;global&quot; or &quot;ambient&quot; state
and one has to be careful to use the flags after computing them,
without overwriting them in the meantime. To ensure exact
correspondence of a particular flag-producer and flag-consumer,
certain instruction constructors create &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;0c615365c645776c1abf42be89edc1d5292163c8&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;prelude_lower.isle#L319-L358&quot;&gt;&lt;code&gt;ProducesFlags&lt;&#x2F;code&gt; and
&lt;code&gt;ConsumesFlags&lt;&#x2F;code&gt;
values&lt;&#x2F;a&gt;
rather than raw instructions. These can then be emitted &lt;em&gt;together&lt;&#x2F;em&gt;,
with no clobber in the middle, with the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;0c615365c645776c1abf42be89edc1d5292163c8&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;prelude_lower.isle#L389-L399&quot;&gt;&lt;code&gt;with_flags&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
constructor.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;There is a distinction between an IR-level &lt;code&gt;Value&lt;&#x2F;code&gt; and a
machine-level value in a register, which we denote with &lt;code&gt;Reg&lt;&#x2F;code&gt;. When
a lowering rule requires an input to be in a register, it can use
the &lt;code&gt;put_in_reg&lt;&#x2F;code&gt; constructor, which takes a &lt;code&gt;Value&lt;&#x2F;code&gt; and produces
(rewrites to) a &lt;code&gt;Reg&lt;&#x2F;code&gt;. This provides a way for us to do bookkeeping
(note that the value was used, and ensure that we codegen its
producer), but also allows us to distinguish &lt;em&gt;how&lt;&#x2F;em&gt; to place the
value in the register: one may wish to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;0c615365c645776c1abf42be89edc1d5292163c8&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;aarch64&#x2F;inst.isle#L2780-L2812&quot;&gt;sign- or
zero-extend&lt;&#x2F;a&gt;
the value.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;There is a distinction between an IR-level &lt;code&gt;Value&lt;&#x2F;code&gt; and the
instruction that produces it. Not every &lt;code&gt;Value&lt;&#x2F;code&gt; is defined by an
instruction; some are &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Static_single-assignment_form#Block_arguments&quot;&gt;block
parameters&lt;&#x2F;a&gt;. Furthermore,
at a given point in the lowering process, we may not be &lt;em&gt;allowed&lt;&#x2F;em&gt; to
see the producer of a value, if we cannot &quot;sink&quot; its effect (merge
it into an instruction generated at the current point). Thus,
instructions have &lt;code&gt;Value&lt;&#x2F;code&gt;s as operands, but one goes from a &lt;code&gt;Value&lt;&#x2F;code&gt;
to an &lt;code&gt;Inst&lt;&#x2F;code&gt; (instruction ID) with
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;0c615365c645776c1abf42be89edc1d5292163c8&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;prelude_lower.isle#L220-L222&quot;&gt;&lt;code&gt;def_inst&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which may or may not match depending on whether there is an &lt;code&gt;Inst&lt;&#x2F;code&gt;
and we can see&#x2F;merge it.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
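The flags case in the first bullet can be sketched in a few lines of Rust; the types and names below are simplified stand-ins for the linked `ProducesFlags`/`ConsumesFlags` machinery, not the real definitions:

```rust
// Simplified stand-ins for Cranelift's flag-safety wrapper types.
struct Inst(&'static str);
struct ProducesFlags { inst: Inst }
struct ConsumesFlags { inst: Inst }

// Because rules build ProducesFlags/ConsumesFlags values instead of emitting
// raw instructions, the only way to get both into the instruction stream is a
// combinator like this, which pins them back-to-back with nothing in between
// that could clobber the flags.
fn with_flags(p: ProducesFlags, c: ConsumesFlags, out: &mut Vec<&'static str>) {
    out.push(p.inst.0);
    out.push(c.inst.0);
}
```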
&lt;h3 id=&quot;implicit-conversions&quot;&gt;Implicit Conversions&lt;&#x2F;h3&gt;
&lt;p&gt;Type-safe abstractions allow for well-defined and safe interfaces, but
can lead to verbose code. After several months of experience with
ISLE, we were finding that we wrote rules like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(rule (lower (iadd (def_inst (imul x y)) z))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (madd&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   (put_in_reg x)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   (put_in_reg y)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   (put_in_reg z)))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;when we would prefer to write more natural rules like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(rule (lower (iadd (imul x y) z))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      (madd x y z))&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;as in our original term-rewriting examples above. At first it seemed
we had to choose one or the other: the shorter form would require
abandoning some of the type distinctions we were making. But in
actuality, there is some redundancy. Consider: when typechecking the
left-hand side pattern, we know that &lt;code&gt;iadd&lt;&#x2F;code&gt;&#x27;s arguments (which the
extractor will produce, if it can destructure the &lt;code&gt;Inst&lt;&#x2F;code&gt; type) have
type &lt;code&gt;Value&lt;&#x2F;code&gt;, but the inner &lt;code&gt;imul&lt;&#x2F;code&gt; expects an &lt;code&gt;Inst&lt;&#x2F;code&gt;. In our prelude
we have one canonical term that converts from one to the other:
&lt;code&gt;def_inst&lt;&#x2F;code&gt;. Likewise, in the right-hand side, bindings &lt;code&gt;x&lt;&#x2F;code&gt;, &lt;code&gt;y&lt;&#x2F;code&gt; and
&lt;code&gt;z&lt;&#x2F;code&gt; have type &lt;code&gt;Value&lt;&#x2F;code&gt;, but the &lt;code&gt;madd&lt;&#x2F;code&gt; constructor that builds a
machine instruction requires &lt;code&gt;Reg&lt;&#x2F;code&gt; types (which are virtual registers
in pre-regalloc machine code). We likewise have one term,
&lt;code&gt;put_in_reg&lt;&#x2F;code&gt;, that can do this conversion. If there is only &lt;em&gt;one&lt;&#x2F;em&gt;
canonical way, or usual way, to make the conversion, and if the types
on &lt;em&gt;both&lt;&#x2F;em&gt; sides of the conversion are already known, why can&#x27;t we
rectify the types by inserting necessary conversions automatically?&lt;&#x2F;p&gt;
&lt;p&gt;ISLE thus has one final trick up its sleeve to improve ease-of-use:
implicit conversions. By specifying converter terms for pairs of types
like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;common-lisp&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(convert Inst Value def_inst)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;(convert Value Reg put_in_reg)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;the typechecking pass can expand the pattern and rewrite-expression
ASTs as necessary. This makes writing lowering rules much more natural
with less boilerplate, and we have a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;0c615365c645776c1abf42be89edc1d5292163c8&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;prelude_lower.isle#L712-L723&quot;&gt;fairly rich set of implicit
conversions&lt;&#x2F;a&gt;
defined in our prelude to facilitate this.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;putting-it-all-together-ast-patterns-in-isle&quot;&gt;Putting it All Together: AST Patterns in ISLE&lt;&#x2F;h2&gt;
&lt;p&gt;Now that we&#x27;ve taken a tour of the various features of ISLE, let&#x27;s
review what we have built. We have started with a desire to express
high-level rewrite rules that equate AST nodes -- such as &lt;code&gt;(iadd (imul x y) z)&lt;&#x2F;code&gt; and &lt;code&gt;(madd x y z)&lt;&#x2F;code&gt; -- and to have a system that performs such
rewrites, in a way that interoperates with the existing compiler
infrastructure and has predictable and comprehensible behavior.&lt;&#x2F;p&gt;
&lt;p&gt;ISLE allows high-level patterns to be compiled to straightforward Rust
pattern-matching code. A strong type system with sum types (enums)
ensures that terms in patterns and rewrites match the expected schema,
and allows for expressing high-level invariants. Implicit conversions
leverage these types to remove redundancy in the patterns, allowing
for more natural high-level forms while retaining the useful
type-level distinctions. The ability to arbitrarily define
&quot;extractors&quot; to use in patterns allows us to build up a rich
pattern-language in our prelude, matching trees of operators, values,
and pieces of the input program with various properties in a
programmable way. And the well-defined mapping to Rust and an
execution scheme that maps terms directly to function calls, rather
than an incrementally-rewritten AST, allows for code as efficient as
what one would write by hand.&lt;&#x2F;p&gt;
&lt;p&gt;We have thus gone from &lt;code&gt;(iadd (imul x y) z) =&amp;gt; (madd x y z)&lt;&#x2F;code&gt; to
something like:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Match the input &lt;code&gt;Inst&lt;&#x2F;code&gt; against the &lt;code&gt;iadd&lt;&#x2F;code&gt; enum variant, getting
argument &lt;code&gt;Value&lt;&#x2F;code&gt;s if so;&lt;&#x2F;li&gt;
&lt;li&gt;Get the &lt;code&gt;Inst&lt;&#x2F;code&gt; that produced the first &lt;code&gt;Value&lt;&#x2F;code&gt;, if we&#x27;re allowed to
merge it, via &lt;code&gt;def_inst&lt;&#x2F;code&gt;;&lt;&#x2F;li&gt;
&lt;li&gt;Match that &lt;code&gt;Inst&lt;&#x2F;code&gt; against the &lt;code&gt;imul&lt;&#x2F;code&gt; enum variant, getting its
argument &lt;code&gt;Value&lt;&#x2F;code&gt;s if so;&lt;&#x2F;li&gt;
&lt;li&gt;Put all three remaining &lt;code&gt;Value&lt;&#x2F;code&gt;s in registers with calls to
&lt;code&gt;put_in_reg&lt;&#x2F;code&gt;; and&lt;&#x2F;li&gt;
&lt;li&gt;Emit an &lt;code&gt;madd&lt;&#x2F;code&gt; instruction with these registers&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;all &lt;em&gt;via bindings defined in ISLE itself&lt;&#x2F;em&gt; and &lt;em&gt;without any knowledge
of &quot;instruction lowering&quot; in the ISLE DSL compiler&lt;&#x2F;em&gt;. As concrete
evidence that the last point is valuable, we have been able to use
ISLE for CLIF-to-CLIF rewrite rules in our new mid-end optimization
framework as well, simply by defining a new prelude (see below!).&lt;&#x2F;p&gt;
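The bullet steps above can be sketched as the kind of straight-line Rust that the ISLE compiler emits. The toy IR, the "mergeable definition" side-table, and every name below are illustrative stand-ins:

```rust
use std::collections::HashMap;

type Value = u32;
type Reg = u32;

#[derive(Clone, Copy)]
enum Inst {
    IAdd(Value, Value),
    IMul(Value, Value),
}

#[derive(Debug, PartialEq)]
enum MachInst {
    Madd(Reg, Reg, Reg),
}

struct LowerCtx {
    // Producing instruction for a value, present only if we may merge it.
    mergeable_def: HashMap<Value, Inst>,
    regs: HashMap<Value, Reg>,
    next_reg: Reg,
}

impl LowerCtx {
    fn def_inst(&self, v: Value) -> Option<Inst> {
        self.mergeable_def.get(&v).copied()
    }

    // Assign (or reuse) a virtual register for a value.
    fn put_in_reg(&mut self, v: Value) -> Reg {
        if let Some(&r) = self.regs.get(&v) {
            return r;
        }
        let r = self.next_reg;
        self.next_reg += 1;
        self.regs.insert(v, r);
        r
    }

    // The five bullet steps, in order.
    fn lower(&mut self, inst: Inst) -> Option<MachInst> {
        if let Inst::IAdd(lhs, z) = inst {                       // match iadd
            if let Some(Inst::IMul(x, y)) = self.def_inst(lhs) { // def_inst, then imul
                let rx = self.put_in_reg(x);                     // values -> registers
                let ry = self.put_in_reg(y);
                let rz = self.put_in_reg(z);
                return Some(MachInst::Madd(rx, ry, rz));         // emit madd
            }
        }
        None
    }
}
```

If the producer of the first operand is not visible (a block parameter, or an unmergeable effectful instruction), `def_inst` returns `None` and the rule simply does not fire.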
&lt;h2 id=&quot;ongoing-and-future-benefits-of-declarative-rules&quot;&gt;Ongoing and Future Benefits of Declarative Rules&lt;&#x2F;h2&gt;
&lt;p&gt;The next most exciting thing about writing lowering rules
declaratively -- after the expressivity and productivity win while
developing the compiler itself -- is that being able to reason about
the rules &lt;em&gt;as data&lt;&#x2F;em&gt; lets us analyze them in various ways and check
that they satisfy desirable properties.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;correctness-and-formal-verification&quot;&gt;Correctness and Formal Verification&lt;&#x2F;h3&gt;
&lt;p&gt;As an example, during the development of our rulesets, we found that
it was sometimes unintuitive which rule would fire first. We have a
priority mechanism to allow this to be controlled in an arbitrary way,
but the default heuristics (roughly, &quot;more specific rule first&quot;) were
sometimes counter-intuitive.&lt;&#x2F;p&gt;
&lt;p&gt;We thus invented the idea of an &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;4906&quot;&gt;overlap
checker&lt;&#x2F;a&gt;,
initially implemented by my brilliant colleague &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.linkedin.com&#x2F;in&#x2F;elliotttrevor&#x2F;&quot;&gt;Trevor
Elliott&lt;&#x2F;a&gt; and subsequently
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5195&quot;&gt;redesigned with a new internal representation and
algorithm&lt;&#x2F;a&gt; by
my other brilliant colleague &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;jamey.thesharps.us&#x2F;&quot;&gt;Jamey
Sharp&lt;&#x2F;a&gt;. The key idea is to define &quot;rule
overlap&quot; such that two rules overlap if a given input to the
pattern-matching could cause either rule to fire. In these (and only
these) cases, priority and&#x2F;or default ordering heuristics determine
which rule actually does fire. Then we decided that in such cases, we
would &lt;em&gt;require&lt;&#x2F;em&gt; the ISLE author to use the priority mechanism to
explicitly choose one of the rules. In other words, no two overlapping
rules can have the same priority. Through a series of PRs to fix up
our existing rules, we were able to actually find several cases where
rules were fully &quot;shadowed&quot;, or unreachable because some other more
general rule was always firing first. We turned on enforcing mode for
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5011&quot;&gt;non-overlap among same-priority
rules&lt;&#x2F;a&gt; and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5322&quot;&gt;non-shadowing of rules by higher-priority
rules&lt;&#x2F;a&gt; after
fixing several cases, and as a result, we now have more confidence in
the correctness of our rulesets.&lt;&#x2F;p&gt;
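As a tiny illustration of overlap (with invented term names, not real Cranelift rules): both of the following rules can fire on an `iadd` whose second operand is a zero constant, so under the checker the author must disambiguate, for example with an explicit priority on the more specific rule:

```lisp
;; Both rules match (iadd x (iconst 0)), so they overlap; by default,
;; heuristics would pick the more specific one.
(rule (lower (iadd x y))
      (x86_add x y))
(rule (lower (iadd x (iconst 0)))
      (x86_mov x))

;; With non-overlap enforced at equal priority, the tie must be broken
;; explicitly, e.g. by raising the specific rule's priority:
(rule 1 (lower (iadd x (iconst 0)))
      (x86_mov x))
```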
&lt;p&gt;On a broader scale, writing rules as equivalences from one AST to
another lets us verify that the two sides are, well, actually
equivalent! There is an ongoing collaboration with some
formal-verification researchers who are adding &lt;em&gt;annotations&lt;&#x2F;em&gt; to
external extractors and constructors that describe their semantics
(mostly in terms of an SMT checker&#x27;s theory-of-bitvectors
primitives). Given these &quot;specs&quot;, one could lower each ISLE rule to
SMT clauses rather than executable Rust code, and search for cases
where it is incorrect. I won&#x27;t steal any thunder here -- the work is
really exciting (also still in progress) and the researchers will
present it in due time -- but it&#x27;s an example of what declarative DSLs
allow.&lt;&#x2F;p&gt;
&lt;p&gt;The ISLE-to-Rust compiler (metacompiler) is also a fairly complex tool
in its own right, and has had bugs in the past. What makes us
confident that we are generating code that correctly implements the
rules -- even if the rules themselves are verified? To answer that
question, we &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5435#pullrequestreview-1228022447&quot;&gt;have
considered&lt;&#x2F;a&gt;
how to verify &lt;em&gt;the translation of the ISLE rules&lt;&#x2F;em&gt;. Our current plan is
to modify the ISLE compiler to generate both the production backend,
with intelligently-scheduled matching operations, and a &quot;naive&quot;
version that runs through rules sequentially in priority order. If the
latter picks the same rule as the former, then we know we have a
faithful implementation of the left-hand-side matching (and the
translation of the right-hand-side expression to constructor calls is
straightforward in comparison, so we trust it already). Then we can
trust that our verified-correct rules are being applied as written.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;optimizing-the-instruction-selector&quot;&gt;Optimizing the Instruction Selector&lt;&#x2F;h3&gt;
&lt;p&gt;Next, my colleague Jamey is working on &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5435&quot;&gt;a new ISLE metacompiler
backend&lt;&#x2F;a&gt; that
lowers ISLE rules to a planned sequence of matching ops more
efficiently. The ability to change the way that compiler-backend code
matches the IR was a theoretical benefit of a DSL-based approach,
recognized and evaluated as we weighed the pros and cons of ISLE, but
admittedly was a bit speculative -- no one knew if we would actually
find a better way to generate code from the rules than the initial
ISLE compiler and its &quot;trie&quot;-based approach. I am very excited that
Jamey actually &lt;em&gt;did&lt;&#x2F;em&gt; manage to do this (and we can hopefully look
forward to a future blog post in which he describes this in more
detail!).&lt;&#x2F;p&gt;
&lt;h3 id=&quot;mid-end-optimizations&quot;&gt;Mid-end Optimizations&lt;&#x2F;h3&gt;
&lt;p&gt;Finally, as mentioned above, we were able to find a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;opts&#x2F;algebraic.isle&quot;&gt;second&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;opts&#x2F;cprop.isle&quot;&gt;use&lt;&#x2F;a&gt;
for ISLE as part of the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;cranelift-isel-isle-peepmatic.md&quot;&gt;egraph-based mid-end
optimizer&lt;&#x2F;a&gt;
work, which we &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;5587&quot;&gt;just enabled by
default&lt;&#x2F;a&gt;. (I
hope to write a blog post about this soon too!) This was excellent and
very satisfying validation to me personally that ISLE is more general
than just Cranelift backends: we were able to write a new prelude (and
actually share a bunch of extractors too) so that rules could specify
IR-to-IR rewrites, in addition to IR-to-machine-instruction
lowerings. This will allow us to iterate on and improve the compiler&#x27;s
suite of optimizations more easily in the future, and it will also
have follow-on benefits in terms of shared infrastructure:
verification tools that we build for backend lowering rules can also
be adapted to verify mid-end rules. In addition, there are potentially
other ways that putting all of the compiler&#x27;s core program-transform
logic into the same DSL will allow us to blur the lines, combine
stages, or move logic around in ways we can&#x27;t yet anticipate today;
but it seems like a worthwhile investment. In any case, ISLE proved to
be a quite useful tool in developing pattern-matching Rust code with
less boilerplate and tedium than before!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;ISLE has been great fun to design, build, and use; while we have learned a lot
and made several language adjustments and extensions over the past year, I
think that there is general consensus that it has made the compiler backends
easier to work on. I&#x27;m excited to see how the ongoing work (verification, new
ISLE codegen strategy) turns out, and how the language evolves in general. And
as noted above, ISLE&#x27;s secret is that it is actually more general than
instruction selection, or Cranelift: if you find another way to use it, I&#x27;d be
very interested in hearing about it!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to Jamey Sharp and Nick Fitzgerald for reading and providing
very helpful feedback on a draft of this blog post, and to bjorn3 and
Adrian Sampson for feedback and typo fixes after publication.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;A Aho, M S Lam, R Sethi, J Ullman. Compilers: Principles,
Techniques, and Tools. 2006.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;S Muchnick. Advanced Compiler Design and Implementation. 1997.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;ISLE &lt;em&gt;does&lt;&#x2F;em&gt; have special knowledge about Rust enums, and the
ability to match on them efficiently with &lt;code&gt;match&lt;&#x2F;code&gt; expressions in
the generated Rust code, because to miss this optimization would
be very costly. But in principle it could have been built
without this, involving only Rust function calls and control
flow around them.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;4&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;There is a parallel to the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Prolog&quot;&gt;Prolog&lt;&#x2F;a&gt; language here, in
that it also allows for high-level, declarative expression of
rule-matching with backtracking while also having a well-defined
sequential execution semantics with FFI to an imperative
world. In fact Prolog was a central inspiration for ISLE&#x27;s
design. The key differences are that (i) ISLE does not have
full backtracking -- once a left-hand side matches, we cannot
backtrack, as the right-hand sides are infallible -- and (ii)
there is no unification, and all dataflow in a term is
unidirectional, from input (value to be destructured) to output
(arguments). We used to have &quot;argument polarity&quot;, which was
closer to unification in that it allowed a configurable (but
fixed) input&#x2F;output direction for each argument to an
extractor. We discarded this feature, however, in favor of a
more general &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;blob&#x2F;main&#x2F;accepted&#x2F;isle-extended-patterns.md&quot;&gt;&lt;code&gt;if-let&lt;&#x2F;code&gt;
clause&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Cranelift, Part 4: A New Register Allocator</title>
        <published>2022-06-09T00:00:00+00:00</published>
        <updated>2022-06-09T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2022/06/09/cranelift-regalloc2/"/>
        <id>https://cfallin.org/blog/2022/06/09/cranelift-regalloc2/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2022/06/09/cranelift-regalloc2/">&lt;p&gt;This post is the fourth part of a three-part series&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; describing
work that I have been doing to improve the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&quot;&gt;Cranelift&lt;&#x2F;a&gt;
compiler. In this post, I&#x27;ll describe the work that went into
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&quot;&gt;regalloc2&lt;&#x2F;a&gt;, a new
register allocator I developed over the past year. The allocator
started as an effort to port &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;jit&#x2F;BacktrackingAllocator.cpp&quot;&gt;IonMonkey&#x27;s register
allocator&lt;&#x2F;a&gt;
to Rust, in a standalone form usable by Cranelift (&quot;how hard could it
be?&quot;), but quickly evolved during a focused optimization effort to
have its own unique design and implementation aspects.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Register_allocation&quot;&gt;Register
allocation&lt;&#x2F;a&gt; is a
classically hard
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;NP-hardness&quot;&gt;NP-hard!&lt;&#x2F;a&gt;) problem, and a
good solution is mainly a question of concocting a suitable
combination of heuristics and engineering high-performance data
structures and algorithms such that &lt;em&gt;most&lt;&#x2F;em&gt; cases are &lt;em&gt;good enough&lt;&#x2F;em&gt;
with &lt;em&gt;few enough&lt;&#x2F;em&gt; exceptions. As I&#x27;ve found, this rabbithole goes
infinitely deep and there is always more to improve, but for now we&#x27;re
in a fairly good place.&lt;&#x2F;p&gt;
&lt;p&gt;We &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;3989&quot;&gt;recently switched over to
regalloc2&lt;&#x2F;a&gt;,
and Cranelift 0.84 and the concurrently released Wasmtime 0.37 use it
by default. Some
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;3942&quot;&gt;measurements&lt;&#x2F;a&gt;
show that it generally improves overall compiler speed (which was and
is dominated by regalloc time) by 20%-ish, and generated code
performance improves on register pressure-impacted benchmarks up to
10-20% in Wasmtime. In Cranelift&#x27;s use as a backend for rustc via
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bjorn3&#x2F;rustc_codegen_cranelift&quot;&gt;rustc_codegen_cranelift&lt;&#x2F;a&gt;,
runtime performance improved by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;3989#issuecomment-1092110720&quot;&gt;up to
7%&lt;&#x2F;a&gt;. The
allocator seems to have generally fewer compile-time outliers than our
previous allocator, which in many cases is a more important property
than 10-20% improvements. Overall, it seems to be a reasonable
performance win with few downsides. Of course, this work benefits
hugely from the lessons learned in developing that prior allocator,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;&quot;&gt;regalloc.rs&lt;&#x2F;a&gt;, which
was work primarily done by Julian Seward and Benjamin Bouvier; I
learned enormous amounts talking to them and watching their work on
regalloc.rs in 2020, and this work stands on their shoulders as well
as IonMonkey&#x27;s.&lt;&#x2F;p&gt;
&lt;p&gt;This post will make a whirlwind tour through several topics. After
reviewing the register allocation problem and why it is important, we
will learn about regalloc2&#x27;s approach: its abstractions, its key
features, and how its passes work. We&#x27;ll then spend a good amount of
time on &quot;lessons learned&quot;: how we attained reasonable performance; how
we managed to make anything work at all in reasonable development
time; how we migrated a large existing compiler codebase to new
foundational types and invariants; and some perspective on ongoing
tuning and refinements.&lt;&#x2F;p&gt;
&lt;p&gt;A &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;doc&#x2F;DESIGN.md&quot;&gt;design
document&lt;&#x2F;a&gt;
for the allocator exists as well, and this blogpost is meant to be
complementary: we&#x27;ll walk through some interesting bits of the
architecture, but anyone hoping to actually grok the thing in its
entirety (and please talk to me if this is you!) is advised to dig
into the design doc and the source for the full story.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, a fair warning: this post has become a bit of a book chapter;
if you&#x27;re looking for a tl;dr, you can skip to the
&lt;a href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;blog&#x2F;2022&#x2F;06&#x2F;09&#x2F;cranelift-regalloc2&#x2F;#four-lessons&quot;&gt;Lessons&lt;&#x2F;a&gt; section or the &lt;a href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;blog&#x2F;2022&#x2F;06&#x2F;09&#x2F;cranelift-regalloc2&#x2F;#conclusions&quot;&gt;Conclusions&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;register-allocation-recap&quot;&gt;Register Allocation: Recap&lt;&#x2F;h2&gt;
&lt;p&gt;First, let&#x27;s recap what register allocation is and why it&#x27;s
important.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The basic problem at hand is to assign &lt;em&gt;storage locations&lt;&#x2F;em&gt; to &lt;em&gt;program
dataflow&lt;&#x2F;em&gt;. We can imagine our compiler input as a graph of operators,
each of which consumes some values and produces others (let&#x27;s ignore
control flow for the moment):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;cranelift-regalloc2-fig1-web.svg&quot; alt=&quot;Figure: A dataflow graph&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Some compilers directly represent the program in this way (called a
&quot;sea of nodes&quot; IR) but most, including Cranelift, &lt;em&gt;linearize&lt;&#x2F;em&gt; the
operators into a particular program order. And in fact, by the time
that the register allocator does its work, the &quot;operators&quot; are really
machine instructions, or something very close to them, so we will call
them that: we have a sequence of instructions, and &lt;em&gt;program points&lt;&#x2F;em&gt;
before and after each one. Even in this new view, we still have the
dataflow connectivity that we did above; now, each edge corresponds to
a particular &lt;em&gt;range of program points&lt;&#x2F;em&gt;:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;cranelift-regalloc2-fig2-web.svg&quot; alt=&quot;Figure: Dataflow graph with liveranges&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We call each of these dataflow edges, representing a value that must
flow from one instruction to another, a &lt;em&gt;liverange&lt;&#x2F;em&gt;. We say that
virtual registers -- the names we give the values before regalloc --
have a set of liveranges.&lt;&#x2F;p&gt;
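&lt;p&gt;As a rough illustration (a toy sketch of my own, not Cranelift&#x27;s actual code), liveranges for a straight-line program can be computed in one forward pass, extending each virtual register&#x27;s range at every use:&lt;&#x2F;p&gt;

```python
def liveranges(insts):
    """insts: list of (defs, uses) tuples, in program order.
    Returns {vreg: (def_point, last_use_point)}."""
    ranges = {}
    for point, (defs, uses) in enumerate(insts):
        for v in uses:
            start, _ = ranges[v]        # must have been defined earlier
            ranges[v] = (start, point)  # extend the range to this use
        for v in defs:
            ranges[v] = (point, point)  # a range begins at its def
    return ranges

prog = [
    (["v0"], []),           # v0 := ...
    (["v1"], ["v0"]),       # v1 := op v0
    (["v2"], ["v0", "v1"])  # v2 := op v0, v1
]
print(liveranges(prog))
# v0 lives over points 0..2, v1 over 1..2, v2 is defined at 2
```

With control flow this becomes a fixpoint dataflow computation rather than a single pass, but the core idea -- ranges of program points per virtual register -- is the same.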
&lt;p&gt;With control flow, liveranges might be discontiguous from a
linear-instruction-order point of view, because of jumps; for example:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;cranelift-regalloc2-fig3-web.svg&quot; alt=&quot;Figure: Control flow with discontiguous liveranges&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Each instruction requires its inputs to be in registers and produces
its outputs in registers, usually.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; So, the job of the register
allocator is to choose one of a finite set of machine registers to
convey each of the liveranges from its definition(s) to all of its
use(s).&lt;&#x2F;p&gt;
&lt;p&gt;Why is this hard? In brief, because we may not have enough
registers. We thus enter a world of &lt;em&gt;compromise&lt;&#x2F;em&gt;: if more values are
&quot;alive&quot; (need to be kept around for later use) than the number of
registers that the CPU has, then we have to put some of them somewhere
else, and bring them back into registers only when we actually need to
use them. That &quot;somewhere else&quot; is usually memory in the function&#x27;s
stack frame that the compiler reserves (a &quot;stackslot&quot;), and the
process of saving values away to free up registers is called
&quot;spilling&quot;.&lt;&#x2F;p&gt;
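&lt;p&gt;To make &quot;more values alive than registers&quot; concrete, here is a small sketch (names and framing mine): a sweep over liverange endpoints computes the peak number of simultaneously-live values, and if that peak exceeds the machine&#x27;s register count, some spill is unavoidable:&lt;&#x2F;p&gt;

```python
def max_pressure(ranges):
    """ranges: list of (start, end) liveranges over program points.
    Returns the maximum number of values live at any single point."""
    events = []
    for start, end in ranges:
        events.append((start, 1))     # value becomes live
        events.append((end + 1, -1))  # value dies after its last use
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

ranges = [(0, 5), (1, 3), (2, 4), (3, 6)]
print(max_pressure(ranges))      # 4 values overlap at point 3
print(max_pressure(ranges) > 3)  # True: with 3 registers, a spill is forced
```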
&lt;p&gt;One more concept before we go further: we may want to choose to place
a liverange in &lt;em&gt;different&lt;&#x2F;em&gt; places throughout its lifetime, depending
on the needs of the instruction sequence at certain points. For
example, if a value is produced at the top of a function, then dormant
(but live) for a while, and then used frequently in a tight loop at
the bottom of the function, we don&#x27;t &lt;em&gt;really&lt;&#x2F;em&gt; want to spill it, and
reload it from memory every loop iteration. In other words, given this
program:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v0 := ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v1 := ...    &#x2F;&#x2F; lots of intermediate defs that use all regs&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v2 := ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    vN := ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;loop:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v100 := ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v101 := add v0, v100&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    store v101, ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    jmp loop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;we ideally do not want to assign a stack slot to the value &lt;code&gt;v0&lt;&#x2F;code&gt; and
then produce machine code like&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    add rax, rbx      ;; `v0` stored in `rax`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mov [rsp+16], rax ;; spill `v0` to a stack slot&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;loop:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mov rax, [rsp+16] ;; load `v0` from stack on every iteration -- expensive!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    add rcx, rax&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mov [...], rcx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    jmp loop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;but if we only choose &lt;em&gt;a location&lt;&#x2F;em&gt; per liverange, we either choose a
register, or a stackslot -- no middle ground. Intuitively, it seems
like we should be able to put the value in a different place while it
is &quot;dormant&quot; (spill it to the stack, most likely), then pick an
optimal location during the tight loop. To do so, we need to refer to
the two parts of the liverange separately, and assign each one a
separate location. This is called &lt;em&gt;liverange splitting&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;If we split liveranges, we can then do something like:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    add rax, rbx      ;; `v0` stored in `rax`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mov [rsp+16], rax ;; spill `v0` to a stack slot&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mov rdx, [rsp+16] ;; move `v0` from stackslot to a new liverange in `rdx`&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;loop:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    add rcx, rdx      ;; no load within loop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mov [...], rcx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    jmp loop&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This seems quite powerful and useful -- so why doesn&#x27;t every register
allocator do this? In brief, because it &lt;em&gt;makes the problem much much
harder&lt;&#x2F;em&gt;. When we have a fixed number of liveranges, we have a fixed
amount of work, and we assign a register per liverange, probably in
some priority order. And then we are done.&lt;&#x2F;p&gt;
&lt;p&gt;But as soon as we allow for splitting, we can &lt;em&gt;increase&lt;&#x2F;em&gt; the amount of
work we have almost arbitrarily: we could split every liverange into
many tiny pieces, greatly multiplying the cost of register
allocation. A well-placed split reduces the constraints in the problem
we&#x27;re solving, making it easier, but too many splits just increase
work and also the likelihood that we will unnecessarily move values
around.&lt;&#x2F;p&gt;
&lt;p&gt;Splitting is thus the kind of problem that requires finely-tuned
heuristics. To be concrete, consider the example above: we showed a
split outside of the tight inner loop. But a naive splitting
implementation might just split before the use, putting a move from
stack to register inside the inner loop. Some sort of cost model is
necessary to put splits in &quot;cheap&quot; places.&lt;&#x2F;p&gt;
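&lt;p&gt;A toy version of such a cost model (illustrative only -- regalloc2&#x27;s real heuristics are more involved) might weight each candidate split point by the loop depth at that point, so that the resulting move lands outside hot loops:&lt;&#x2F;p&gt;

```python
def best_split_point(candidates, loop_depth):
    """candidates: program points where a split is legal;
    loop_depth: map from program point to loop-nesting depth there."""
    # A move at depth d runs roughly 10**d times as often, so prefer
    # the shallowest (cheapest) legal point.
    return min(candidates, key=lambda p: 10 ** loop_depth[p])

# Points 0-3 sit before the loop (depth 0); points 4-7 are inside it.
depth = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
print(best_split_point([3, 5], depth))  # 3: the split lands outside the loop
```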
&lt;p&gt;With all of that, hopefully you have some feel for the problem: we
compute liveranges, we might split them, and then we choose where to
put them. That&#x27;s (almost) it -- modulo many tiny details.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;regalloc2-s-design&quot;&gt;regalloc2&#x27;s Design&lt;&#x2F;h2&gt;
&lt;p&gt;At a high level, regalloc2 is a &lt;em&gt;backtracking&lt;&#x2F;em&gt; register allocator that
computes precise liveranges, performs &lt;em&gt;merging&lt;&#x2F;em&gt; according to some
heuristics into &quot;bundles&quot;, and then runs a main loop that assigns
locations to bundles, sometimes &lt;em&gt;splitting&lt;&#x2F;em&gt; them to make the
allocation problem easier (or possible at all). Once every part of
every liverange has a location, it inserts move instructions to
connect all of the pieces.&lt;&#x2F;p&gt;
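&lt;p&gt;In Python pseudocode, that high-level loop might look something like the following (a heavily simplified sketch with names of my own choosing, not regalloc2&#x27;s actual code; real eviction and splitting are replaced here by a spill-to-stack fallback):&lt;&#x2F;p&gt;

```python
import heapq

def allocate(bundles, num_regs):
    """bundles: list of (spill_weight, start, end, name); higher-weight
    bundles are processed first. Returns {name: reg index or "stack"}."""
    # Max-heap by weight, via negated keys.
    queue = [(-w, s, e, n) for (w, s, e, n) in bundles]
    heapq.heapify(queue)
    assigned = {r: [] for r in range(num_regs)}  # reg index -> [(start, end)]
    out = {}
    while queue:
        _w, s, e, n = heapq.heappop(queue)
        for r in range(num_regs):
            # Two ranges overlap unless one ends before the other starts.
            if not any(not (s2 > e or s > e2) for (s2, e2) in assigned[r]):
                assigned[r].append((s, e))
                out[n] = r
                break
        else:
            # regalloc2 would evict a lower-weight bundle or split this
            # one and requeue the halves; we just spill to the stack.
            out[n] = "stack"
    return out

bundles = [(10, 0, 5, "v0"), (5, 1, 3, "v1"), (1, 2, 4, "v2")]
print(allocate(bundles, 2))  # {'v0': 0, 'v1': 1, 'v2': 'stack'}
```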
&lt;p&gt;Let&#x27;s break that down a bit:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;regalloc2 starts with &lt;em&gt;precise liveranges&lt;&#x2F;em&gt;. These are computed
according to the input to the allocator, which is a program that
refers to &lt;em&gt;virtual registers&lt;&#x2F;em&gt; and may be in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Static_single_assignment_form&quot;&gt;SSA&lt;&#x2F;a&gt;
form (one definition per register) or non-SSA (multiple definitions
per register).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;It then &lt;em&gt;merges&lt;&#x2F;em&gt; these liveranges into larger-than-liverange
&quot;bundles&quot;. If done correctly, this reduces work (fewer liverange
bundles to process) and also gives better code (when merged, it is
guaranteed that the related pieces will not need a move instruction
to join them).&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;It then builds a priority queue of bundles, and processes them until
every bundle has a location. (In simplified terms, regalloc2&#x27;s
entire job is to &quot;assign locations to bundles&quot;.) This processing may
involve &lt;em&gt;undoing&lt;&#x2F;em&gt;, or &lt;em&gt;backtracking&lt;&#x2F;em&gt;, earlier assignments, and may
also involve &lt;em&gt;splitting&lt;&#x2F;em&gt; bundles into smaller bundles when separate
pieces could attain better allocations.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We&#x27;ll explain each of these design aspects in turn below.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;input-instructions-with-operands&quot;&gt;Input: Instructions with Operands&lt;&#x2F;h3&gt;
&lt;p&gt;First, let&#x27;s talk about the &lt;em&gt;input&lt;&#x2F;em&gt; to the register allocator. To
understand how regalloc2 works, we first need to understand how it
sees the world. (Said another way, before we solve the problem, let&#x27;s
define it fully!)&lt;&#x2F;p&gt;
&lt;p&gt;regalloc2 processes a &quot;program&quot; that consists of instructions that
refer to &lt;em&gt;virtual registers&lt;&#x2F;em&gt; rather than real machine registers. These
instructions are arranged in a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_graph&quot;&gt;control-flow
graph&lt;&#x2F;a&gt; of basic
blocks.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#4&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The most important principle regarding the allocator&#x27;s view of the
program is: the &lt;em&gt;meaning&lt;&#x2F;em&gt; of instructions is mostly
irrelevant. Instead, the allocator cares mainly about how a particular
instruction &lt;em&gt;accesses program values&lt;&#x2F;em&gt; as registers: &lt;em&gt;which&lt;&#x2F;em&gt; values and
&lt;em&gt;how&lt;&#x2F;em&gt; (read or written), and with &lt;em&gt;what constraints&lt;&#x2F;em&gt; on
location. Let&#x27;s look more at the implications of this principle.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;constraints&quot;&gt;Constraints&lt;&#x2F;h4&gt;
&lt;p&gt;The allocator views the input program as a sequence of instructions
that &lt;em&gt;use&lt;&#x2F;em&gt; and &lt;em&gt;define&lt;&#x2F;em&gt; virtual registers. Every access to a register
managed by the regalloc (an &quot;allocatable register&quot;) must be via a
virtual register.&lt;&#x2F;p&gt;
&lt;p&gt;Already we see a divergence from common ISAs like x86: there are
instructions that implicitly access certain registers. (One form of
the x86 integer multiply instruction always places its output in &lt;code&gt;rax&lt;&#x2F;code&gt;
and &lt;code&gt;rdx&lt;&#x2F;code&gt;, for example.) Since these registers are not mentioned by
the instruction explicitly, one might initially think that there is no
need to create regalloc operands or use virtual registers for
them. But these registers (e.g. &lt;code&gt;rax&lt;&#x2F;code&gt; and &lt;code&gt;rdx&lt;&#x2F;code&gt;) can also be used by
explicit inputs and outputs to instructions; so the regalloc at least
needs to know that the registers will be clobbered, and at some later
point presumably the results will be read and the registers become
free again.&lt;&#x2F;p&gt;
&lt;p&gt;We solve this problem by allowing &lt;em&gt;constraints&lt;&#x2F;em&gt; on operands. An
instruction that always reads or writes a specific physical register
still names a virtual-register operand. The only difference from an
ordinary instruction that can use any register is that this operand is
&lt;em&gt;constrained to a particular register&lt;&#x2F;em&gt;. This lets the allocator
uniformly reason about virtual registers allocated to physical
registers as the basic way that space is reserved; the constraint
becomes only a detail of the allocation process.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s take an x86 instruction &lt;code&gt;mul&lt;&#x2F;code&gt; (integer multiply) as an example
to see how this works. Ordinarily, one would write the following in
assembly:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; Multiplicand is implicitly in rax.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mul rcx  ; multiply by rcx&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; 128-bit wide result is implicitly placed in rdx (high 64 bits)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; and rax (low 64 bits).&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The instruction &lt;code&gt;mul rcx&lt;&#x2F;code&gt; does not tell the whole story from
regalloc2&#x27;s point of view, so we would instead present an instruction
like so to the register allocator, with constraints annotating
uses&#x2F;definitions of virtual registers:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; Put inputs in v0 and v1.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    mul v0 [use, fixed rax], v1 [use, any reg], v2 [def, fixed rax], v3 [def, fixed rdx]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ;; Use results in v2 and v3.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The allocator will &quot;do the right thing&quot; by either inserting moves or
generating inputs directly into, and using outputs directly from, the
appropriate registers. The advantage of this scheme is that aside from
the constraints, it makes &lt;code&gt;mul&lt;&#x2F;code&gt; behave like any other instruction: it
isolates complexity in one place and presents a more uniform,
easier-to-use abstraction for the rest of the compiler.&lt;&#x2F;p&gt;
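&lt;p&gt;One way to picture this (a hypothetical encoding of my own, not regalloc2&#x27;s actual API) is as a list of (virtual register, kind, constraint) operands, plus a checker that a final assignment satisfies every constraint:&lt;&#x2F;p&gt;

```python
ANY = "any"  # the operand may live in any allocatable register

def check(operands, assignment):
    """operands: list of (vreg, kind, constraint) where constraint is
    ANY or the name of a fixed physical register."""
    for vreg, kind, constraint in operands:
        if constraint != ANY and assignment[vreg] != constraint:
            return False
    return True

# x86 mul, as presented to the allocator: v0 and v2 fixed to rax, v3 to rdx.
mul = [("v0", "use", "rax"), ("v1", "use", ANY),
       ("v2", "def", "rax"), ("v3", "def", "rdx")]
print(check(mul, {"v0": "rax", "v1": "rcx", "v2": "rax", "v3": "rdx"}))  # True
print(check(mul, {"v0": "rbx", "v1": "rcx", "v2": "rax", "v3": "rdx"}))  # False
```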
&lt;h4 id=&quot;modify-operands-and-reused-input-constraints&quot;&gt;&quot;Modify&quot; Operands and Reused-Input Constraints&lt;&#x2F;h4&gt;
&lt;p&gt;The next difference we might observe between a real ISA like x86 and a
compiler&#x27;s view of the world is: an operator in a compiler IR usually
produces its result as a &lt;em&gt;completely new value&lt;&#x2F;em&gt;, but real machine
instructions often &lt;em&gt;modify existing values&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;For example, on x86, arithmetic operations are written in &lt;em&gt;two-operand
form&lt;&#x2F;em&gt;. They look like &lt;code&gt;add rax, rbx&lt;&#x2F;code&gt;, which means: compute the sum of
&lt;code&gt;rax&lt;&#x2F;code&gt; and &lt;code&gt;rbx&lt;&#x2F;code&gt;, and store the result in &lt;code&gt;rax&lt;&#x2F;code&gt;, overwriting that input
value.&lt;&#x2F;p&gt;
&lt;p&gt;The register allocator reasons about segments of value dataflow from
definitions (defs) to uses; but the use of &lt;code&gt;rax&lt;&#x2F;code&gt; in this example seems
to be neither. Or rather, it is both: it consumes a value in &lt;code&gt;rax&lt;&#x2F;code&gt;,
and it produces a value in &lt;code&gt;rax&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;But we can&#x27;t decompose it into a separate use and def either, because
then the allocator might choose different locations for each. The
encoding of &lt;code&gt;add rax, rbx&lt;&#x2F;code&gt; only has slots for two register names: the
input in &lt;code&gt;rax&lt;&#x2F;code&gt; and output in &lt;code&gt;rax&lt;&#x2F;code&gt; must be in the same register!&lt;&#x2F;p&gt;
&lt;p&gt;We solve this by introducing a new kind of constraint: the &quot;reused
input register&quot; constraint.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#5&quot;&gt;5&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; At the regalloc level, we say that the
&lt;code&gt;add&lt;&#x2F;code&gt; above has &lt;em&gt;three&lt;&#x2F;em&gt; operands: two inputs (uses) and an output (a
def). It exactly corresponds to the compiler IR-level operator, with
nicely separated values in different virtual registers. But, we
constrain the output by indicating that it &lt;em&gt;must be placed in the same
register as the input&lt;&#x2F;em&gt;. We can assert that this is the case when we
get final assignments from the regalloc, then emit that register
number into the &quot;first source and also destination&quot; slot of the
instruction.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#6&quot;&gt;6&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;So, instead of &lt;code&gt;add rax, rbx&lt;&#x2F;code&gt; (or &lt;code&gt;add v0, v1&lt;&#x2F;code&gt; with &lt;code&gt;v0&lt;&#x2F;code&gt; a &quot;modify&quot;
operand), we can present the following 3-operand instruction to the
register allocator:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    add v0 [use, any reg], v1 [use, any reg], v2 [def, reuse-input(0)]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This corresponds more closely to what the compiler IR describes,
namely a new value for the result of the add and non-destructive uses
of both operands. All of the complexity of saving the destructive
source if needed is pushed to the allocator itself.&lt;&#x2F;p&gt;
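&lt;p&gt;A sketch of the emission side (format and names illustrative): because the allocator guarantees that the output landed in the same register as the reused input, the backend can assert that fact and emit the destructive two-operand form directly:&lt;&#x2F;p&gt;

```python
def emit_two_operand(op, dst, src0, src1):
    # reuse-input(0): the def was constrained to the same register as
    # the first use, so dst and src0 must coincide by construction.
    assert dst == src0, "allocator violated reuse-input constraint"
    return f"{op} {dst}, {src1}"

print(emit_two_operand("add", "rax", "rax", "rbx"))  # add rax, rbx
```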
&lt;h4 id=&quot;program-points-early-and-late-operands&quot;&gt;Program Points, &quot;Early&quot; and &quot;Late&quot; Operands&lt;&#x2F;h4&gt;
&lt;p&gt;Finally, we need to go a bit deeper on what exactly it means to
allocate a register &quot;at&quot; an instruction. To see why there may be some
subtlety, let&#x27;s consider an example. Take the instruction &lt;code&gt;movzx&lt;&#x2F;code&gt; on
x86: this instruction does a 16-to-64-bit zero-extend, with one input
and one output. In pre-regalloc form with virtual registers, we could
write &lt;code&gt;movzx v1, v0&lt;&#x2F;code&gt;, reading an input in &lt;code&gt;v0&lt;&#x2F;code&gt; and putting the output
in &lt;code&gt;v1&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;An intuitive understanding of liveranges and the allocation problem
might lead us to reason: both &lt;code&gt;v0&lt;&#x2F;code&gt; and &lt;code&gt;v1&lt;&#x2F;code&gt; are &quot;live&quot; at this
instruction, so they overlap, and have to be placed in different
registers.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                          v0    v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                           :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v1 := movzx v0         |     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                 :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But an experienced assembly programmer, knowing that &lt;code&gt;v0&lt;&#x2F;code&gt; is not used
again after this instruction, might reuse its register for the
output. So for example, if it were initially in &lt;code&gt;r13&lt;&#x2F;code&gt;, one might write
&lt;code&gt;movzx r13, r13w&lt;&#x2F;code&gt; (the &lt;code&gt;r13w&lt;&#x2F;code&gt; is x86&#x27;s archaic way of writing &quot;the 16
bit version of &lt;code&gt;r13&lt;&#x2F;code&gt;&quot;).&lt;&#x2F;p&gt;
&lt;p&gt;But isn&#x27;t this an invalid assignment, because we have put two
liveranges in the same register &lt;code&gt;r13&lt;&#x2F;code&gt; when they are both &quot;live&quot; at
this particular instruction?&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that this will work fine, for a subtle reason: generally
instructions read all of their inputs, &lt;em&gt;then&lt;&#x2F;em&gt; write all of their
outputs. In other words, there is a sort of two-phase semantics to
most instructions. So we could say that the input &lt;code&gt;v0&lt;&#x2F;code&gt; is live up to,
and including, the &quot;read&quot; or &quot;early&quot; phase of this instruction, and
the output &lt;code&gt;v1&lt;&#x2F;code&gt; is live starting at the &quot;write&quot; or &quot;late&quot; phase of
this instruction. These two liveranges don&#x27;t conflict at all! So the
above figure showing liveranges overlapping becomes:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                          v0    v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                           :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   EARLY   |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    v1 := movzx v0               &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                   LATE         |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;regalloc2 (along with most other register allocators) thus has a
notion of &quot;when&quot; an operand occurs -- the &quot;operand position&quot; -- and it
calls these two points in an instruction &lt;code&gt;Early&lt;&#x2F;code&gt; and &lt;code&gt;Late&lt;&#x2F;code&gt;. Along
with this, throughout the allocator, we name program points (distinct
points at which allocations are made) as &lt;code&gt;Before&lt;&#x2F;code&gt; or &lt;code&gt;After&lt;&#x2F;code&gt; a given
instruction.&lt;&#x2F;p&gt;
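As a concrete sketch of this numbering: assuming instructions are indexed densely from 0, the two per-instruction points can be packed into a single integer so that points are totally ordered. The names here are illustrative, not regalloc2's exact API.

```rust
// Illustrative program-point encoding: each instruction index gets a
// Before (even) and After (odd) point, packed into one integer.
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct ProgPoint(u32);

impl ProgPoint {
    fn before(inst: u32) -> Self { ProgPoint(inst << 1) }
    fn after(inst: u32) -> Self { ProgPoint((inst << 1) | 1) }
    fn inst(self) -> u32 { self.0 >> 1 }
}

fn main() {
    // Before(5) < After(5) < Before(6): an input live only until Before(5)
    // never overlaps an output defined at After(5).
    assert!(ProgPoint::before(5) < ProgPoint::after(5));
    assert!(ProgPoint::after(5) < ProgPoint::before(6));
    assert_eq!(ProgPoint::after(7).inst(), 7);
}
```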
&lt;p&gt;One final bit of subtlety: when a single instruction from a regalloc
point of view actually emits multiple instructions at the machine
level, sometimes the usual phasing of reads and writes breaks
down. For example, maybe a pseudoinstruction becomes a sequence that
starts to write outputs before it has read all of its inputs. In such
a case, reusing one of the inputs (which is no longer live at &lt;code&gt;Late&lt;&#x2F;code&gt;)
as an output register could be catastrophic. For this reason,
regalloc2 &lt;em&gt;decouples&lt;&#x2F;em&gt; an operand&#x27;s position from its kind (use or
def). One could have an &quot;early def&quot; or a &quot;late use&quot;. Temporary
registers are also possible: these are live during both the early and late
points of an instruction, so the allocator must assign them a register
distinct from every input and output; they can then serve as scratch space
in the multi-instruction sequences emitted from one pseudoinstruction.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;regalloc2-s-view-of-operands&quot;&gt;regalloc2&#x27;s View of Operands&lt;&#x2F;h4&gt;
&lt;p&gt;To summarize, each instruction can have a sequence of &quot;operands&quot;, each
of which:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Names a &lt;em&gt;virtual register&lt;&#x2F;em&gt; that corresponds to a value in the
original program;&lt;&#x2F;li&gt;
&lt;li&gt;Indicates whether this value is read (&quot;used&quot;) or written
(&quot;defined&quot;);&lt;&#x2F;li&gt;
&lt;li&gt;Indicates when during the instruction execution the value is
accessed (&quot;early&quot;, before the instruction executes; or &quot;late&quot;, after
it does);&lt;&#x2F;li&gt;
&lt;li&gt;Indicates where the value should be placed: in a machine register of
a certain kind, or a specific machine register, or in the same
register that another operand took, or in a slot in the stack frame.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
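This summary can be sketched as a small set of Rust types. This is a hypothetical simplified model for illustration; regalloc2's real `Operand` packs the same information into a bit-packed `u32`.

```rust
// Illustrative (not regalloc2's actual) operand model.
#[derive(Copy, Clone, Debug, PartialEq)]
struct VReg(u32);

#[derive(Copy, Clone, Debug, PartialEq)]
enum Kind { Use, Def }

#[derive(Copy, Clone, Debug, PartialEq)]
enum Pos { Early, Late }

#[derive(Copy, Clone, Debug, PartialEq)]
enum Constraint {
    AnyReg,            // any register of the appropriate class
    FixedReg(u8),      // a specific physical register
    ReuseInput(usize), // the same register operand i received
    Stack,             // a slot in the stack frame
}

#[derive(Copy, Clone, Debug, PartialEq)]
struct Operand {
    vreg: VReg,
    kind: Kind,
    pos: Pos,
    constraint: Constraint,
}

fn main() {
    // The 3-operand add from earlier: two early uses in any register, and
    // a late def constrained to reuse input 0's register.
    let add = [
        Operand { vreg: VReg(0), kind: Kind::Use, pos: Pos::Early, constraint: Constraint::AnyReg },
        Operand { vreg: VReg(1), kind: Kind::Use, pos: Pos::Early, constraint: Constraint::AnyReg },
        Operand { vreg: VReg(2), kind: Kind::Def, pos: Pos::Late, constraint: Constraint::ReuseInput(0) },
    ];
    assert_eq!(add[2].constraint, Constraint::ReuseInput(0));
}
```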
&lt;h3 id=&quot;stage-1-live-ranges&quot;&gt;Step 1: Live Ranges&lt;&#x2F;h3&gt;
&lt;p&gt;We&#x27;ve described what the register allocator expects as its input. Now
let&#x27;s talk about how the input is processed into an &quot;allocation
problem&quot; that can be solved by the main algorithm.&lt;&#x2F;p&gt;
&lt;p&gt;The input is described as a graph of blocks of instructions, with
operands; but most of the allocator works in terms of &lt;em&gt;liveranges&lt;&#x2F;em&gt; and
&lt;em&gt;bundles of liveranges&lt;&#x2F;em&gt; instead.&lt;&#x2F;p&gt;
&lt;p&gt;In brief, a &lt;em&gt;liverange&lt;&#x2F;em&gt; (originally &quot;live range&quot;, but we say it so
often it has become one word!) is a span of program points -- that is,
a range of &quot;before&quot; and &quot;after&quot; points on instructions -- that
connects a program value in a virtual register from a definition to
one or more uses. A liverange represents one unit of needed storage,
either as a register or a slot in the stackframe.&lt;&#x2F;p&gt;
&lt;p&gt;The basic strategy of regalloc2 is to reduce the input into liveranges
as soon as possible, and then operate mostly on liveranges,
translating back to program terms (inserted moves and assigned
registers per instruction) only at the very end of the process. This
lets us reason about a simpler &quot;core problem&quot; that is actually quite
concisely specified:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;A liverange is a span of program points, which can be numbered
consecutively;&lt;&#x2F;li&gt;
&lt;li&gt;A liverange has constraints at certain points that arise from
program uses&#x2F;defs;&lt;&#x2F;li&gt;
&lt;li&gt;We must assign locations to liveranges, such that:
&lt;ul&gt;
&lt;li&gt;At any point, at most one liverange lives in a given location;&lt;&#x2F;li&gt;
&lt;li&gt;At all points, a liverange&#x27;s constraints are satisfied;&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;We are allowed to split a liverange into pieces and assign different
locations to each piece. However, moves between pieces have a cost,
and we must minimize cost.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;And that&#x27;s it! No need to reason about ISA specifics, or the way that
regalloc2 generates moves, or anything else. We&#x27;ll worry about
generating moves to &quot;reify&quot; (make real) the assignments later. For
now, we just need to slot the liveranges into locations and avoid
conflicts.&lt;&#x2F;p&gt;
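A minimal model of this core problem, assuming half-open spans over consecutively numbered program points, might look like:

```rust
// Half-open spans of program points; two liveranges may share a location
// exactly when their spans do not overlap.
#[derive(Copy, Clone, Debug)]
struct LiveRange { from: u32, to: u32 } // covers [from, to)

fn overlaps(a: LiveRange, b: LiveRange) -> bool {
    a.from < b.to && b.from < a.to
}

fn main() {
    let v0 = LiveRange { from: 0, to: 10 };
    let v1 = LiveRange { from: 10, to: 20 }; // begins exactly where v0 ends
    let v2 = LiveRange { from: 5, to: 15 };
    assert!(!overlaps(v0, v1)); // may be assigned the same register
    assert!(overlaps(v0, v2));  // must get distinct locations
}
```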
&lt;h4 id=&quot;computing-liveness&quot;&gt;Computing Liveness&lt;&#x2F;h4&gt;
&lt;p&gt;To build up our set of liveranges, we first need to compute
&lt;em&gt;liveness&lt;&#x2F;em&gt;. This is a property of any particular virtual register at a
program point indicating that it has a value that will eventually be
used.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Live_variable_analysis&quot;&gt;Liveness
analysis&lt;&#x2F;a&gt; is an
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Data-flow_analysis&quot;&gt;iterative dataflow
analysis&lt;&#x2F;a&gt; that is
computed in the backward direction: any use of a virtual register
propagates liveness backward (&quot;upward&quot; in the program), and a
definition of that virtual register&#x27;s value ends the liveness (when
scanning upward), because the old value (from above) is no longer
relevant.&lt;&#x2F;p&gt;
&lt;p&gt;Thus the first thing that regalloc2 does with the input program is to
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;33611a68b90e40869bba52934449315a8f4e5477&#x2F;src&#x2F;ion&#x2F;liveranges.rs#L318-L319&quot;&gt;run a worklist algorithm to compute precise
liveness&lt;&#x2F;a&gt;. This
produces a bitset&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#7&quot;&gt;7&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; that, at each basic block entry and exit, gives
us the set of live virtual registers.&lt;&#x2F;p&gt;
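The worklist computation can be sketched as the classic backward fixpoint, live-in = uses + (live-out - defs), where live-out is the union of the successors' live-in sets. This sketch uses `HashSet`s and a simple iterate-until-stable loop for clarity where regalloc2 uses compact bitsets and a true worklist; the `Block` type is illustrative.

```rust
use std::collections::HashSet;

// Illustrative block summary: `uses` are vregs read before any local def,
// `defs` are vregs written, `succs` are successor block indices.
struct Block { uses: HashSet<u32>, defs: HashSet<u32>, succs: Vec<usize> }

// Backward iterative liveness. The live-in sets only ever grow, so
// iterating to a fixpoint terminates.
fn liveness(blocks: &[Block]) -> Vec<HashSet<u32>> {
    let mut live_in: Vec<HashSet<u32>> = vec![HashSet::new(); blocks.len()];
    let mut changed = true;
    while changed {
        changed = false;
        for b in (0..blocks.len()).rev() {
            // live-out is the union of the successors' live-in sets.
            let mut live: HashSet<u32> = HashSet::new();
            for &s in &blocks[b].succs {
                live.extend(live_in[s].iter().copied());
            }
            // Transfer function: in = uses + (out - defs).
            for d in &blocks[b].defs { live.remove(d); }
            live.extend(blocks[b].uses.iter().copied());
            if live != live_in[b] {
                live_in[b] = live;
                changed = true;
            }
        }
    }
    live_in
}

fn main() {
    // Block 0 defines v0 and falls through to block 1, which uses it.
    let blocks = vec![
        Block { uses: HashSet::new(), defs: HashSet::from([0]), succs: vec![1] },
        Block { uses: HashSet::from([0]), defs: HashSet::new(), succs: vec![] },
    ];
    let live_in = liveness(&blocks);
    assert!(live_in[1].contains(&0));  // v0 is live into block 1...
    assert!(!live_in[0].contains(&0)); // ...but not into block 0, where it is defined.
}
```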
&lt;p&gt;Once we know which registers are live into and out of each basic
block, we can &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;33611a68b90e40869bba52934449315a8f4e5477&#x2F;src&#x2F;ion&#x2F;liveranges.rs#L413-L419&quot;&gt;perform block-local
processing&lt;&#x2F;a&gt;
to compute actual liveranges with each use of the register properly
noted. This is another backward scan, but this time we &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;33611a68b90e40869bba52934449315a8f4e5477&#x2F;src&#x2F;ion&#x2F;liveranges.rs#L176-L195&quot;&gt;build the data
structures&lt;&#x2F;a&gt;
we&#x27;ll use for the rest of the allocation process.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;normalization-and-saving-fixups-for-later&quot;&gt;Normalization, and Saving Fixups for Later&lt;&#x2F;h4&gt;
&lt;p&gt;We mentioned above that one way to see the liverange-building step is
as a simplification of the problem to its core essence, in order to
more easily solve it. &quot;Ranges that may overlap&quot; is certainly simpler
than &quot;instructions that access registers with certain
semantics&quot;. However, even the constraints on the liveranges can be
made simpler in several ways.&lt;&#x2F;p&gt;
&lt;p&gt;A good example of a complex set of constraints is the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    inst v0 [use, fixed r0], v0 [use, fixed r1], v1 [def, any reg]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is an instruction that has two inputs, and takes the inputs in
fixed physical registers &lt;code&gt;r0&lt;&#x2F;code&gt; and &lt;code&gt;r1&lt;&#x2F;code&gt;. This is completely reasonable
and such instructions exist in real ISAs (see, e.g., x86&#x27;s integer
divide instruction, with inputs in &lt;code&gt;rdx&lt;&#x2F;code&gt; and &lt;code&gt;rax&lt;&#x2F;code&gt;, or a call with an
ABI that passes arguments in fixed registers). If the two inputs
happen to be given the same program value, here virtual register &lt;code&gt;v0&lt;&#x2F;code&gt;,
then we have created an impossible constraint: we require &lt;code&gt;v0&lt;&#x2F;code&gt; to be
in &lt;em&gt;both&lt;&#x2F;em&gt; &lt;code&gt;r0&lt;&#x2F;code&gt; and &lt;code&gt;r1&lt;&#x2F;code&gt; at the same time.&lt;&#x2F;p&gt;
&lt;p&gt;As we have formulated the problem, a liverange is in only one place at
a time; and in fact this is a very useful simplifying invariant, and a
simpler model than &quot;there are N copies of the virtual register at
once&quot; (which one(s) are up-to-date, if we allow multiple defs?).&lt;&#x2F;p&gt;
&lt;p&gt;We can &quot;simplify to a previously solved problem&quot; in this case with a
neat trick: we keep a side-list of &quot;fixup moves&quot; to add back in, after
we complete allocation, and we insert such a fixup move from &lt;code&gt;r0&lt;&#x2F;code&gt; to
&lt;code&gt;r1&lt;&#x2F;code&gt; just before this instruction. Then we &lt;em&gt;delete the constraint&lt;&#x2F;em&gt; on
the second operand that uses &lt;code&gt;v0&lt;&#x2F;code&gt;. The rest of the allocation will
proceed as if &lt;code&gt;v0&lt;&#x2F;code&gt; were only required in &lt;code&gt;r0&lt;&#x2F;code&gt;; it will end up in that
location; and the fixup move will copy it to &lt;code&gt;r1&lt;&#x2F;code&gt; as well.&lt;&#x2F;p&gt;
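The fixup-move trick can be sketched as follows. The types and the exact way the second constraint is relaxed are illustrative, not regalloc2's real representation.

```rust
// Illustrative model: the same vreg required in two different fixed
// registers is an impossible constraint; record a fixup move r_a -> r_b
// and relax the second operand's constraint instead.
#[derive(Copy, Clone, PartialEq, Debug)]
enum Constraint { AnyReg, FixedReg(u8) }

#[derive(Copy, Clone, Debug)]
struct Operand { vreg: u32, constraint: Constraint }

// Returns (from_reg, to_reg) fixup moves to insert just before the
// instruction, after rewriting the conflicting constraints in place.
fn normalize(ops: &mut [Operand]) -> Vec<(u8, u8)> {
    let mut fixups = Vec::new();
    for i in 0..ops.len() {
        for j in (i + 1)..ops.len() {
            if ops[i].vreg != ops[j].vreg { continue; }
            if let (Constraint::FixedReg(a), Constraint::FixedReg(b)) =
                (ops[i].constraint, ops[j].constraint)
            {
                if a != b {
                    fixups.push((a, b));
                    // The rest of allocation sees only the first constraint.
                    ops[j].constraint = Constraint::AnyReg;
                }
            }
        }
    }
    fixups
}

fn main() {
    // inst v0 [use, fixed r0], v0 [use, fixed r1], v1 [def, any reg]
    let mut ops = vec![
        Operand { vreg: 0, constraint: Constraint::FixedReg(0) },
        Operand { vreg: 0, constraint: Constraint::FixedReg(1) },
        Operand { vreg: 1, constraint: Constraint::AnyReg },
    ];
    let fixups = normalize(&mut ops);
    assert_eq!(fixups, vec![(0, 1)]); // copy r0 -> r1 before the instruction
    assert_eq!(ops[1].constraint, Constraint::AnyReg);
}
```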
&lt;p&gt;We perform a similar rewrite for reused-input constraints. These seem
as if they would be fairly fundamental to the core allocation loop,
because they tie one decision to another; now we have to deal with
dependent allocation decisions. But we can do a simpler thing: we
&lt;em&gt;edit the liveranges&lt;&#x2F;em&gt; so that (i) the output that reuses the input has
a liverange that starts at the &lt;em&gt;early&lt;&#x2F;em&gt; (input) phase, and (ii) the
input has a liverange that ends just before the instruction, not
overlapping. (In other words, we shift both back by one program
point.) Then we insert a fixup move from input to output. The figure
below illustrates this rewrite.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      INITIAL&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         v0   v1    v2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         :    :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         |    |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  EARLY use  use&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      add v2, v0, v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  LATE             def reuse(0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                     :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                     &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      REWRITTEN&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         v0   v1    v2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         :    :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         |    |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                         |    |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  LATE   |    |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                              |    (implicit copy: v2 := v0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  EARLY      use   def&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      add v2, v0, v1                 |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                  LATE               |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                     :&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
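The rewrite in the figure amounts to a pair of liverange edits, shown here using a 2*inst (Before/Early) and 2*inst + 1 (After/Late) point numbering; all names are illustrative.

```rust
// Half-open liveranges over a 2*inst (Early) / 2*inst + 1 (Late) point
// numbering. The reused-input rewrite shifts the def back to the Early
// point and truncates the input to end just before the instruction; a
// fixup move from input to output is recorded separately.
#[derive(Debug, PartialEq)]
struct LiveRange { from: u32, to: u32 } // covers [from, to)

fn rewrite_reuse(input: &mut LiveRange, output: &mut LiveRange, inst: u32) {
    let early = 2 * inst;
    output.from = early; // def now also covers the instruction's Early point
    input.to = early;    // input now ends just before the instruction
}

fn main() {
    // v0 is originally live through the Late point of instruction 5, where
    // the reused-input def of v2 begins.
    let mut v0 = LiveRange { from: 0, to: 11 };
    let mut v2 = LiveRange { from: 11, to: 20 };
    rewrite_reuse(&mut v0, &mut v2, 5);
    assert_eq!(v0.to, 10);
    assert_eq!(v2.from, 10);
    assert!(v0.to <= v2.from); // the two ranges no longer overlap
}
```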
&lt;p&gt;One may object that this pessimizes all reused-input allocations --
haven&#x27;t we removed all knowledge of the constraint, so we will almost
always get different registers at input and output, and cause many new
moves to be inserted? The answer to this issue comes in the &lt;em&gt;bundle
merging&lt;&#x2F;em&gt;, which we discuss below (basically, we try to rejoin the two
parts if no overlap would result).&lt;&#x2F;p&gt;
&lt;p&gt;In general, this is a powerful technique: whenever some complexity
arises from a constraint or feature, it is best if the complexity can
be kept as close to the &lt;em&gt;outer boundary&lt;&#x2F;em&gt; of the system as
possible. Rewrites or lowerings into a simpler &quot;core form&quot; are common
in compilers, and it so happens that considering regalloc constraints
in this light is useful too.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#8&quot;&gt;8&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;step-2-bundles-and-merging&quot;&gt;Step 2: Bundles and Merging&lt;&#x2F;h3&gt;
&lt;p&gt;Once we have created a list of liveranges with constraints, we could
in theory begin to assign locations right away, finding available
locations that fulfill constraints and splitting where necessary to do
so. However, such an approach would almost certainly run more slowly,
and produce worse code, than most state-of-the-art allocators
today. Why is that?&lt;&#x2F;p&gt;
&lt;p&gt;A key observation about liveranges in real programs is that there are
&lt;em&gt;clusters of related liveranges connected by moves&lt;&#x2F;em&gt;. Several examples
are the liveranges on either side of an SSA block parameter (or
phi-node), or on either side of a move instruction, or the input and
reused-register-constrained output of an instruction.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#9&quot;&gt;9&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; These
liveranges often would benefit if they were in the same register: in
all three cases, it would mean one fewer move instruction in the final
program.&lt;&#x2F;p&gt;
&lt;p&gt;Processing such related liveranges together, as one unit of
allocation, would guarantee that they would be assigned the same
location. (If impossible, the merged liveranges could always be split
again.) Attaining this result some other way would require reasoning
about &quot;affinity&quot; for locations between related liveranges, which is a
much more complex question.&lt;&#x2F;p&gt;
&lt;p&gt;Furthermore, processing multiple liveranges together brings all the
usual efficiency benefits of batching: the more progress we can make
with a single decision, the faster the register allocator runs.&lt;&#x2F;p&gt;
&lt;p&gt;We thus define a &quot;bundle&quot; of liveranges as the unit of
allocation. After computing liveranges in the initial input program
scan, we merge liveranges into bundles according to a few simple
heuristics: across SSA block parameters, across move instructions, and
from inputs to outputs of instructions with &quot;reused-input&quot; constraints.&lt;&#x2F;p&gt;
&lt;p&gt;The one key invariant is: all liveranges in a bundle &lt;em&gt;must not
overlap&lt;&#x2F;em&gt;. We greedily grow a bundle with the above heuristics, testing
at each step whether another liverange can join.&lt;&#x2F;p&gt;
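A sketch of that merge test, assuming each bundle keeps its half-open liveranges sorted by start point:

```rust
// Two bundles may merge only if interleaving their (sorted, half-open)
// liveranges produces no overlap; a merge-walk over the two sorted lists
// checks this in linear time.
fn can_merge(a: &[(u32, u32)], b: &[(u32, u32)]) -> bool {
    let (mut i, mut j) = (0, 0);
    let mut last_end = 0;
    while i < a.len() || j < b.len() {
        // Take whichever list's next range starts first.
        let next = if j == b.len() || (i < a.len() && a[i].0 <= b[j].0) {
            let r = a[i]; i += 1; r
        } else {
            let r = b[j]; j += 1; r
        };
        if next.0 < last_end {
            return false; // overlaps a previously taken range
        }
        last_end = last_end.max(next.1);
    }
    true
}

fn main() {
    // Ranges that interlock exactly (e.g., on either side of a move) can
    // merge; genuinely overlapping ranges cannot.
    assert!(can_merge(&[(0, 5), (10, 15)], &[(5, 10)]));
    assert!(!can_merge(&[(0, 5)], &[(3, 8)]));
}
```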
&lt;p&gt;Beyond this point in the allocation process, we will reason about
bundles: we enqueue them in the priority workqueue, we process them
one at a time and assign locations or split. At the end of the
process, we&#x27;ll scan the liveranges in the bundle and assign each the
location that the bundle received.&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    CORE ALLOCATION PROBLEM:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         bundle0              bundle1               bundle2          bundle3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           |                    |                                       |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           |                    |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                |                     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                |                     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           |                                                            |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ==&amp;gt;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           bundle0: r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           bundle1: r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           bundle2: r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;           bundle3: r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;&lt;h3 id=&quot;step-3-assignment-loop-and-splitting-heuristics&quot;&gt;Step 3: Assignment Loop and Splitting Heuristics&lt;&#x2F;h3&gt;
&lt;p&gt;The heart of the allocator is the main loop that &lt;em&gt;allocates locations
to bundles&lt;&#x2F;em&gt;. This is at least conceptually simple: pull a bundle off
the queue, &quot;probe&quot; potential locations one at a time to see whether the
bundle fits (i.e., has no overlap with spans for which that location is
already reserved), and assign it the first place it fits. But there is
significant complexity in the details, as always.&lt;&#x2F;p&gt;
&lt;p&gt;The key data structures are: (i) an &quot;allocation map&quot; for each physical
register, kept as a BTree for fast lookups, that indicates whether the
register is free or occupied at any program point and the liverange
that occupies it; and (ii) a queue of bundles to process. (The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;doc&#x2F;DESIGN.md&quot;&gt;design
document&lt;&#x2F;a&gt;
describes several others, such as the second-chance allocation queue
and the structures used for stackslots, which we skip here for
simplicity.)&lt;&#x2F;p&gt;
&lt;p&gt;The core part of the allocator&#x27;s processing occurs here: we pull one
bundle at a time from the queue and attempt to place it in one of the
registers (again we&#x27;re ignoring stackslot constraints for simplicity).&lt;&#x2F;p&gt;
&lt;p&gt;For each bundle, we can perform one of the following actions:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;If we find a register with no overlapping allocations already in
place, we can allocate the bundle to the register; then we&#x27;re done!
This is the best case.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Otherwise, we can pick a register occupied by bundles with a lower
&quot;spill cost&quot; (computed as a sum of heuristic values for each
use of a liverange in a bundle) and &lt;em&gt;evict&lt;&#x2F;em&gt; those already-allocated
bundles, punting them back to the queue, then put our present bundle
in this register instead. We do this only if the present bundle has
a higher spill cost.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;If this is also not an option, we can split our present bundle into
pieces and try again. Heuristically, we find it works well to split
at the first conflict point; in other words, allocate as much as
would have fit in any register, and then put the remainder back in
the queue.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;   TO ALLOCATE:                  GIVEN:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       bundle0                     r0     r1    r2       r3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                          |b1          |b4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                          |            |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                                       |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |b2          |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |            |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                                 |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                               |b3     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                               |       |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    OPTION 1: Take a free register (r3)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        - Possible if no overlap. Easiest option!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    OPTION 2: Evict, if bundle0&amp;#39;s spill cost is higher than evicted bundles&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;              and if no completely free register exists:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;       bundle1  bundle2            r0     r1    r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                          |b0          |b4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         |                          |            |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |            |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 |                               |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 |                               |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                                 |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |b0  |b3     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |    |       |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                 &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        (b1 and b2 are re-enqueued)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;         &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    OPTION 3: Split!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                      &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                       &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                   r0     r1    r2 &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                              --&amp;gt;   |b1   |b0    |b4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |     |      |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                          |      |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |b2          |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |            |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                                 |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                              --&amp;gt;   |b0  |b3     |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                    |    |       |&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
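The decision structure of this loop can be sketched as follows. This is a heavily simplified model: real probing consults a per-register BTree, victims are ranked more carefully, and the split point is clamped into the bundle's own span; all of the types here are illustrative.

```rust
// Illustrative bundle: a spill weight and sorted half-open ranges.
struct Bundle { weight: f32, ranges: Vec<(u32, u32)> }

// Illustrative allocation map per physical register:
// (from, to, occupying bundle id, that bundle's weight).
type RegMap = Vec<(u32, u32, usize, f32)>;

#[derive(Debug, PartialEq)]
enum Decision {
    Assign(usize),            // register index
    Evict(usize, Vec<usize>), // register index, victim bundle ids
    Split(u32),               // first conflict point
}

fn overlaps(a: (u32, u32), b: (u32, u32)) -> bool { a.0 < b.1 && b.0 < a.1 }

fn decide(bundle: &Bundle, regs: &[RegMap]) -> Decision {
    let mut best_evict: Option<(usize, f32, Vec<usize>)> = None;
    let mut first_conflict = u32::MAX;
    for (preg, map) in regs.iter().enumerate() {
        let mut victims = Vec::new();
        let mut max_weight = 0.0f32;
        for &(from, to, id, w) in map {
            if bundle.ranges.iter().any(|&r| overlaps(r, (from, to))) {
                victims.push(id);
                max_weight = max_weight.max(w);
                first_conflict = first_conflict.min(from);
            }
        }
        if victims.is_empty() {
            return Decision::Assign(preg); // Option 1: a conflict-free register
        }
        victims.sort();
        victims.dedup();
        if best_evict.as_ref().map_or(true, |&(_, w, _)| max_weight < w) {
            best_evict = Some((preg, max_weight, victims));
        }
    }
    // Option 2: evict cheaper bundles, only if our weight strictly dominates.
    if let Some((preg, w, victims)) = best_evict {
        if bundle.weight > w {
            return Decision::Evict(preg, victims);
        }
    }
    // Option 3: split at the first conflict and re-enqueue the pieces.
    Decision::Split(first_conflict)
}

fn main() {
    let bundle = Bundle { weight: 2.0, ranges: vec![(0, 10)] };
    let regs = vec![vec![(5, 8, 7, 1.0)], vec![]];
    assert_eq!(decide(&bundle, &regs), Decision::Assign(1));
    let regs = vec![vec![(5, 8, 7, 1.0)]];
    assert_eq!(decide(&bundle, &regs), Decision::Evict(0, vec![7]));
    let cheap = Bundle { weight: 0.5, ranges: vec![(0, 10)] };
    assert_eq!(decide(&cheap, &regs), Decision::Split(5));
}
```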
&lt;p&gt;The presence of &lt;em&gt;eviction&lt;&#x2F;em&gt; as an option is what makes regalloc2 a
&lt;em&gt;backtracking&lt;&#x2F;em&gt; allocator. It&#x27;s not obvious that the allocator will
ever finish its job if it is allowed to undo work. In fact &lt;em&gt;many&lt;&#x2F;em&gt;
bundles may be evicted in order to place just &lt;em&gt;one&lt;&#x2F;em&gt; bundle --
isn&#x27;t this backward progress?&lt;&#x2F;p&gt;
&lt;p&gt;The key to maintaining forward progress is that we &lt;em&gt;only evict bundles
of lower spill weight&lt;&#x2F;em&gt;, together with the fact that &lt;em&gt;spill weight
monotonically decreases when splitting&lt;&#x2F;em&gt;. Eventually, if bad luck
continues far enough, a bundle will be split into individual pieces
around each use, and these can always be allocated because (if the
input program does not have fundamentally conflicting constraints on
one instruction) these single-use bundles have the lowest possible
spill weight.&lt;&#x2F;p&gt;
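A tiny illustration of why splitting can only decrease spill weight, assuming (as a simplification of the real heuristic) that a bundle's weight is the sum of nonnegative per-use weights:

```rust
// Splitting partitions a bundle's uses between the pieces, so no piece
// can weigh more than the whole: weight is monotone under splitting.
fn spill_weight(use_weights: &[f32]) -> f32 {
    use_weights.iter().sum()
}

fn main() {
    let uses = [3.0, 1.0, 2.5, 0.5];
    let total = spill_weight(&uses);
    let (a, b) = uses.split_at(2); // split the bundle between two uses
    assert!(spill_weight(a) <= total);
    assert!(spill_weight(b) <= total);
    // Nothing is gained or lost by the split itself:
    assert!((spill_weight(a) + spill_weight(b) - total).abs() < 1e-6);
}
```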
&lt;h3 id=&quot;step-4-move-handling&quot;&gt;Step 4: Move Handling&lt;&#x2F;h3&gt;
&lt;p&gt;Finally, once we have a series of locations assigned to each bundle,
we have &quot;solved the problem&quot;, but... we still need to convey our
solution back to the real world, where a compiler is waiting for us to
provide a series of move, load, and store instructions to place values
into the right spots.&lt;&#x2F;p&gt;
&lt;p&gt;We split the overall problem into two pieces for the usual simplicity
reasons: first, we allow ourselves to cut liveranges into as many
pieces as needed, and put each piece in a different place, at a single
instruction granularity. We assume that we can edit the program
somehow to connect these pieces back up. That allowed the above
liverange&#x2F;bundle processing to become a tractable problem for a solver
core to handle. Now, we need to connect those liverange fragments. This
is the second half of the problem: generating moves.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;all-in-one-liverange-connectors-program-moves-and-edge-moves&quot;&gt;All-in-One: Liverange Connectors, Program Moves, and Edge Moves&lt;&#x2F;h4&gt;
&lt;p&gt;The abstract model for the input to this stage of the allocator is
that between each pair of instructions, we perform some &lt;em&gt;arbitrary
permutation&lt;&#x2F;em&gt; of liveranges in locations. One way to see this
permutation is as a &lt;em&gt;parallel move&lt;&#x2F;em&gt;: a data-movement action that reads
values in all of their old locations (inputs of the permutation), then
in parallel, writes the values to all of their new locations (outputs
of the permutation).&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        EARLY&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    inst1      r2, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        LATE&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          { r4 := r0 }              &amp;lt;--- regalloc-inserted moves&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        EARLY&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    inst2      r0, r2, r3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        LATE&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          { r6 := r5, r5 := r6 }    &amp;lt;--- multiple moves in parallel!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;                                         (arbitrary permutations)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;          &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        EARLY&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    inst3      r5, r4, r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;        LATE&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is why we make a distinction between the &quot;After&quot; point of
instruction &lt;em&gt;i&lt;&#x2F;em&gt; and the &quot;Before&quot; point of instruction &lt;em&gt;i+1&lt;&#x2F;em&gt;, though a
traditional compiler textbook would tell you that there is only one
program point between a pair of instructions. We have two, and between
these two program points lies the parallel move.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#10&quot;&gt;10&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The process for generating these moves is: we scan liveranges, finding
points at which they have been split into pieces where the value must
flow from one piece to the next. We also account for CFG edges and
block parameters at this point, as well as for move instructions in
the input program. Once we have accumulated the set of moves that must
happen, in parallel, at a given priority at a given location, we
resolve these into a sequence of individual move&#x2F;load&#x2F;store
instructions using the algorithm we describe in the next section.&lt;&#x2F;p&gt;
&lt;p&gt;One thing to note about this design is that we are handling &lt;em&gt;all&lt;&#x2F;em&gt;
value movement in the program with a single resolution mechanism:
regalloc-induced movement but also movement that was present in the
original program. This is valuable because it allows the moves to be
handled more efficiently. In contrast, we have observed issues in the
past in allocators that lower moves in stages -- e.g., SSA block
parameters to moves prior to regalloc, then regalloc-induced moves
during regalloc -- where chains of moves occur because each level of
abstraction is not aware of what other levels below or above it are
doing.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;parallel-move-resolution&quot;&gt;Parallel Move Resolution&lt;&#x2F;h4&gt;
&lt;p&gt;The actual problem of resolving a permutation such as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    { r0 := r1 ; r1 := r2 ; r2 := r0 }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;into a sequence of moves&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    scratch := r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    r0 := r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    r1 := r2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    r2 := scratch&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;is a well-studied one, and is known as the &quot;parallel moves
problem&quot;. The crux of the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;moves.rs&quot;&gt;solution&lt;&#x2F;a&gt;
is to understand the permutation as a kind of dependency graph, and
sort its moves so that we pull an old value out of a given register
before overwriting it. When we encounter a cycle, we can use a scratch
register as above.&lt;&#x2F;p&gt;
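&lt;p&gt;As a concrete (and much-simplified) sketch of this idea -- not
regalloc2&#x27;s actual &lt;code&gt;moves.rs&lt;&#x2F;code&gt;, and with illustrative
names -- the resolver below emits any move whose destination is no
longer read by a pending move, and when none exists (a cycle), saves
one about-to-be-clobbered value to a scratch location and redirects its
readers:&lt;&#x2F;p&gt;

```rust
// A minimal sketch (not regalloc2's actual moves.rs) of sequentializing
// a parallel move. `Loc` and `resolve_parallel_moves` are illustrative.

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum Loc {
    Reg(u8),
    Scratch,
}

/// Turn a parallel move `{dst := src, ...}` (at most one writer per
/// destination) into an ordered sequence, breaking cycles with `Scratch`.
fn resolve_parallel_moves(mut pending: Vec<(Loc, Loc)>) -> Vec<(Loc, Loc)> {
    let mut out = Vec::new();
    while !pending.is_empty() {
        // Emit any move whose destination no other pending move still reads.
        if let Some(i) = (0..pending.len()).find(|&i| {
            pending
                .iter()
                .enumerate()
                .all(|(j, &(_, src))| j == i || src != pending[i].0)
        }) {
            out.push(pending.remove(i));
        } else {
            // Every destination is still read: we are in a simple cycle.
            // Save one about-to-be-clobbered value, then redirect readers.
            let (dst, _) = pending[0];
            out.push((Loc::Scratch, dst));
            for m in pending.iter_mut() {
                if m.1 == dst {
                    m.1 = Loc::Scratch;
                }
            }
        }
    }
    out
}
```

&lt;p&gt;Simulating the emitted sequence against the original parallel
semantics is an easy way to sanity-check a resolver like this.&lt;&#x2F;p&gt;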
&lt;p&gt;One might think that something like &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Tarjan%27s_strongly_connected_components_algorithm&quot;&gt;Tarjan&#x27;s
algorithm&lt;&#x2F;a&gt;
for finding strongly-connected components is needed, but in fact there
is a nice property of the problem that greatly simplifies it. Because
any valid permutation has at most one writer for any given register,
we can &lt;em&gt;only have simple cycles&lt;&#x2F;em&gt; of moves, with other uses of old
values in the cycle handled before realizing the cyclic move. Some
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;moves.rs#L92-L103&quot;&gt;more
description&lt;&#x2F;a&gt;
is available in our implementation. In fact, this is such a nice
observation that we later discovered &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hal.inria.fr&#x2F;inria-00289709&#x2F;document&quot;&gt;a
paper&lt;&#x2F;a&gt; by Rideau et
al. that names the resulting dependency graphs &quot;windmills&quot; for their
shape (see figure below -- there can be a simple cycle in the middle,
and only acyclic outward moves from cycle elements in a tree of
outward shifts) and, delightfully, describes more or less the same
algorithm to &quot;tilt at windmills&quot; and resolve the moves.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;cranelift-regalloc2-fig4.svg&quot; alt=&quot;Figure: &amp;quot;Windmills&amp;quot; in a register movement graph&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;scratch-registers-and-cycles&quot;&gt;Scratch Registers and Cycles&lt;&#x2F;h4&gt;
&lt;p&gt;The above algorithm works, but has one serious drawback: it requires a
scratch register whenever we have a cyclic move. The simplest approach
to this requirement is to set aside one register permanently (or
actually, one per &quot;register class&quot;: e.g., an integer register and a
float&#x2F;vector register). Especially on ISAs with relatively few
registers, like x86-64 with 16 each of integer and float registers,
this can impact performance by increasing register pressure and
forcing more spills.&lt;&#x2F;p&gt;
&lt;p&gt;We thus came up with a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;51&quot;&gt;scheme&lt;&#x2F;a&gt; to
allow use of all registers but still find a scratch when needed for a
cyclic move. The approach begins with an idea &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;de15f9c109f9c474d00faf8032f559c236067c06&#x2F;js&#x2F;src&#x2F;jit&#x2F;BacktrackingAllocator.cpp#2582&quot;&gt;borrowed from
IonMonkey&lt;&#x2F;a&gt;,
namely to look for a free register to use as a scratch by actually
probing the allocation maps. This often works: the need for a cyclic
move doesn&#x27;t necessarily imply that we will have high register
pressure, and so there are often plenty of free registers available.&lt;&#x2F;p&gt;
&lt;p&gt;What if that doesn&#x27;t work, though? In the above PR, we take another
seemingly-simplistic approach: we use a stackslot as the scratch
instead! This means that we will resolve the cyclic move into a
sequence including stores and loads, but this is fine, because we&#x27;re
already in a situation where all registers are full and we need to
spill &lt;em&gt;something&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;We&#x27;re not quite done, though: there is another very important use of
the scratch register in a simplistic design, namely to resolve
memory-to-memory moves! This arises because our move resolution
handles both registers and stackslots in a uniform way, so some cycle
elements may be stackslots (memory locations). Using a stackslot as a
scratch above just compounds the problem. So we translate, in a
separate second phase, memory-to-memory moves into a &lt;em&gt;pair&lt;&#x2F;em&gt; of a load
(from memory into scratch) and a store (from scratch into memory).&lt;&#x2F;p&gt;
&lt;p&gt;So to recap, we may find a cyclic move permutation to be necessary,
and no registers to be free to use as scratch; so we use a stackslot
instead. But some of the original move cycle may have been between
stackslots, so we need &lt;em&gt;another&lt;&#x2F;em&gt; scratch to make these
stackslot-to-stackslot moves possible. But we&#x27;re already out of
scratch registers!&lt;&#x2F;p&gt;
&lt;p&gt;The solution to this last issue is that we can do a last-ditch
emergency spill of &lt;em&gt;any&lt;&#x2F;em&gt; register, just for the duration of one
move. So we pick a &quot;victim&quot; register of the right kind (integer or
float), spill it to a second stackslot, use this victim register for a
memory-to-memory move (a load and store pair), then reload the victim.&lt;&#x2F;p&gt;
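&lt;p&gt;The worst-case expansion can be sketched as follows (a sketch of
the strategy with hypothetical names, not regalloc2&#x27;s code): one
stackslot-to-stackslot move becomes four instructions centered on a
borrowed &quot;victim&quot; register.&lt;&#x2F;p&gt;

```rust
// Hypothetical sketch of the last-ditch fallback: with no free register,
// a stackslot-to-stackslot move expands into four steps around a borrowed
// "victim" register. `Alloc` and the function name are illustrative.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Alloc {
    Reg(u8),   // physical register number
    Slot(u32), // spillslot index
}

/// Expand `Slot(dst) := Slot(src)` using `victim` as a borrowed register,
/// preserved in `save_slot` for the duration of the move.
fn expand_mem_to_mem(dst: u32, src: u32, victim: u8, save_slot: u32) -> Vec<(Alloc, Alloc)> {
    vec![
        (Alloc::Slot(save_slot), Alloc::Reg(victim)), // spill the victim
        (Alloc::Reg(victim), Alloc::Slot(src)),       // load the moved value
        (Alloc::Slot(dst), Alloc::Reg(victim)),       // store it to its home
        (Alloc::Reg(victim), Alloc::Slot(save_slot)), // reload the victim
    ]
}
```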
&lt;p&gt;This cascading series of solutions, each a little more complex but a
little rarer, is an example of a complexity-for-performance
tradeoff. Overall, it is far better to allow the program to use all
registers; this will reduce spills. And most parallel moves are &lt;em&gt;not&lt;&#x2F;em&gt;
cyclic, so scratch registers are rarely needed anyway. And when a
cyclic move &lt;em&gt;is&lt;&#x2F;em&gt; needed, we often have a free register, because this
condition is mostly orthogonal to high register pressure. It is only
when all of the bad cases line up -- cycle, no free registers, and
memory-to-memory moves -- that we reach for the highest-cost approach
(decomposing one move into four), and so the most important aspect of
this fallback is not that it is fast but that it is correct and can
handle all cases.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;everything-else&quot;&gt;Everything Else&lt;&#x2F;h3&gt;
&lt;p&gt;This has been a not-so-whirlwind tour of the allocator pipeline in
regalloc2, but despite my longwindedness, we had to skip many details!
For example, the way in which stackslots are allocated for spilled
values, the way in which split pieces of a single original bundle
share a single spill location (&quot;spill bundles&quot;), the way in which we
clean up after move insertion with Redundant Move Elimination (a sort
of abstract interpretation that tracks symbolic locations of values),
and more, are skipped here but are all described in the design
document. One could truly write a book on the engineering of a
register allocator, but the above will have to suffice; now, we must
move on and draw some lessons!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;four-lessons&quot;&gt;Four Lessons&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;performance&quot;&gt;Performance&lt;&#x2F;h3&gt;
&lt;h4 id=&quot;cache-locality-and-scans&quot;&gt;Cache Locality and Scans&lt;&#x2F;h4&gt;
&lt;p&gt;One enduring theme in the regalloc2 architecture is &lt;em&gt;data structure
design for performance&lt;&#x2F;em&gt;. As I began the project by transliterating
IonMonkey code, building Rust equivalents to the data structures in
the original C++, I found several things:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The original data structures were heavily &lt;em&gt;pointer-linked&lt;&#x2F;em&gt;. For
example, liveranges within bundles and uses within liveranges were
kept as linked lists, to allow for fast insertion and removal in the
middle, and splicing. A linked list is the classical CS answer to
these requirements.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;There were quite a few linear-time queries of these data
structures. For example, when generating moves between liveranges of
a virtual register, a scan would traverse the linked list of these
liveranges, observe the range covering one end of a control-flow
transition, and do a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;7751fef9eeb3db0a07ae4680daa2a62bd8f49882&#x2F;js&#x2F;src&#x2F;jit&#x2F;BacktrackingAllocator.cpp#2196&quot;&gt;&lt;em&gt;linear-time
scan&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;
(through the linked list) for the liverange at the other end!&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;These two design trends combine to make CPU caches exceptionally
unhappy. First there is the algorithmic inefficiency, then there is
the cache-unfriendly demand access to random liveranges, each of which
is a pointer-chasing scan.&lt;&#x2F;p&gt;
&lt;p&gt;regalloc2 adopts two general themes that work against these problems:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The overall data structure design consists of &lt;em&gt;contiguous-in-memory
inline structs&lt;&#x2F;em&gt; rather than linked lists. For example, the list of
liveranges in a bundle is a &lt;code&gt;SmallVec&amp;lt;[LiveRangeListEntry; 4]&amp;gt;&lt;&#x2F;code&gt;,
i.e. a list with up to four entries inline and otherwise
heap-allocated, and the entry struct contains the program-point
range inline. Combining this more compact layout with certain
&lt;em&gt;invariants&lt;&#x2F;em&gt; -- usually, some sort of sorted-order invariant --
allows for efficient lookups and list merges even without
linked-list splicing.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;At a higher level, regalloc2 tries to &lt;em&gt;avoid random lookups as much
as possible&lt;&#x2F;em&gt;. Sometimes this is unavoidable, but where it is not, a
linear scan that produces some output as it goes is much more
cache-friendly.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;It is worth examining the particular technique we use to resolve moves
across control-flow edges. This requires looking up where a virtual
register is allocated at either end of the edge -- two arbitrary
points in the linear sequence of instructions. The problem is solved
in IonMonkey (as we linked above) by scanning over ranges to find
basic block ends and then doing a linear-time linked-list traversal to
find the &quot;other end&quot;, for overall quadratic time.&lt;&#x2F;p&gt;
&lt;p&gt;Instead we scan the liveranges for a virtual register once and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;ion&#x2F;moves.rs#L126-L164&quot;&gt;produce &quot;half-moves&quot; into a
&lt;code&gt;Vec&lt;&#x2F;code&gt;.&lt;&#x2F;a&gt;
These &quot;half-moves&quot; are records of either the &quot;source&quot; side of a move,
at the origin point of a CFG edge, or the &quot;destination&quot; side of a
move, at the destination point of a CFG edge. After our single scan,
we sort the list of half-moves by a key (the vreg and destination
block) so that the source and destination(s) appear together. We can
then scan &lt;em&gt;this&lt;&#x2F;em&gt; list once and generate all moves in bulk.&lt;&#x2F;p&gt;
&lt;p&gt;If that sounds something like
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;MapReduce&quot;&gt;MapReduce&lt;&#x2F;a&gt;, that is not an
accident: the technique of leveraging a sort with a well-chosen key
was invented to allow for efficient parallel computation, and here
allows the two &quot;ends&quot; of the move to be processed independently.&lt;&#x2F;p&gt;
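&lt;p&gt;A simplified sketch of the technique (the names here are
hypothetical, not regalloc2&#x27;s): one record type carries either end of
a move, the derived ordering makes &lt;code&gt;Source&lt;&#x2F;code&gt; sort before
&lt;code&gt;Dest&lt;&#x2F;code&gt; within each (block, vreg) key, and a single pass
over the sorted list pairs the ends up:&lt;&#x2F;p&gt;

```rust
// Illustrative sketch of the "half-move" technique; names are hypothetical.
// One linear scan emits Source/Dest records; a sort brings each pair together.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
enum Side {
    Source, // declared first so it sorts before Dest within a key
    Dest,
}

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct HalfMove {
    to_block: u32, // destination block of the CFG edge
    vreg: u32,     // virtual register flowing across the edge
    side: Side,    // which end of the move this record describes
    alloc: u32,    // the allocation (register/slot) at this end
}

/// Sort half-moves so each (block, vreg) source meets its destination(s),
/// then emit one full (to, from) move per destination. Assumes well-formed
/// input: every key group contains a Source record.
fn resolve_half_moves(mut halves: Vec<HalfMove>) -> Vec<(u32, u32)> {
    halves.sort(); // key order: (to_block, vreg, side), Source first
    let mut moves = Vec::new();
    let mut src = None;
    for h in &halves {
        match h.side {
            Side::Source => src = Some(h.alloc),
            Side::Dest => moves.push((h.alloc, src.expect("source precedes dests"))),
        }
    }
    moves
}
```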
&lt;p&gt;This technique provides better algorithmic efficiency, much better
cache residency (we have two steps that boil down to &quot;scan input list
linearly and produce output list linearly&quot;), and leans on the
standard-library implementation of &lt;code&gt;sort()&lt;&#x2F;code&gt;, which is likely to be
faster than anything we can come up with. Profiling of regalloc2 runs
shows sometimes up to 10% or so of runtime spent in &lt;code&gt;sort()&lt;&#x2F;code&gt;, but this
is far better than the alternative, in which we do a random
pointer-chasing lookup at every step.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;compact-data&quot;&gt;Compact Data&lt;&#x2F;h4&gt;
&lt;p&gt;Another lesson learned over and over during regalloc2 optimization is
this: data compactness matters! A single &lt;code&gt;struct&lt;&#x2F;code&gt; growing from 16 to
24 bytes could lead to significant slowdowns if a large input leads to
allocation and traversals over an array of 10,000 such structs. Every
improvement in memory footprint is a reduction in cache misses.&lt;&#x2F;p&gt;
&lt;p&gt;We play many games with bitpacking to achieve this. For example,
regalloc2 puts its &lt;code&gt;Operand&lt;&#x2F;code&gt; in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;lib.rs#L407-L422&quot;&gt;32
bits&lt;&#x2F;a&gt;,
and this includes a virtual register number, a constraint, a physical
register number to possibly go with that constraint, the position
(early&#x2F;late), kind (def&#x2F;use), and register class of the operand. Some
of this optimization requires compromise: as a result of our encoding
scheme, for example, we can allow only 2M (2&lt;sup&gt;21&lt;&#x2F;sup&gt;) virtual
registers per function body. But in practice most applications will have
other limits that take effect before this matters. (And in any case,
many compilers play these same sorts of tricks, so megabytes-large
function bodies are problematic in all sorts of ways.) And we sometimes
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;13&quot;&gt;find ways to pack a few more
bits&lt;&#x2F;a&gt; (more such
PRs are always welcome!).&lt;&#x2F;p&gt;
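&lt;p&gt;To illustrate the flavor of this bitpacking (with a made-up layout,
not regalloc2&#x27;s exact bit assignment), a 21-bit vreg index and
several small enum fields fit comfortably in a
&lt;code&gt;u32&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;

```rust
// An illustrative (not regalloc2's exact) 32-bit operand packing:
// bits 0-20 vreg index, 21-22 register class, 23 kind (def/use),
// 24 position (early/late), 25-29 constraint kind.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct PackedOperand(u32);

impl PackedOperand {
    fn new(vreg: u32, class: u32, kind: u32, pos: u32, constraint: u32) -> Self {
        // Each field must fit its allotted bits.
        assert!(vreg < (1 << 21) && class < 4 && kind < 2 && pos < 2 && constraint < 32);
        PackedOperand(vreg | class << 21 | kind << 23 | pos << 24 | constraint << 25)
    }
    fn vreg(self) -> u32 { self.0 & ((1 << 21) - 1) }
    fn class(self) -> u32 { (self.0 >> 21) & 3 }
    fn kind(self) -> u32 { (self.0 >> 23) & 1 }
    fn pos(self) -> u32 { (self.0 >> 24) & 1 }
    fn constraint(self) -> u32 { (self.0 >> 25) & 31 }
}
```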
&lt;p&gt;We play similar tricks with program points, spill weights (we &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;ion&#x2F;liveranges.rs#L55-L70&quot;&gt;store
them as
bfloat16&lt;&#x2F;a&gt;
because spill weights need not be too precise, only relatively
comparable, and using only 16 bits lets us pack some flags in the
upper 16 and save a &lt;code&gt;u32&lt;&#x2F;code&gt;), and more.&lt;&#x2F;p&gt;
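&lt;p&gt;The bfloat16 trick itself is tiny: keep only the top 16 bits of
the &lt;code&gt;f32&lt;&#x2F;code&gt;, which preserves ordering among non-negative
weights. A sketch (not the exact regalloc2 code):&lt;&#x2F;p&gt;

```rust
// Sketch of the bfloat16 representation: a bf16 is just the top 16 bits
// of an f32 (sign, full exponent, truncated mantissa). For non-negative
// floats the truncated bit patterns still compare in the same order,
// which is all a spill weight needs.

fn to_bf16(w: f32) -> u16 {
    (w.to_bits() >> 16) as u16
}

fn from_bf16(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}
```

&lt;p&gt;Two such truncated weights can then share a &lt;code&gt;u32&lt;&#x2F;code&gt;
with 16 bits of flags, exactly the kind of saving described
above.&lt;&#x2F;p&gt;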
&lt;p&gt;Finally, trading off indirection and data-inlining is important: e.g.,
a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;ion&#x2F;data_structures.rs#L89-L94&quot;&gt;&lt;code&gt;LiveRangeList&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
keeps the program-point range (32 + 32 bits) inline, then a 32-bit
index to indirect to everything else about the liverange, because
checking for bundle overlap is the most common reason for traversing
this list and reducing cache misses in this inner loop is paramount.&lt;&#x2F;p&gt;
&lt;h4 id=&quot;reducing-work&quot;&gt;Reducing Work&lt;&#x2F;h4&gt;
&lt;p&gt;One final performance technique that at once both sounds completely
obvious and superficial, yet is quite powerful, is: &quot;simply do less
work!&quot;&lt;&#x2F;p&gt;
&lt;p&gt;One can often get lost in profiler results, wondering how to shave off
some hotspots by compacting some data or reworking some inner-loop
logic, only to miss that one is implicitly assuming that the actual
computation to be done is invariant. In other words, one might look
for the fastest way to compute a particular subproblem or framing of
the problem, rather than the ultimate problem at hand (in this case,
the register allocation).&lt;&#x2F;p&gt;
&lt;p&gt;In the case of regalloc2, this primarily means that we can improve
performance by &lt;em&gt;reducing the number of bundles and liveranges&lt;&#x2F;em&gt;. In
turn, this means that we can get outsized wins by improving our
merging and splitting heuristics.&lt;&#x2F;p&gt;
&lt;p&gt;Early in the optimization push, I realized that regalloc2 was often
finding an abnormally large number of conflicts between bundles, and
splitting far too aggressively. It turned out that the liveness
analysis was initially &lt;em&gt;approximate&lt;&#x2F;em&gt;, in an intentional, if premature,
efficiency tradeoff to avoid a fixpoint loop in favor of a single-pass
loop-backedge-based algorithm that overapproximated liveness (which is
fine for correctness). The time that this saved was more than offset
by the large increase in iterations of the bundle processing loop. So
I reworked this into a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;b78ccbce6e5700bc1ed2356bbb1d3221de49a353&#x2F;src&#x2F;ion&#x2F;liveranges.rs#L327-L401&quot;&gt;precise
analysis&lt;&#x2F;a&gt;
that iterates until fixpoint. It is worthwhile to pay that extra
analysis cost upfront to get exact liveness in order to make our lives
(and our runtime) better later.&lt;&#x2F;p&gt;
&lt;p&gt;The way in which we compute that precise liveness itself also raises
an interesting way of reducing work: by carefully choosing
invariants. We perform the liverange-building scan in such a way that
we &lt;em&gt;always observe liveranges in (reverse) program order&lt;&#x2F;em&gt;. This lets
us build the liverange data structures, which are normally sorted,
with simple appends, merging with contiguous sections from adjacent
blocks. This is in contrast to the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;70cf6863bd85af2a3188ec1fe5209a3ec1b2de86&#x2F;js&#x2F;src&#x2F;jit&#x2F;BacktrackingAllocator.cpp#263-340&quot;&gt;original IonMonkey allocator&#x27;s
equivalent
function&lt;&#x2F;a&gt;
to add liveranges during analysis, which essentially does an insertion
sort and merge, leading to O(n²) behavior. Note that the IonMonkey
code has a &lt;code&gt;CoalesceLimit&lt;&#x2F;code&gt; constant that caps the O(n²) behavior at
some fixed limit. In contrast our liverange build in regalloc2 is
always linear-time.&lt;&#x2F;p&gt;
&lt;p&gt;The final way in which one can reduce work, related to data-structure
and invariant choice, is by designing the input (API or data format)
correctly in order to efficiently encode the problem. The register
allocator that preceded regalloc2, regalloc.rs, did not have a notion
of register constraints in instructions&#x27; use of virtual
registers. Instead, it required the user to use move instructions:
reused-input constraints become a move prior to the instruction, and
fixed-register constraints become moves to&#x2F;from physical registers. It
then relied on a separate move-elision analysis to try to eliminate
these moves. regalloc2 has a smaller input because constraints are
carried on every operand. It can still generate these moves when
needed, but they often are not. This results in faster allocation as
well as often better generated code.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;correctness-design-for-test-and-fuzzing-first-development&quot;&gt;Correctness: &quot;Design for Test&quot; and Fuzzing-First Development&lt;&#x2F;h3&gt;
&lt;p&gt;The next set of lessons to come from regalloc2 have to do with &lt;em&gt;how to
attain correctness in complex programs&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;I believe that regalloc2 is maybe the most &lt;em&gt;intrinsically complex&lt;&#x2F;em&gt;
program I have written: its operation relies on many interlocking
invariants across the allocation process, and there are many, many
edge cases to get right. It is &amp;gt;10K lines of very dense Rust
code. There should be approximately zero chance for any human to get
this correct, working on real inputs, in any reasonable timeframe. And
relying on something this complex to uphold security guarantees that
rely on correct compilation should be terrifying.&lt;&#x2F;p&gt;
&lt;p&gt;And yet somehow it seems to work, and we haven&#x27;t found any miscompiles
caused by RA2 itself since we switched Cranelift to use regalloc2 in
April. More broadly, there was
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;54&quot;&gt;one&lt;&#x2F;a&gt; issue
where constraints generated by Cranelift could not be handled in some
cases, resulting in a panic&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#11&quot;&gt;11&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;; and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;56&quot;&gt;another&lt;&#x2F;a&gt; where
spillslots were not reused as they should be, resulting in worse
performance; neither could result in incorrect generated code. In the
integration of RA2 into Cranelift, there were
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;4042&quot;&gt;two&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;4044&quot;&gt;bugs&lt;&#x2F;a&gt; that
could, but both were found within 24 hours by the fuzzers. (That
doesn&#x27;t mean there won&#x27;t be any more of course -- but things have been
surprisingly boring and quiet!)&lt;&#x2F;p&gt;
&lt;p&gt;The main superpower, if one can call it that, that enabled this to
work out is &lt;em&gt;fuzzing&lt;&#x2F;em&gt;. And in particular, a step-by-step approach to
fuzzing in which I built fuzzing oracles, test harnesses, and fuzz
targets as I built the allocator itself, and drove development with
it. Until about four months in, when I wired up the first version of the
Cranelift integration, regalloc2 had &lt;em&gt;only&lt;&#x2F;em&gt; ever performed register
allocation for fuzz-target-generated inputs. It still doesn&#x27;t have a
test harness for manually-written tests; there seems to be no need, as
the fuzzer is remarkably prescient at finding bugs.&lt;&#x2F;p&gt;
&lt;p&gt;I find it helpful to think of this philosophy in terms of the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Design_for_testing&quot;&gt;design-for-test&lt;&#x2F;a&gt;
idea from digital hardware design. In brief, the idea is that one
builds additional features or interfaces into the hardware
specifically so its internal state is visible and it can be tested in
controlled, systematic ways.&lt;&#x2F;p&gt;
&lt;p&gt;The first thing that I built in the regalloc2 tree was a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;fuzzing&#x2F;func.rs#L300-L308&quot;&gt;function
body
generator&lt;&#x2F;a&gt;
that produces arbitrary control flow, either reducible or irreducible,
and arbitrary uses and defs according to what SSA allows. I then built
an &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;src&#x2F;ssa.rs&quot;&gt;SSA
validator&lt;&#x2F;a&gt;,
and finally, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;fuzz&#x2F;fuzz_targets&#x2F;ssagen.rs&quot;&gt;fuzzed one against the
other&lt;&#x2F;a&gt;. This
way I built confidence that I had fuzzing input that included
interesting edge cases. This would become an important tool for
testing the whole allocator, but it was important to &quot;test the tester&quot;
first and cross-check it against SSA&#x27;s requirements. Of course,
checking SSA requires one to compute flowgraph dominance on the CFG,
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;fuzz&#x2F;fuzz_targets&#x2F;domtree.rs&quot;&gt;that can be fuzzed
too&lt;&#x2F;a&gt;,
using a from-first-principles definition of graph dominance. So the
test-tester has itself been tested in this additional way.&lt;&#x2F;p&gt;
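&lt;p&gt;The appeal of such an oracle is that it can be obviously correct
rather than fast. A from-first-principles dominance computation
(illustrative, not the fuzz target&#x27;s exact code) just iterates
&quot;d dominates v iff d equals v, or d dominates every predecessor of
v&quot; to a fixpoint:&lt;&#x2F;p&gt;

```rust
// A naive, obviously-correct dominance oracle over a CFG given as a
// predecessor list, with node 0 as the entry. dom[v][d] == true means
// d dominates v. Illustrative code, not the regalloc2 fuzz target itself.

fn dominators(preds: &[Vec<usize>]) -> Vec<Vec<bool>> {
    let n = preds.len();
    // Over-approximate ("everything dominates everything"), then shrink.
    let mut dom = vec![vec![true; n]; n];
    for d in 0..n {
        dom[0][d] = d == 0; // the entry is dominated only by itself
    }
    let mut changed = true;
    while changed {
        changed = false;
        for v in 1..n {
            for d in 0..n {
                // d dominates v iff d == v, or every path to v passes
                // through a predecessor that d already dominates.
                let new = d == v || preds[v].iter().all(|&p| dom[p][d]);
                if dom[v][d] && !new {
                    dom[v][d] = false;
                    changed = true;
                }
            }
        }
    }
    dom
}
```

&lt;p&gt;Because sets only ever shrink, the loop terminates; comparing this
against a clever algorithm on arbitrary generated CFGs is exactly the
kind of cross-check described above.&lt;&#x2F;p&gt;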
&lt;p&gt;Once I had built enough trust in the lower-level tools, and
sharpened them all against each other, it was time to write the
register allocator itself. Once each major piece was implemented, I
first fuzzed it with the SSA function generator to check for panics
(assertion failures, mostly). Getting a clean run, given the
relatively generous spread of asserts throughout the codebase, gave
some confidence that the allocator was doing &lt;em&gt;something&lt;&#x2F;em&gt;
reasonable. But to truly be confident that the results were
semantically correct answers, we needed to lean more heavily on some
program analysis techniques.&lt;&#x2F;p&gt;
&lt;p&gt;In &lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;03&#x2F;15&#x2F;cranelift-isel-3&#x2F;&quot;&gt;another blog post&lt;&#x2F;a&gt; I detailed
our &quot;register allocator checker&quot;. In brief, this is a &lt;em&gt;symbolic
verification&lt;&#x2F;em&gt; engine that checks that the resulting register
allocations produce the same dataflow connectivity as the original,
pre-regalloc program. To fully verify regalloc2, I ported the checker
over, and drove the whole pipeline -- SSA function generator,
allocator, and checker -- with a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;fuzz&#x2F;fuzz_targets&#x2F;ion_checker.rs&quot;&gt;fuzz
target&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This workflow was remarkably (sometimes maddeningly!) effective. I
started with a supposedly complete allocator, and ran the
fuzzer. Within a few seconds it found a &quot;counterexample&quot; where,
according to the checker, regalloc2 produced an incorrect
allocation. I built
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;main&#x2F;src&#x2F;ion&#x2F;dump.rs&quot;&gt;annotation&lt;&#x2F;a&gt;
tooling to produce &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;gist.github.com&#x2F;cfallin&#x2F;38d80aac45da75ce9eb142f5e28c0648&quot;&gt;views of the allocator&#x27;s liveranges and other
metadata&lt;&#x2F;a&gt;
over the original program. I pored over this and debug-log output of
the allocator&#x27;s various stages, eventually worked out the bug (often
some corner-case I had not considered, or sometimes an unexpected
interaction between two different parts of the system) and came up
with a fix. With the particular fuzz-bug fixed, I started up the main
fuzzer again. libFuzzer seems to run over the entire corpus at
startup before generating new inputs, so it would sometimes quickly
reveal that my bugfixes had caused regressions in cases I had already
handled before. After
juggling solutions and finding some way to maintain correctness in all
cases, I would let the fuzzer run again, usually finding my next novel
fuzzbug within a few minutes.&lt;&#x2F;p&gt;
&lt;p&gt;This was my life for a month or so. Fuzzers, especially over complex
programs with strict oracles, are &lt;em&gt;relentless&lt;&#x2F;em&gt;: they leave no stone
unturned, they find every bug you could imagine and some you can&#x27;t,
and they accept no excuses. But one day... you run the fuzzer and you
find that it keeps running. And running. Three hours later, it&#x27;s still
running. There is no better feeling in the software-engineering
universe, and frankly fuzzing with a strong oracle (like symbolic
checking or differential execution fuzzing) is probably the
second-strongest assurance one will get that one&#x27;s code is &lt;em&gt;correct&lt;&#x2F;em&gt;
(with respect to the &quot;spec&quot; implied by the testcase generator and
oracles, mind!) short of actual formal verification. This was the
project that changed my opinion on fuzzing from &quot;nice to have
supplemental correctness technique&quot; to &quot;the only way to develop
complex software&quot;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;compatibility-and-migration-path&quot;&gt;Compatibility and Migration Path&lt;&#x2F;h3&gt;
&lt;p&gt;The last lesson I want to draw from my regalloc2 experience is how one
might think about compatibility and migrations, in the context of
large &quot;replace a whole unit&quot; updates to software projects.&lt;&#x2F;p&gt;
&lt;p&gt;The regalloc2 effort occurred within the context of the Cranelift
project, and was designed primarily for use in Cranelift (though it
can be used, and apparently is being used, as a standalone library
elsewhere as well). As such, a primary design directive for regalloc2
could be &quot;do whatever is needed to fit into Cranelift&#x27;s assumptions
about the register allocator&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, conforming to the imprint left by the last register
allocator is a good way to sacrifice a rare chance to explore
different corners of the design space. The regalloc.rs API, as
designed in 2020, was quite good for the time -- simple, easy
to use, and purpose-built for Cranelift -- but we subsequently learned
several lessons. For example, regalloc.rs required the program to be
already lowered out of SSA, resulting in somewhat inefficient
interactions between blockparam-generated moves and regalloc-generated
moves. Ideally we wanted to do something better here.&lt;&#x2F;p&gt;
&lt;p&gt;A timeline for context: regalloc2 proper was working, with its fuzzer
as its only client, after about 6 weeks of initial implementation
(late March to early May 2021). I cheerfully dove into a refactoring
of Cranelift at that point to adapt to the new abstractions.&lt;&#x2F;p&gt;
&lt;p&gt;Less cheerfully, after a few weeks of effort, I stopped this
direct-port effort at around 547 type errors remaining (having never
gotten past a full typecheck). There was simply too much changing all
at once, and it was clearly not going to be a reasonable single diff
to review or to trust for correctness. I had underestimated how much
would have to change; pulling one string loosened three others.&lt;&#x2F;p&gt;
&lt;p&gt;It was clear that some sort of transition would need to happen in
multiple stages, so I next built a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;pull&#x2F;127&quot;&gt;compatibility
shim&lt;&#x2F;a&gt; as a
new &quot;algorithm&quot; in regalloc.rs that was a thin wrapper around
regalloc2. This involved significant work in regalloc2 to expand its
range of accepted inputs: support for non-SSA code, support for
&quot;modify&quot; operands as well as uses and defs, and explicit handling of
program-level moves with integration into the move generation
logic. This was working by August of 2021. Performance results were
not as good as initially expected with &quot;native&quot; regalloc2 API usage,
but were a promising intermediate step nonetheless.&lt;&#x2F;p&gt;
&lt;p&gt;However, for somewhat complicated reasons, review of that PR stalled,
and I spent time in other parts of Cranelift (the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;isle&#x2F;docs&#x2F;language-reference.md&quot;&gt;ISLE&lt;&#x2F;a&gt;
DSL and instruction-selector backends using it). When I eventually
came back to RA2, in February 2022, several things had changed: some
refactoring (as a result of ISLE) made adaptation to &quot;SSA-like&quot; form
in x86 instructions easier, and the enhancements to regalloc2 as part
of the regalloc.rs compatibility shim also let us use RA2 directly and
migrate away from &quot;modify&quot; operands, moves, etc., in an incremental
way.&lt;&#x2F;p&gt;
&lt;p&gt;So I made a second attempt at porting Cranelift to use regalloc2
directly, this time
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;3989&quot;&gt;succeeding&lt;&#x2F;a&gt;,
to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;3942&quot;&gt;fairly good
results&lt;&#x2F;a&gt;. We&#x27;ve
been using RA2 since that PR merged in mid-April 2022, about a year
after RA2 began.&lt;&#x2F;p&gt;
&lt;p&gt;I learned a few valuable lessons from this saga, but the main one is:
incremental migration paths are everything. The above PR may look
horribly scary, but much of the churn was &quot;semantically boring&quot;: RA2
supported, in the end, most of the same abstractions as regalloc.rs,
with only blockparam handling changing fundamentally. This is a sort
of hybrid of the &quot;compatibility shim&quot; and &quot;direct use of new API&quot;
approaches: new API, but supporting a superset of the semantic demands
of the old API. One can then migrate single API use-sites at a time
away from &quot;legacy semantics&quot; and eventually delete the warts (e.g.,
&quot;modify&quot; operands in addition to pure uses&#x2F;defs) if one desires, but
that is decoupled from the main atomic switchover. I indeed hope to do
such cleanup in Cranelift, in due time.&lt;&#x2F;p&gt;
&lt;p&gt;Along with that, it is useful to think of a finite budget for
semantic&#x2F;design-level cleanup per change. Rewrites are opportune times
to push a project into a better design-space and benefit from lessons
learned, sometimes in ways that would be hard or impossible to do with
a truly incremental approach. However, at the margins where the
rewrite connects to the outside world, this shift causes tension and
so is fundamentally constrained or else has to pull the whole world
along with it. I am happy that regalloc2 pulls responsibility for SSA
lowering into the allocator; it can be handled more efficiently
there. Likewise I am happy that the compatibility-shim effort filled
in support for regalloc.rs features that made the rest of the
transition easier.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;unending-and-unwinnable-nature-of-heuristic-tuning&quot;&gt;Unending and Unwinnable Nature of Heuristic-Tuning&lt;&#x2F;h3&gt;
&lt;p&gt;The final lesson I wish to pull out of this experience is one that has
become apparent in the time since the initial transition to RA2: any
program that solves an NP-complete problem in a complex way, with a
hybridized ball of hundreds of individual heuristics and techniques
that somehow works most of the time, is &lt;em&gt;always&lt;&#x2F;em&gt; going to make someone
unhappy in some case, and at some point unambiguous wins become very
hard to find. That is not at all to say that it&#x27;s not worth continuing
attempts at optimization; sometimes improvements do become
apparent. But they become much rarer after an initial hillclimb to the
top of a &quot;competent implementation of one point in design-space&quot; local
maximum.&lt;&#x2F;p&gt;
&lt;p&gt;While looking for more performance, I experimented with many different
split heuristics. Especially difficult is splits&#x27; relationship to
loops: when one has a hot inner loop, one &lt;em&gt;really&lt;&#x2F;em&gt; wants to place a
split-point that implies an expensive move (load or store) &lt;em&gt;outside&lt;&#x2F;em&gt;
the inner loop. But getting this right in all cases is subtle, because
the winning tradeoff depends on register pressure inside the loop, how
many values are live across the loop and to the following code, how
many uses occur in the loop and how frequently (rare path vs. common
path), and so on. In the end, I actually abandoned a number of more
complex cost heuristics (an example is in this &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;commit&#x2F;428e6a41f7b37697196e3a82e8326f22839307b5&quot;&gt;never-merged
commit&lt;&#x2F;a&gt;)
and went with several simple heuristics: &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;b78ccbce6e5700bc1ed2356bbb1d3221de49a353&#x2F;src&#x2F;ion&#x2F;process.rs#L896-L903&quot;&gt;minimize the cost of the
implied move at a
split&lt;&#x2F;a&gt;,
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;b78ccbce6e5700bc1ed2356bbb1d3221de49a353&#x2F;src&#x2F;ion&#x2F;process.rs#L1036-L1052&quot;&gt;explicitly hoist split-points outside of
loops&lt;&#x2F;a&gt;. This
worked best overall, but did leave a little performance unclaimed in
some microbenchmarks.&lt;&#x2F;p&gt;
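A toy model can illustrate the shape of the "hoist the split-point out of the loop" heuristic. This is not regalloc2's actual code or cost function, and the frequency model below (cost grows as ten to the loop depth) is purely for illustration: each candidate split position carries an estimated execution frequency, and we pick the cheapest position at or before the point where register pressure forces a split.

```rust
// Toy split-point chooser: `loop_depth[i]` is the loop-nesting depth
// of candidate position `i`, and `must_split_by` is the latest
// position at which the split can happen (e.g., where we ran out of
// registers). Positions inside loops are heavily penalized, so the
// scan naturally hoists the split outside the inner loop.
fn best_split_point(loop_depth: &[u32], must_split_by: usize) -> usize {
    (0..=must_split_by)
        // Cost model: 10^depth. `min_by_key` returns the first
        // position among ties, so among equally cheap positions we
        // split as early as possible.
        .min_by_key(|&i| 10u64.pow(loop_depth[i]))
        .unwrap()
}

fn main() {
    // Positions 1..=4 sit inside a loop; pressure forces a split by
    // position 3. The heuristic hoists the split to position 0,
    // outside the loop, so the implied load/store runs once rather
    // than on every iteration.
    let depths = [0, 1, 1, 1, 1, 0];
    assert_eq!(best_split_point(&depths, 3), 0);

    // If every eligible position is inside some loop, we settle for
    // the shallowest one.
    let depths2 = [1, 2, 1, 2];
    assert_eq!(best_split_point(&depths2, 3), 0);
}
```

The real tradeoff is far subtler, as the paragraph above notes: register pressure inside the loop, liveness across and past the loop, and use frequency all feed into whether hoisting actually wins.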
&lt;p&gt;Sometimes clearer improvements are still possible. One example of a
recent investigation: in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;3785&quot;&gt;#3785&lt;&#x2F;a&gt;, we
noticed that switching to RA2 had caused an extra move instruction to
appear in a particular sequence. This seems minor, but it is always
good to understand &lt;em&gt;why&lt;&#x2F;em&gt; it might have occurred and if it points to
some deeper issue. After some
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;pull&#x2F;49&quot;&gt;investigation&lt;&#x2F;a&gt;
it became apparent that the splitting heuristics were suboptimal in
the particular case of a liverange that spans from a
register-constrained use to a stack-constrained use. The details are
beyond the scope of this post (thank goodness, it&#x27;s long enough
already!); but empirically I found that trimming liveranges around a
split-site in a slightly different way tended to improve results.&lt;&#x2F;p&gt;
&lt;p&gt;So, some changes will be an unmitigated win, but not every tradeoff is
so clear-cut. At the very least, the nature of a register allocator is that one
will likely have an unending stream of &quot;could work better in this
case&quot; sorts of issues. Can&#x27;t win &#x27;em all (but keep trying
nonetheless!).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;&#x2F;h2&gt;
&lt;p&gt;We&#x27;re finally at the conclusions -- thanks to all who have persisted
in reading this far!&lt;&#x2F;p&gt;
&lt;p&gt;regalloc2 has been an immensely rewarding project for me, despite (or
perhaps because of) the ups-and-downs inherent in building an
honest-to-goodness, actually-works,
somewhat-competitive-with-peer-compilers register allocator. It was a
far larger project than I had anticipated: when I began, I told my
manager it would probably be a few weeks to evaluate scope, maybe a
month of work total. Witness &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Hofstadter%27s_law&quot;&gt;Hofstadter&#x27;s
Law&lt;&#x2F;a&gt; in action: that
is, it will always take longer than you think it will, even when
accounting for Hofstadter&#x27;s Law.&lt;&#x2F;p&gt;
&lt;p&gt;I hope some of the above lessons have been illuminating, and perhaps
this post has given some sense of how many interesting problems the
register-allocator space contains. It&#x27;s been a well-studied area for at
least 40 years now, with countless approaches and clever tricks to
learn and to combine in new ways; the work is far from over!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;acknowledgments&quot;&gt;Acknowledgments&lt;&#x2F;h2&gt;
&lt;p&gt;Many, many thanks to: &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;julian-seward1&quot;&gt;Julian
Seward&lt;&#x2F;a&gt; and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bnjbvr&quot;&gt;Benjamin
Bouvier&lt;&#x2F;a&gt; for numerous discussions about
register allocation throughout 2020, and Julian for several followup
discussions after regalloc2 started to exist; Julian Seward and
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;Amanieu&quot;&gt;Amanieu d&#x27;Antras&lt;&#x2F;a&gt; for initial code-review
of regalloc2 proper; Amanieu for a number of really high-quality PRs
to improve RA2 and add feature support; and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;fitzgen&quot;&gt;Nick
Fitzgerald&lt;&#x2F;a&gt; for code-review of the (quite
extensive) Cranelift refactoring to use regalloc2. Enormous thanks to
Nick for reading over this entire post and providing feedback as well.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;which is to say, the original
&lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&#x2F;&quot;&gt;three&lt;&#x2F;a&gt;-&lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;01&#x2F;22&#x2F;cranelift-isel-2&#x2F;&quot;&gt;part&lt;&#x2F;a&gt;
&lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;03&#x2F;15&#x2F;cranelift-isel-3&#x2F;&quot;&gt;series&lt;&#x2F;a&gt; covered a range of
topics summarizing the goals and ideas of Cranelift&#x27;s new
backend design, but we haven&#x27;t stopped working to improve things
since then! The series is now four-thirds complete; by the time
I&#x27;m done it may be five-thirds or more...&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;In fact, it is perhaps the most important problem to solve for a
fast Wasm-focused compiler, because most other common compiler
optimizations will have been done at least to some degree to the
Wasm bytecode; register allocation is the main transform that
bridges the semantic gap from stack bytecode to machine code.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;Other sorts of constraints are possible too; in general, a
liverange is constrained by all of the &quot;register mentions&quot; in
instructions that touch the liverange&#x27;s vreg, and we have to
satisfy all of these constraints at once. A constraint may be
&quot;any register of this kind&quot;, or &quot;this particular physical
register&quot;, or &quot;a slot on the stack&quot;, or &quot;the same register as
given to another liverange&quot;, for example. And beyond
constraints, we may have soft &quot;hints&quot; as well, which if
followed, reduce the need to move values around.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;4&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;regalloc2 supports arbitrary control flow (i.e., does not impose
any restrictions on
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_graph#Reducibility&quot;&gt;reducibility&lt;&#x2F;a&gt;);
its only requirement is that &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_graph#Special_edges&quot;&gt;critical
edges&lt;&#x2F;a&gt;
are split, which Cranelift ensures by construction during
lowering.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;5&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;5&lt;&#x2F;sup&gt;
&lt;p&gt;Full credit for this idea, as well as most of the constraint
design in regalloc2, goes to IonMonkey&#x27;s register allocator.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;6&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;6&lt;&#x2F;sup&gt;
&lt;p&gt;As we&#x27;ll note under &quot;Lessons&quot; below, during development of a
compatibility layer that allowed regalloc2 to emulate
regalloc.rs, an earlier register allocator, we actually added a
&quot;modify&quot; kind of operand that directly corresponds to the
semantics of &lt;code&gt;rax&lt;&#x2F;code&gt; above, namely read-then-written all in one
register. We subsequently used it in several places while
migrating Cranelift. But for simplicity we hope to eventually
remove this (once all uses of it are gone).&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;7&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;7&lt;&#x2F;sup&gt;
&lt;p&gt;It&#x27;s actually a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;33611a68b90e40869bba52934449315a8f4e5477&#x2F;src&#x2F;indexset.rs#L13-L26&quot;&gt;sparse
bitset&lt;&#x2F;a&gt;
that, when large enough, stores a hashmap whose values are
&lt;em&gt;contiguous 64-bit chunks&lt;&#x2F;em&gt; of the whole bitset. This is because,
for large functions with thousands of virtual registers, keeping
thousands of bits per basic block would be impractical. However,
the naive sparse approach, where we keep a &lt;code&gt;HashSet&amp;lt;VReg&amp;gt;&lt;&#x2F;code&gt; or
equivalent, is also costly because it spends 32 bits per set
element (plus load-factor overhead). We observed that the live
registers at a given point are often &quot;clustered&quot;: there are some
long-range live values from early in the function, and then a
bunch of recently-defined registers. (This depends also on
virtual registers being numbered roughly in program order, which
is generally a good heuristic to rely on.) So we have a few
&lt;code&gt;u64&lt;&#x2F;code&gt;s and pay the sparse map cost for those, then have a dense
map within each 64-bit chunk.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
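A hedged sketch of the chunked representation this footnote describes (a hypothetical simplification of regalloc2's `IndexSet`, with invented names): a map keyed by chunk index whose values are dense 64-bit chunks, so clustered register numbers share map entries.

```rust
use std::collections::HashMap;

/// Sparse bitset storing dense 64-bit chunks in a map. Clustered
/// bit indices (e.g., vregs numbered in program order) land in the
/// same chunk, so the map stays small.
struct ChunkedBitSet {
    chunks: HashMap<u32, u64>,
}

impl ChunkedBitSet {
    fn new() -> Self {
        ChunkedBitSet { chunks: HashMap::new() }
    }

    fn insert(&mut self, bit: u32) {
        // High bits select the chunk; low 6 bits select the bit
        // within the chunk.
        let (chunk, offset) = (bit / 64, bit % 64);
        *self.chunks.entry(chunk).or_insert(0) |= 1u64 << offset;
    }

    fn contains(&self, bit: u32) -> bool {
        let (chunk, offset) = (bit / 64, bit % 64);
        self.chunks
            .get(&chunk)
            .map_or(false, |bits| bits & (1u64 << offset) != 0)
    }
}

fn main() {
    let mut live = ChunkedBitSet::new();
    // Two nearby vregs plus one far-away one: only two map entries.
    live.insert(3);
    live.insert(5);
    live.insert(4097);
    assert!(live.contains(3));
    assert!(!live.contains(4));
    assert!(live.contains(4097));
    assert_eq!(live.chunks.len(), 2);
}
```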
&lt;div class=&quot;footnote-definition&quot; id=&quot;8&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;8&lt;&#x2F;sup&gt;
&lt;p&gt;Credit must go to IonMonkey for &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;70cf6863bd85af2a3188ec1fe5209a3ec1b2de86&#x2F;js&#x2F;src&#x2F;jit&#x2F;BacktrackingAllocator.cpp#643-647&quot;&gt;this
trick&lt;&#x2F;a&gt;
as well, though the details of how to edit the liveranges
appropriately to get the right interference semantics were &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;33611a68b90e40869bba52934449315a8f4e5477&#x2F;src&#x2F;ion&#x2F;moves.rs#L762-L805&quot;&gt;far
from
clear&lt;&#x2F;a&gt;
and the path to our current approach was &quot;paved by fuzzbug
failures&quot;, so to speak.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;9&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;9&lt;&#x2F;sup&gt;
&lt;p&gt;Some literature on SSA form calls the connected set of
liveranges via phi-nodes or block parameters &quot;webs&quot;. Our notion
of a bundle encompasses this case but is a bit more general; in
principle we can merge any liveranges into a bundle as long as
they don&#x27;t overlap.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;10&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;10&lt;&#x2F;sup&gt;
&lt;p&gt;Actually, there are &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc2&#x2F;blob&#x2F;0395614545da5ccc45866bfa50dcdbe9cc37c253&#x2F;src&#x2F;ion&#x2F;data_structures.rs#L561-L570&quot;&gt;up to
seven&lt;&#x2F;a&gt;
parallel moves between instructions, at priorities according to
the way that various constraint edge-cases are lowered. For
example, when a single vreg must be placed in multiple physical
registers due to multiple uses with different fixed-register
constraints, the move that makes this happen occurs at
&lt;code&gt;MultiFixedReg&lt;&#x2F;code&gt; priority, which comes after the main
inter-instruction permutation (it is logically part of the input
setup for the following instruction). And &lt;code&gt;ReusedInput&lt;&#x2F;code&gt; moves
happen after that, because any one of the fixed-register inputs
could be reused as an input. The detailed reasoning for the
order here is beyond the scope of this blogpost, but suffice it
to say that the fuzzer helped immensely in getting this ordering
right!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;11&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;11&lt;&#x2F;sup&gt;
&lt;p&gt;Pertinent to the broader point about fuzzing, this combination
of constraints was not generated by RA2&#x27;s fuzz target, which is
why the resulting corner cases were not seen during
development. As soon as the fuzzing testcase generator was
extended to do so, the fuzzer found a counterexample within a
few seconds, and helped to verify the constraint rewrites in
RA2&#x27;s frontend that fixed this issue.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Cranelift, Part 3: Correctness in Register Allocation</title>
        <published>2021-03-15T00:00:00+00:00</published>
        <updated>2021-03-15T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2021/03/15/cranelift-isel-3/"/>
        <id>https://cfallin.org/blog/2021/03/15/cranelift-isel-3/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2021/03/15/cranelift-isel-3/">&lt;p&gt;This post is the last in a three-part series about
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&quot;&gt;Cranelift&lt;&#x2F;a&gt;.
In the &lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&#x2F;&quot;&gt;first post&lt;&#x2F;a&gt;, I covered
overall context and the instruction-selection problem; in the &lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;01&#x2F;22&#x2F;cranelift-isel-2&#x2F;&quot;&gt;second
post&lt;&#x2F;a&gt;, I took a deep dive into
compiler performance via careful algorithmic design.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, I want to dive into how we engineer for and work to
ensure &lt;em&gt;correctness&lt;&#x2F;em&gt;, which is perhaps the most important aspect of
any compiler project. A compiler is usually a complex beast: to obtain
reasonable performance, one must perform quite complex analyses and
carefully transform an arbitrary program in ways that preserve its
meaning. It is likely that one will make mistakes and miss subtle
corner cases, especially in the cracks and crevices between
components. Despite all of that, correct code generation is &lt;em&gt;vital&lt;&#x2F;em&gt;
because the consequences of miscompilation are potentially so severe:
basically any guarantee (security-related or otherwise) that we make
at a higher level of the system stack relies on the (quite
reasonable!) assumption that the computer will execute the source code
we have written faithfully. If the compiler translates our code to
something else, then all bets are off.&lt;&#x2F;p&gt;
&lt;p&gt;There are ways that one can apply good engineering principles to
reduce this risk. An extremely powerful technique derives from the
insight that &lt;em&gt;checking a result&lt;&#x2F;em&gt; is usually easier than &lt;em&gt;computing&lt;&#x2F;em&gt;
it, and if we randomly generate many inputs, run our compiler (or
other program) on these inputs, and check its output, we can get to a
&lt;em&gt;statistical approximation&lt;&#x2F;em&gt; of the claim &quot;for all inputs, the compiler
generates the correct output&quot;. The more random inputs we try, the
stronger this statement becomes. This technique is known as
&lt;em&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Fuzzing&quot;&gt;fuzzing&lt;&#x2F;a&gt;&lt;&#x2F;em&gt; with a
&lt;em&gt;program-specific
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Test_oracle&quot;&gt;oracle&lt;&#x2F;a&gt;&lt;&#x2F;em&gt;, and I could
write a lengthy ode to its uncanny power to find bugs (many others
have, already).&lt;&#x2F;p&gt;
&lt;p&gt;In this post, I will cover how we worked to ensure correctness in our
register allocator,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&quot;&gt;regalloc.rs&lt;&#x2F;a&gt;, by
developing a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;tree&#x2F;main&#x2F;lib&#x2F;src&#x2F;checker.rs&quot;&gt;symbolic
checker&lt;&#x2F;a&gt;
that uses abstract interpretation to prove correctness for a specific
register allocation result. By using this checker as a fuzzing oracle,
and driving just the register allocator with a focused fuzzing target,
we have been able to uncover some very interesting and subtle bugs,
and achieve a fairly high confidence in the allocator&#x27;s robustness.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-is-register-allocation&quot;&gt;What is Register Allocation?&lt;&#x2F;h2&gt;
&lt;p&gt;Before we dive in, we need to cover a few basics. Most importantly:
what is the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Register_allocation&quot;&gt;register allocation
problem&lt;&#x2F;a&gt;, and what
makes it hard?&lt;&#x2F;p&gt;
&lt;p&gt;In a typical programming language, a program can have an arbitrary
number of variables or values in scope. This is a very useful
abstraction: it is easiest to describe an algorithm when one does not
have to worry about where to store the values.&lt;&#x2F;p&gt;
&lt;p&gt;For example, one could write the program:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;void f() {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    int x0 = compute(0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    int x1 = compute(1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    int x99 = compute(99);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ---&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    consume(x0);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    consume(x1);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    consume(x99);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At the midpoint of the program (the &lt;code&gt;---&lt;&#x2F;code&gt; mark), there are 100
&lt;code&gt;int&lt;&#x2F;code&gt;-sized values that have been computed and are later used. When
the compiler produces machine code for this function, where are those
values stored?&lt;&#x2F;p&gt;
&lt;p&gt;For small functions with only a few values, it is easy to place every
value in a CPU register. But most CPUs do not have 100 general-purpose
registers for storing integers&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;; and in general, most languages
either do not place limits on the number of local variables or else
impose limits that are much, much higher than the typical number of
CPU registers. So we need some approach that scales beyond, say, about
16 values (x86-64) or about 32 values (aarch64) in use at once.&lt;&#x2F;p&gt;
&lt;p&gt;A very simple answer is to allocate a &lt;em&gt;memory&lt;&#x2F;em&gt; location for each local
variable. In fact this is exactly what the C programming model
provides: all of the &lt;code&gt;xN&lt;&#x2F;code&gt; variables above &lt;em&gt;semantically&lt;&#x2F;em&gt; live in
memory, and we can take the address &lt;code&gt;&amp;amp;xN&lt;&#x2F;code&gt;. If one does this, one will
find that the addresses are part of the &lt;em&gt;stack&lt;&#x2F;em&gt;. When the function is
called, it allocates a new area on the stack called the &lt;em&gt;stack frame&lt;&#x2F;em&gt;
and uses it to store local variables.&lt;&#x2F;p&gt;
&lt;p&gt;This is far from the best we can do, though! Consider what this means
when we actually perform some operation on the locals. If we read two
locals, perform an addition, and store the result in a third, like so:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;x0 = x1 + x2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;then in machine code, because most CPUs do not have instructions that
can read two in-memory values and write back a third in-memory result,
we would need to emit something like the following:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld r0, [address of x1]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld r1, [address of x2]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add r0, r0, r1  &#x2F;&#x2F; r0 := r0 + r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;st r0, [address of x0]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Compiling code in this way is very &lt;em&gt;fast&lt;&#x2F;em&gt; because we need to make
almost no decisions: a variable reference &lt;em&gt;always&lt;&#x2F;em&gt; becomes a memory
load, for example. This is how a &quot;baseline &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Just-in-time_compilation&quot;&gt;JIT
compiler&lt;&#x2F;a&gt;&quot;
typically works: for example, in the SpiderMonkey JS and
Wasm JIT compiler, the baseline tier -- which is meant to produce
passable code very, very quickly -- keeps a stack of values
in memory that corresponds one-to-one to the JS bytecode or Wasm
bytecode&#x27;s value stack. (You can read the code
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;rev&#x2F;38ed718a101aca27db25984413c052ccd8c0ceda&#x2F;js&#x2F;src&#x2F;jit&#x2F;CacheIRCompiler.h#301&quot;&gt;here&lt;&#x2F;a&gt;:
it actually keeps a few of the most recent values, at the top of the
operand stack, in fixed registers and the rest in memory.)&lt;&#x2F;p&gt;
&lt;p&gt;Unfortunately, accessing memory multiple times for every operation is
very slow. What&#x27;s more, it is often the case that values are &lt;em&gt;reused
soon after being produced&lt;&#x2F;em&gt;: for example, we might have&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;x0 = x1 + x2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;x3 = x0 * 2;&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When we compute &lt;code&gt;x3&lt;&#x2F;code&gt; using &lt;code&gt;x0&lt;&#x2F;code&gt;, do we reload &lt;code&gt;x0&lt;&#x2F;code&gt;&#x27;s value from memory
immediately after storing it? A smarter compiler should be able to
remember that it had just computed the value, and should keep it in a
register, avoiding the round-trip through memory altogether.&lt;&#x2F;p&gt;
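&lt;p&gt;Concretely, the two statements above might compile to something like
this (a sketch; exact instructions vary by architecture):&lt;&#x2F;p&gt;

```
ld r0, [address of x1]
ld r1, [address of x2]
add r0, r0, r1          // r0 now holds x0's value
st r0, [address of x0]  // keep the in-memory copy up to date
mul r0, r0, 2           // reuse r0 directly -- no reload of x0
st r0, [address of x3]
```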
&lt;p&gt;This is &lt;em&gt;register allocation&lt;&#x2F;em&gt;: it is assigning a value in the program
to a register for storage. What makes register allocation interesting
is that (as noted above) there are fewer CPU registers than the number
of allowable program values, so we have to choose some subset of
values to keep in registers. This is often constrained in certain
ways: for example, an &lt;code&gt;add&lt;&#x2F;code&gt; instruction on RISC-like CPUs can only
read from, and write to, registers, so a value&#x27;s storage location must
be a register immediately before it is used by a &lt;code&gt;+&lt;&#x2F;code&gt;
operator. Fortunately, the location assignments can change over time,
so that at different points in the machine code, a register can be
assigned to hold different values. The job of the register allocator
is to decide how to shuffle values between memory and registers, and
between registers, so that at any given moment, each value that needs
to be in a register is in one.&lt;&#x2F;p&gt;
&lt;p&gt;In our design, the register allocator will accept as input a type of
almost-machine-code called &quot;virtual-register code&quot;, or &lt;code&gt;VCode&lt;&#x2F;code&gt;. This
has a sequence of machine instructions, but registers named in the
instructions are &lt;em&gt;virtual&lt;&#x2F;em&gt; registers: the compiler can use as many of
them as it needs. The register allocator will (i) rewrite the register
references in the instructions to be actual machine register names,
and (ii) insert instructions to shuffle data as needed. These
instructions are called &lt;em&gt;spills&lt;&#x2F;em&gt; when they move a value from a
register to memory; &lt;em&gt;reloads&lt;&#x2F;em&gt; when they move a value from memory back
to a register; and &lt;em&gt;moves&lt;&#x2F;em&gt; when they move values between
registers. The memory locations where values are stored when not in
registers are called &lt;em&gt;spill slots&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
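&lt;p&gt;As a rough sketch of this data model (hypothetical types, not
Cranelift&#x27;s actual &lt;code&gt;VCode&lt;&#x2F;code&gt; representation), we can model an
instruction as an opcode plus the registers it writes and reads, and step
(i) of the allocator&#x27;s job as a substitution of real registers for
virtual ones:&lt;&#x2F;p&gt;

```python
# Hypothetical model of virtual-register code: registers are plain
# strings, with "v0", "v1", ... virtual (unbounded supply) and
# "r0", "r1", ... real machine registers.  An instruction is
# (opcode, registers written, registers read).
vcode = [
    ("ld",  ["v0"], []),           # v0 := load
    ("ld",  ["v1"], []),           # v1 := load
    ("add", ["v2"], ["v0", "v1"]), # v2 := v0 + v1
]

def is_virtual(reg):
    return reg.startswith("v")

# Step (i) of the allocator's job: rewrite virtual-register references
# to real ones under a chosen mapping.  (Step (ii), inserting spills,
# reloads, and moves, is where all the difficulty lives.)
def rewrite(instrs, mapping):
    out = []
    for (op, writes, reads) in instrs:
        out.append((op,
                    [mapping[r] if is_virtual(r) else r for r in writes],
                    [mapping[r] if is_virtual(r) else r for r in reads]))
    return out

# With only two real registers, v2 can reuse r0 once v0 is dead.
allocated = rewrite(vcode, {"v0": "r0", "v1": "r1", "v2": "r0"})
```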
&lt;p&gt;An example of the register-allocation problem is shown below on a
program with four instructions:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-12-23-regalloc-web.svg&quot; alt=&quot;Figure: Register allocation&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This allocation is performed onto a machine with two registers (&lt;code&gt;r0&lt;&#x2F;code&gt;
and &lt;code&gt;r1&lt;&#x2F;code&gt;). On the left, the original program is written in an
assembly-like form with &lt;em&gt;virtual registers&lt;&#x2F;em&gt;. On the right, the program
has been modified to use only &lt;em&gt;real registers&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Between each instruction, we have written a mapping from virtual
registers to real registers. The register allocator&#x27;s task is just
(&quot;just&quot;!) to compute these mappings and then edit the instructions,
taking their register references through these mappings.&lt;&#x2F;p&gt;
&lt;p&gt;Note that the program, at one point, has &lt;em&gt;three&lt;&#x2F;em&gt; live values, or
values that still must be preserved because they will be used later:
between the first and second instructions, all of &lt;code&gt;v0&lt;&#x2F;code&gt;, &lt;code&gt;v1&lt;&#x2F;code&gt; and &lt;code&gt;v2&lt;&#x2F;code&gt;
are live.  The machine has only two registers, so it cannot hold all
live values in them; it must spill at least one. This is the reason
for the &lt;em&gt;spill instruction&lt;&#x2F;em&gt;, written as a store to the stack slot
&lt;code&gt;[sp+0]&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
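&lt;p&gt;The liveness notion used here is easy to compute for straight-line
code. Below is a toy backward scan (the example program is made up for
illustration; a real allocator must also handle control flow and
instruction constraints):&lt;&#x2F;p&gt;

```python
# Toy liveness analysis for straight-line code.  Each instruction is
# (writes, reads); we scan backward: a register is live before an
# instruction if it is read there, or if it is live after it and not
# written there.
def liveness(instrs):
    live = set()
    live_before = [None] * len(instrs)
    for i in reversed(range(len(instrs))):
        writes, reads = instrs[i]
        live = (live - set(writes)) | set(reads)
        live_before[i] = set(live)
    return live_before

# A made-up four-instruction program in which three values are live at
# once: v0, v1, and v2 must all survive until the last instruction.
program = [
    (["v0"], []),
    (["v1"], []),
    (["v2"], ["v0", "v1"]),
    (["v3"], ["v0", "v1", "v2"]),
]
# liveness(program)[3] is {"v0", "v1", "v2"}: with only two real
# registers, at least one of these values must be spilled.
```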
&lt;h2 id=&quot;how-hard-is-register-allocation&quot;&gt;How Hard is Register Allocation?&lt;&#x2F;h2&gt;
&lt;p&gt;In general, the register allocator will first analyze the program to
work out which values are live at which program points. This liveness
information and related constraints specify a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Combinatorial_optimization&quot;&gt;combinatorial
optimization&lt;&#x2F;a&gt;
problem: certain values must be stored &lt;em&gt;somewhere&lt;&#x2F;em&gt; at each point,
constraints limit which choices can be made and some choices will
conflict with some others (e.g., two values cannot occupy a register
at the same time), and a set of choices implies some cost (in data
movement). The allocator will solve this optimization problem as well
as it can using heuristics of some sort, depending on the register
allocator.&lt;&#x2F;p&gt;
&lt;p&gt;Is this a hard problem? In fact, it is not only hard in a colloquial sense,
but &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;NP-completeness&quot;&gt;NP-complete&lt;&#x2F;a&gt;: this
means that it is at least as hard as any other problem in NP, a class for
which we know only exponential-time brute-force algorithms in the worst
case.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt; The
reason is that the problem does not have &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Optimal_substructure&quot;&gt;optimal
substructure&lt;&#x2F;a&gt;: it
cannot be decomposed into non-interacting parts that can each be solved
separately and then built up into an overall solution; rather, decisions at
one point affect decisions elsewhere, potentially anywhere else in the
function body. Thus, in the worst case, we can&#x27;t do better than a
brute-force search if we want an optimal solution.&lt;&#x2F;p&gt;
&lt;p&gt;There are many good &lt;em&gt;approximations&lt;&#x2F;em&gt; to optimal register allocation. A
common one is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;330249.330250&quot;&gt;linear-scan register
allocation&lt;&#x2F;a&gt;, which can run in
almost-linear time (with respect to the code size). Allocators that can
afford to spend more time are more complex: for example, in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&quot;&gt;regalloc.rs&lt;&#x2F;a&gt;, in addition
to the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;tree&#x2F;main&#x2F;lib&#x2F;src&#x2F;linear_scan&quot;&gt;linear-scan
implementation&lt;&#x2F;a&gt;
(written by my brilliant colleague Benjamin Bouvier), we have a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;blob&#x2F;main&#x2F;lib&#x2F;src&#x2F;bt_main.rs&quot;&gt;&quot;backtracking&quot;
algorithm&lt;&#x2F;a&gt;
(written by my other brilliant colleague Julian Seward) that can edit and
improve its choices as it discovers higher-priority uses for registers.&lt;&#x2F;p&gt;
&lt;p&gt;The details of how these algorithms work do not really matter here,
except to say that they are &lt;em&gt;very&lt;&#x2F;em&gt; complicated and hard to get
right. An algorithm that appears relatively simple at the conceptual
level or in pseudocode quickly runs into interesting and subtle
considerations as real-world constraints creep in. The regalloc.rs
codebase is about 25K lines of deeply-algorithmic Rust code; any
reasonable engineer would expect this to include at least several
bugs! Compounding the urgency here, a register-allocation bug can
result in &lt;em&gt;arbitrary&lt;&#x2F;em&gt; incorrect results, because the register
allocator is in charge of &quot;wiring up&quot; all of the dataflow in the
program. If we exchange one arbitrary value with another arbitrary
value in the program, anything could happen.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-to-verify-correctness&quot;&gt;How to Verify Correctness&lt;&#x2F;h2&gt;
&lt;p&gt;So we want to write a correct register allocator. How do we even start
on a task like this?&lt;&#x2F;p&gt;
&lt;p&gt;It might help to break down what we mean by &quot;correct&quot;. Note that the
register allocation problem has a nice property: the programs both
&lt;em&gt;before&lt;&#x2F;em&gt; and &lt;em&gt;after&lt;&#x2F;em&gt; allocation have a well-defined semantics. In
particular, we can think of register allocation as a transformation
that converts programs running on an &lt;em&gt;infinite-register machine&lt;&#x2F;em&gt;
(where we can use as many virtual registers as we want) to a
&lt;em&gt;finite-register machine&lt;&#x2F;em&gt; (where the CPU has a fixed set of
registers). If the original program on the infinite-register machine
yields the same result as the transformed (register-allocated) program
on the finite-register machine, then we have achieved a correct
register allocation.&lt;&#x2F;p&gt;
&lt;p&gt;How do we test this equivalence?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;single-program-single-input-equivalence&quot;&gt;Single-Program, Single-Input Equivalence&lt;&#x2F;h3&gt;
&lt;p&gt;The simplest way to test whether two programs are equivalent is to run
them and compare the results! Let&#x27;s say we do this: for a single
program, choose some random inputs, and run the virtual-registerized
program alongside its register-allocated version on the appropriate
interpreters. Compare register and memory state at the end.&lt;&#x2F;p&gt;
&lt;p&gt;What does it mean if the final machine states match? It means that
&lt;em&gt;for this one program&lt;&#x2F;em&gt;, our register allocator produces a transformed
program that is correct &lt;em&gt;for this one program input&lt;&#x2F;em&gt;. Note the two
qualifications here. First, we have not necessarily shown that the
register allocation is correct given another program input. Perhaps a
different input causes a branch to go down another program path, and
the register allocator introduced an error on that path. Second, we
have not shown anything for any other program; we have only tested a
single program and its register-allocated output.&lt;&#x2F;p&gt;
&lt;p&gt;We can attempt to address the first limitation -- correctness only
under one input -- by taking more sample points. For example, we could
choose a thousand random program inputs, and even drive this random
choice with some sort of feedback that tries to maximize control-flow
coverage or other &quot;interesting&quot; behavior (as fuzzers do). We could
probably achieve reasonable confidence that this single register
allocation result is correct, given enough test cases.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2021-03-10-single-program-single-input-web.svg&quot; alt=&quot;Figure: Checking a program with concrete inputs&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;However, this is still very &lt;em&gt;expensive&lt;&#x2F;em&gt;: we are asking to run the
whole program N times to get a sample size of N. Even a single
execution may be expensive: the program on which we have performed
register allocation might be a compiler, or a videogame, for example.&lt;&#x2F;p&gt;
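&lt;p&gt;A toy version of this single-input differential check (hypothetical
mini-ISA and harness, not regalloc.rs&#x27;s actual fuzzing setup) runs the
virtual-register program and its register-allocated form on the same
random inputs and compares results:&lt;&#x2F;p&gt;

```python
import random

# Tiny interpreter over a toy ISA.  Instructions:
#   ("ld", dst, addr)    dst := memory[addr]
#   ("add", dst, a, b)   dst := a + b
#   ("ret", src)         return the value in src
def run(program, memory):
    regs = {}
    for instr in program:
        if instr[0] == "ld":
            regs[instr[1]] = memory[instr[2]]
        elif instr[0] == "add":
            regs[instr[1]] = regs[instr[2]] + regs[instr[3]]
        elif instr[0] == "ret":
            return regs[instr[1]]

# The same computation before and after register allocation.
before = [("ld", "v0", "A"), ("ld", "v1", "B"),
          ("add", "v2", "v0", "v1"), ("ret", "v2")]
after = [("ld", "r0", "A"), ("ld", "r1", "B"),
         ("add", "r0", "r0", "r1"), ("ret", "r0")]

# Each iteration samples one input point; a mismatch would flag a bug.
for _ in range(1000):
    mem = {"A": random.randrange(2**32), "B": random.randrange(2**32)}
    assert run(before, mem) == run(after, mem)
```

&lt;p&gt;Each passing input adds only one sample point of confidence, and each
sample costs a full execution of both programs.&lt;&#x2F;p&gt;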
&lt;h3 id=&quot;single-program-for-all-input-equivalence&quot;&gt;Single-Program, For-all-Input Equivalence&lt;&#x2F;h3&gt;
&lt;p&gt;Can we avoid the need to run the program &lt;em&gt;at all&lt;&#x2F;em&gt; to test that its
register-allocated version is correct?&lt;&#x2F;p&gt;
&lt;p&gt;The answer is surprisingly simple: yes, we can, by simply altering the
&lt;em&gt;domain&lt;&#x2F;em&gt; that the program executes on. Ordinarily we think of CPU
registers as containing concrete numbers -- say, 64-bit values. What
if they contained &lt;em&gt;symbols&lt;&#x2F;em&gt; instead?&lt;&#x2F;p&gt;
&lt;p&gt;By generalizing over program values with symbols, we can often
represent the state of the system in terms of inputs without caring
what those inputs are. For example, given the program:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld v0, [A]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld v1, [B]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld v2, [C]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add v3, v0, v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add v4, v2, v3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return v4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;register-allocated to:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld r0, [A]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld r1, [B]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ld r2, [C]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add r0, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add r0, r2, r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;without symbolic reasoning, we could store arbitrary integers to
memory locations &lt;code&gt;A&lt;&#x2F;code&gt;, &lt;code&gt;B&lt;&#x2F;code&gt; and &lt;code&gt;C&lt;&#x2F;code&gt; and simulate the program&#x27;s execution
before and after register allocation, never seeing a mismatch, but
this would not prove anything unless we iterated through all possible
values. However, if we suppose that after the three loads, &lt;code&gt;r0&lt;&#x2F;code&gt;
contains &lt;code&gt;v0&lt;&#x2F;code&gt; (as a symbolic value, whatever it is), &lt;code&gt;r1&lt;&#x2F;code&gt; contains
&lt;code&gt;v1&lt;&#x2F;code&gt;, and &lt;code&gt;r2&lt;&#x2F;code&gt; contains &lt;code&gt;v2&lt;&#x2F;code&gt;, and that &lt;code&gt;r0&lt;&#x2F;code&gt; contains &lt;code&gt;v3&lt;&#x2F;code&gt; after the
first add and &lt;code&gt;v4&lt;&#x2F;code&gt; after the second add, we can see the correspondence
by matching up the symbols.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2021-03-10-single-program-symbolic-web.svg&quot; alt=&quot;Figure: Checking a program with symbolic values&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;This is a very simple example, and perhaps under-sells the insight and
power of this approach; we will come back to it later when we talk
about &lt;em&gt;Abstract Interpretation&lt;&#x2F;em&gt; below.&lt;&#x2F;p&gt;
&lt;p&gt;In any case, what we have shown is that for a single instance of the
register-allocation problem, we can &lt;em&gt;prove&lt;&#x2F;em&gt; that it transformed the
program in a correct way. Concretely, this means that the machine code
that we generate will execute just as if we were interpreting the
virtual-register code; if we can correctly generate virtual-register
code, then our compiler is correct. That&#x27;s excellent! Can we go
further?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;for-all-programs-equivalence&quot;&gt;For-all-Programs Equivalence&lt;&#x2F;h3&gt;
&lt;p&gt;We could prove a-priori that the register allocator will &lt;em&gt;always&lt;&#x2F;em&gt;
transform &lt;em&gt;any&lt;&#x2F;em&gt; program in a way that is correct. In other words, we
could abstract not only over the input values to the program, but over
the &lt;em&gt;program&lt;&#x2F;em&gt; itself.&lt;&#x2F;p&gt;
&lt;p&gt;If we can prove this, then we have no need to run any sort of check at
runtime. Abstracting over program inputs lets us avoid the need to run
the program; we know the register allocation is correct for all
inputs. In an analogous way, abstracting over the program to be
register-allocated would let us avoid the need to run the register
allocator; we know the register &lt;em&gt;allocator&lt;&#x2F;em&gt; is correct for all
&lt;em&gt;programs&lt;&#x2F;em&gt; and for all &lt;em&gt;inputs&lt;&#x2F;em&gt; to those programs.&lt;&#x2F;p&gt;
&lt;p&gt;One can imagine that this is much harder. In fact, it has been done,
but is a significant proof-engineering effort, and is a realm of
active research: this basically requires writing a machine-verifiable
proof that one&#x27;s compiler algorithms are correct. Such proven-correct
compilers exist: e.g., &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;compcert.org&#x2F;&quot;&gt;CompCert&lt;&#x2F;a&gt; has been
proven to compile C correctly to machine code for several
platforms. Unfortunately, such efforts are strongly limited by the
proof-engineering effort that is required, and thus this approach is
unlikely to be feasible for a compiler unless verified correctness is its
primary goal.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;our-approach-allocator-with-checker&quot;&gt;Our Approach: Allocator with Checker&lt;&#x2F;h3&gt;
&lt;p&gt;Given all of the above, we choose what we believe is the most
reasonable tradeoff: we build a &lt;em&gt;symbolic checker&lt;&#x2F;em&gt; for the &lt;em&gt;output&lt;&#x2F;em&gt; of
the register allocator. This does not let us make a static claim that
our register allocator is correct, but it &lt;em&gt;does&lt;&#x2F;em&gt; let us &lt;em&gt;prove&lt;&#x2F;em&gt; that
it is correct for any given compiler run; and if we use this as a
fuzzing oracle, we can build &lt;em&gt;statistical confidence&lt;&#x2F;em&gt; that it is
correct for all compiler runs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;checking-the-register-allocator&quot;&gt;Checking the Register Allocator&lt;&#x2F;h2&gt;
&lt;p&gt;Our overall flow is pictured below:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2021-03-10-checker-web.svg&quot; alt=&quot;Figure: Augmenting register allocation with a checker&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;There are two ways in which we can add a
register-allocator checker into the system. The first, on the left, we call
&quot;runtime checking&quot;: in this mode, every register allocator execution is checked
and the machine code using the allocations is not permitted to execute (i.e.
the compiler does not return a result) until the checker verifies equivalence.
This is the safest mode: it provides the same guarantees as a proven-correct
allocator (&quot;for-all-programs equivalence&quot; above). However, it imposes some
overhead on every compilation, which may not be desirable. For this reason,
while running the register allocator with the checker is a supported option in
Cranelift, it is not the default.&lt;&#x2F;p&gt;
&lt;p&gt;The second mode is one in which we apply the checker to a &lt;em&gt;fuzzing&lt;&#x2F;em&gt; workflow,
and is the approach we have generally preferred (we have a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;blob&#x2F;main&#x2F;fuzz&#x2F;fuzz_targets&#x2F;bt.rs&quot;&gt;fuzz
target&lt;&#x2F;a&gt;
in regalloc.rs that generates arbitrary input programs and runs the
checker on each one; and we are &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google&#x2F;oss-fuzz&#x2F;blob&#x2F;4a6021021043bddfa017df7d0aea26ad76edbba0&#x2F;projects&#x2F;wasmtime&#x2F;build.sh#L59&quot;&gt;running this
continuously&lt;&#x2F;a&gt;
as part of Wasmtime&#x27;s membership in Google&#x27;s
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;google&#x2F;oss-fuzz&#x2F;&quot;&gt;oss-fuzz&lt;&#x2F;a&gt; continuous-fuzzing
initiative). In this mode, we use the checker as an
application-specific oracle for a fuzzing engine: as the fuzzing engine
generates random programs (test cases), we run the register allocator over
these programs, run the checker on the result, and tell the engine whether the
register allocator passed or failed.  The fuzzer will flag any failing test
cases for a human developer to debug. If the fuzzer runs for a long time
without finding any issues, we can then have more confidence that the register
allocator is correct, even without running the checker; and the longer the
fuzzer runs, the greater our confidence becomes. The application-specific
oracle significantly improves on more generic fuzzer feedback mechanisms, such
as program crashes or incorrect output: a register-allocator bug may not
immediately manifest in incorrect execution, or when it does, the resulting
crash may have no obvious connection to the actual mis-allocated register. The
checker is able to point to a specific register use at a specific instruction
and say &quot;this register is wrong&quot;. Such a result makes for much smoother
debugging!&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s now walk through how we build the &quot;checker&quot; whose goal is to
verify a particular register allocation is correct. We will come at
the solution in stages, first reasoning about the easiest case --
straight-line code -- and then introducing control flow. At the end,
we&#x27;ll have a simple algorithm that runs in linear time (relative to
code size) and whose simplicity allows us to be reasonably confident
in its guarantees.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;symbolic-equivalence-and-abstract-interpretation&quot;&gt;Symbolic Equivalence and Abstract Interpretation&lt;&#x2F;h3&gt;
&lt;p&gt;Recall that we described above a sort of symbolic interpretation of
execution: one can reason about CPU registers containing &quot;symbolic&quot;
values, where each symbol represents a virtual register in the
original code. For example, we can take the code&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov v0, 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov v1, 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add v2, v0, v1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return v2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and a register-allocated form of that code&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov r0, 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov r1, 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add r0, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;and &lt;em&gt;somehow&lt;&#x2F;em&gt; find a set of substitutions that makes them equivalent:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov r0, 1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      [ r0 = v0 ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov r1, 2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      [ r1 = v1 ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;add r0, r0, r1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      [ r0 = v2 ]&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;return r0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But how do we solve for these substitutions? Recall that above we
hinted at a form of execution that operates on symbols rather than
values. We can simply take the semantics of the original instruction
set, and reformulate it to operate on symbolic values instead, and
then step through the code to find a representation of &lt;em&gt;all
executions&lt;&#x2F;em&gt; at once. This is called symbolic execution, and
with some enhancements described below, is the basis of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Abstract_interpretation&quot;&gt;abstract
interpretation&lt;&#x2F;a&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#4&quot;&gt;4&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;.
It is a very powerful technique!&lt;&#x2F;p&gt;
&lt;p&gt;What are the semantics of the instruction set that are relevant here?
It turns out, because the register allocator does not modify any of
the program&#x27;s original instructions&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#5&quot;&gt;5&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;, we can understand each
instruction as &lt;em&gt;mostly&lt;&#x2F;em&gt; an arbitrary, opaque operator. The only
important pieces of information are which registers it &lt;em&gt;reads&lt;&#x2F;em&gt; (before
its operation) and which it &lt;em&gt;writes&lt;&#x2F;em&gt; (after its operation).&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#6&quot;&gt;6&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;However, to verify the output of the register allocator when it
&lt;em&gt;spills&lt;&#x2F;em&gt; values, and when it &lt;em&gt;moves&lt;&#x2F;em&gt; values between registers, we
need special knowledge of spills, reloads, and moves. Hence,
we can reduce the input program to a sort of minimal ISA that captures
only what is important for symbolic reasoning (the real definition is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;blob&#x2F;109455ce4cea07a6e8d87e06d200c1318605c0ea&#x2F;lib&#x2F;src&#x2F;checker.rs#L393-L430&quot;&gt;here&lt;&#x2F;a&gt;;
we simplify a bit for this post):&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Spill &amp;lt;spillslot&amp;gt;, &amp;lt;CPU register&amp;gt;&lt;&#x2F;code&gt;: copy data (symbol representing
virtual register) from a register to a spill slot.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Reload &amp;lt;CPU register&amp;gt;, &amp;lt;spillslot&amp;gt;&lt;&#x2F;code&gt;: copy data from a spill slot to
a register.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Move &amp;lt;CPU register&amp;gt;, &amp;lt;CPU register&amp;gt;&lt;&#x2F;code&gt;: move data from one CPU
register to another (N.B.: &lt;em&gt;only&lt;&#x2F;em&gt; regalloc-inserted moves are
recognized as a &lt;code&gt;Move&lt;&#x2F;code&gt;, not moves in the original input program.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;Op read:&amp;lt;CPU register list&amp;gt;, read_orig:&amp;lt;virtual register list&amp;gt; write:&amp;lt;CPU register list&amp;gt; write_orig:&amp;lt;virtual register list&amp;gt;&lt;&#x2F;code&gt;: some
arbitrary operation that reads some registers and writes some other
registers.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The last instruction is the most interesting: notice that it carries
the &lt;em&gt;original&lt;&#x2F;em&gt; virtual registers as well as the
&lt;em&gt;post-register-allocation CPU registers&lt;&#x2F;em&gt; for the instruction. The need
for this will become clearer below, but the intuition is that we
need to see &lt;em&gt;both&lt;&#x2F;em&gt; in order to establish the &lt;em&gt;correspondence&lt;&#x2F;em&gt; between
the two.&lt;&#x2F;p&gt;
&lt;p&gt;We can produce the above instructions while the register allocator is
scanning over the code and editing it; that part is a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;blob&#x2F;d1956e08b5a3d6759bccf067c9739cd1a8c23be9&#x2F;lib&#x2F;src&#x2F;checker.rs#L16-L42&quot;&gt;straightforward
translation&lt;&#x2F;a&gt;. Once
we have the &lt;em&gt;abstracted&lt;&#x2F;em&gt; program, we can &quot;execute&quot; it over the domain
of symbols. How do we do this? With the following rules:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;We maintain some &lt;em&gt;state&lt;&#x2F;em&gt;, just as a real CPU does: for each CPU
register, and for each location in the stack frame, we track a
&lt;em&gt;symbol&lt;&#x2F;em&gt; (rather than an integer value). This symbol can be a
virtual-register name, if we know that the storage location
currently contains that register&#x27;s value. It can also be &lt;code&gt;Unknown&lt;&#x2F;code&gt;,
if the checker doesn&#x27;t know, or &lt;code&gt;Conflicted&lt;&#x2F;code&gt;, if the value could be
one of several virtual registers. (The difference between the latter
two will become clear when we discuss control-flow below. For now
it&#x27;s enough to see that we abstract the state to: either we know the
slot contains a program value, symbolically, or we know nothing.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;When we see a &lt;code&gt;Spill&lt;&#x2F;code&gt;, &lt;code&gt;Reload&lt;&#x2F;code&gt;, or &lt;code&gt;Move&lt;&#x2F;code&gt;, we copy the symbolic
state from the source location (register or spill slot) to the
destination location. In other words, we know that these
instructions always move the integer value of a register or memory
word, whatever it may be; so if we have knowledge about the source
location, symbolically for all possible executions, then we can
extend that knowledge to the destination as well.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;When we see an &lt;code&gt;Op&lt;&#x2F;code&gt;, we do some checks and then some updates:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;For each &lt;em&gt;read&lt;&#x2F;em&gt; (input) register, we examine the symbolic value
stored in the given CPU register (post-allocation location). If
that symbol matches the virtual register that the original
instruction used, then the allocator has properly conveyed the
virtual register&#x27;s value to its use here, and thus the allocation
is &lt;em&gt;correct&lt;&#x2F;em&gt; (preserves program dataflow). If not, we can signal a
checker error, and look for the bug in our register allocator. We
know &lt;em&gt;for sure&lt;&#x2F;em&gt; it must be a bug (i.e., there are no false
positives), because we only track a symbol for a storage location
when we have proven (for all executions!) that that storage location
must contain that virtual register&#x27;s value.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;For each &lt;em&gt;write&lt;&#x2F;em&gt; (output) register, we set the symbolic value
stored in the given CPU register to be the given (pre-allocation)
virtual register. In other words, each write &lt;em&gt;produces&lt;&#x2F;em&gt; a
symbol. This symbol then flows through the program, moving via
spills&#x2F;reloads&#x2F;moves, until it reaches consumers.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
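&lt;p&gt;The per-instruction rules above can be sketched in a few lines. This is
purely an illustrative sketch in Python (the real checker is written in Rust,
and the instruction encoding and names here are hypothetical, not
regalloc.rs&#x27;s actual API):&lt;&#x2F;p&gt;

```python
# Straight-line transfer rules of the checker, sketched in Python
# (hypothetical encoding; the real checker is written in Rust).
UNKNOWN = "unknown"

def step(state, inst):
    """Apply one instruction to `state`, a dict mapping storage
    locations (real registers and spill slots) to symbols."""
    kind = inst["kind"]
    if kind in ("spill", "reload", "move"):
        # Pure data movement: copy the source location's symbol.
        state[inst["dst"]] = state.get(inst["src"], UNKNOWN)
    elif kind == "op":
        # Reads: the allocated register must hold the symbol of the
        # virtual register the original instruction named.
        for vreg, rreg in inst["reads"]:
            if state.get(rreg, UNKNOWN) != vreg:
                raise ValueError("checker error: %s does not hold %s" % (rreg, vreg))
        # Writes: each definition produces a fresh symbol that then
        # flows through spills, reloads, and moves to its uses.
        for vreg, rreg in inst["writes"]:
            state[rreg] = vreg
    return state
```

&lt;p&gt;Running this over a def, a spill, and a reload propagates the defining
virtual register&#x27;s symbol to the reloaded register, so a later use check
passes; a mismatched read raises an error immediately.&lt;&#x2F;p&gt;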
&lt;p&gt;And that&#x27;s it! We can prove in a fairly straightforward way that this
is exactly correct -- produces no false positives or false negatives
-- for straight-line code (code with no jumps). We can do this by
induction: if the symbolic state is correct before an instruction,
then the above rules just encode the data movement that the concrete
program performs, and the symbolic state will be updated in the same
way, so the symbolic state after the instruction is also correct.&lt;&#x2F;p&gt;
&lt;p&gt;Note that this is &lt;em&gt;linear&lt;&#x2F;em&gt; as well -- so it&#x27;s very fast, with a single
scan over straight-line code. This is possible because we have &lt;em&gt;help&lt;&#x2F;em&gt;
from the register allocator: we know about spills, reloads, and
register allocator-inserted moves, and we have pre- and
post-allocation registers for all other instructions. Consider what we
would have to do if we did not know about these, but only saw machine
instructions. In that case, any load, store or move instruction could
have come from the allocator or from the original program. We would
have nothing but a graph of operators with connectivity between them,
and we would have to solve a &lt;em&gt;graph isomorphism&lt;&#x2F;em&gt; problem. That is much
harder, and much slower!&lt;&#x2F;p&gt;
&lt;p&gt;So are we done? Not quite: we have only considered straight-line
code. What happens when we encounter a jump?&lt;&#x2F;p&gt;
&lt;h3 id=&quot;control-flow-joins-lattices-and-iterative-dataflow-analysis&quot;&gt;Control-Flow Joins, Lattices, and Iterative Dataflow Analysis&lt;&#x2F;h3&gt;
&lt;p&gt;Control-flow makes analysis interesting because it allows for
&lt;em&gt;multiple possibilities&lt;&#x2F;em&gt;. Consider a simple program with an
if-then-else pattern (a &quot;control-flow diamond&quot;, as it is sometimes
called, due to its shape):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2021-03-10-diamond-predicate-web.svg&quot; alt=&quot;Figure: A control-flow diamond and symbolic analysis&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s say that a symbolic analysis decides that on the left branch,
&lt;code&gt;r0&lt;&#x2F;code&gt; has symbolic state &lt;code&gt;A&lt;&#x2F;code&gt;, and on the right branch, it has symbolic
state &lt;code&gt;B&lt;&#x2F;code&gt;. What state does it have in the lower block, after the two
paths re-join?&lt;&#x2F;p&gt;
&lt;p&gt;We can give a precise answer if we are allowed to &quot;predicate&quot;, or make
the answer conditional on some other program state. For example, if we
knew that the if-condition were represented by some symbol &lt;code&gt;C&lt;&#x2F;code&gt; that
has a boolean type, we could invent an abstract expression language
and then write &lt;code&gt;if C { A } else { B }&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;However, this quickly becomes untenable. We will find that programs
with loops lead to &lt;em&gt;unbounded&lt;&#x2F;em&gt; symbolic expressions. (To see this,
consider that a symbolic representation can have a size larger than
its inputs. Any cyclic data dependency around a loop could thus
generate an infinitely-large symbolic representation.) Even with only
acyclic control flow, path-sensitive symbolic expressions can grow
exponentially with program size: consider that a program with &lt;code&gt;N&lt;&#x2F;code&gt;
basic blocks and no loops can have &lt;code&gt;O(2^N)&lt;&#x2F;code&gt; paths through those
blocks, and fully precise symbolic expressions would need to capture
the effects of each of those paths.&lt;&#x2F;p&gt;
&lt;p&gt;We thus need some way to &lt;em&gt;approximate&lt;&#x2F;em&gt;. Note that an abstract
interpretation of a program need not precisely capture all of the
program&#x27;s behavior losslessly. For example, we might perform a simple
abstract interpretation analysis that only tracks possible numeric
signs (positive, negative, unknown) for integer variables. So it is
always fine to &quot;summarize&quot; and drop detail to remain tractable. Let us
thus consider how we might &quot;merge&quot; state when multiple possibilities
exist.&lt;&#x2F;p&gt;
&lt;p&gt;It turns out that there is a very nice mathematical object that
captures the notion of &quot;merging&quot; in a way that is very useful: the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Lattice_(order)&quot;&gt;lattice&lt;&#x2F;a&gt;.  A lattice
consists of a set of elements and a &lt;em&gt;partial order&lt;&#x2F;em&gt; between them,
together with a least element &quot;bottom&quot; and a greatest element &quot;top&quot;,
an operator called &quot;meet&quot; that finds the &quot;greatest lower bound&quot; for
any two elements (the largest element that is less than or equal to its two
operands) and a &quot;join&quot; that finds the &quot;least upper bound&quot; (the dual of
the above).&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;Hasse_diagram_of_powerset_of_3.svg&quot; alt=&quot;Figure: A lattice&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;(Figure credit:
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;commons.wikimedia.org&#x2F;wiki&#x2F;File:Hasse_diagram_of_powerset_of_3.svg&quot;&gt;Wikimedia&lt;&#x2F;a&gt;,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;creativecommons.org&#x2F;licenses&#x2F;by-sa&#x2F;3.0&#x2F;deed.en&quot;&gt;CC BY-SA
3.0&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;An extremely useful property of lattices is that their merging operations
of meet and join are &lt;em&gt;commutative, associative and idempotent&lt;&#x2F;em&gt;. This is a
formal way of saying that the result only depends on the set of elements
&quot;thrown into the mix&quot;, in any order and with any repetition. In other
words, the meet of many elements is a function only of the set of elements,
not of the order in which we process them.&lt;&#x2F;p&gt;
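&lt;p&gt;The powerset lattice in the figure makes this concrete: the elements are
subsets of {1, 2, 3} ordered by inclusion, with intersection as meet and union
as join. A tiny illustrative check of the properties just described:&lt;&#x2F;p&gt;

```python
# The powerset lattice of {1, 2, 3} from the figure: ordering is set
# inclusion, meet is intersection, join is union.
a, b, c = {1, 2}, {2, 3}, {1, 3}
meet = lambda x, y: x.intersection(y)
join = lambda x, y: x.union(y)

# The merge depends only on the set of inputs, not order or repetition:
assert meet(a, b) == meet(b, a)                    # commutative
assert meet(meet(a, b), c) == meet(a, meet(b, c))  # associative
assert meet(a, a) == a                             # idempotent
```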
&lt;p&gt;How is this useful? If we define particular analysis states -- and as
a reminder, in our specific case, these are maps from CPU registers
and spillslots to symbolic virtual registers -- to be lattice
elements, and define a &quot;meet function&quot; that somehow merges the states
-- then we can use this merging behavior to implement a sort of
program analysis over all programs, &lt;em&gt;including&lt;&#x2F;em&gt; those with loops,
without unbounded analysis growth! This is called the
&quot;meet-over-all-paths&quot; solution and is a standard way that compilers
perform &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Data-flow_analysis&quot;&gt;dataflow
analysis&lt;&#x2F;a&gt; today.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#7&quot;&gt;7&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;To understand how a lattice describes &quot;merging&quot; in a program analysis in a
useful way, one can see the lattice ordering relation (the arrows in the
figure above) as denoting that one state is more or less refined (contains
more or less knowledge) than another. One starts at the &quot;greatest&quot; or
&quot;top&quot; element: anything could be true; we know nothing. We then move to
progressively more refined states.  One analysis state is ordered &quot;less
than&quot; another if it captures all the constraints we have learned in the
other state, plus some new ones. The &quot;meet&quot; operator, which computes the
greatest lower bound, will thus give us an analysis state that captures all
of the knowledge in both inputs, and no more.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#8&quot;&gt;8&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The general approach to performing an analysis on an arbitrary CFG
is as follows:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;We define our analysis state as a &lt;em&gt;lattice&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We track the current analysis state at each &lt;em&gt;program point&lt;&#x2F;em&gt;, or
point between instructions. Initially, the state at every program
point is the &quot;top&quot; lattice element; as values meet, they move
&quot;down&quot; the lattice, toward the &quot;bottom&quot; element.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We process the effect of each instruction, computing the state at
its post-program-point from its pre-program-point.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;When analysis state reaches a control-flow edge, we propagate the
state across the edge, and &lt;em&gt;meet&lt;&#x2F;em&gt; it with the incoming state from
all other edges. This may then lead us to recompute states in the
destination block.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We run a &quot;fixpoint&quot; loop, processing updates as analysis states at
block entries change, until no more changes occur.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
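&lt;p&gt;The five steps above amount to a standard worklist fixpoint loop. Here is
a minimal forward-analysis sketch (hypothetical names, not Cranelift&#x27;s or
regalloc.rs&#x27;s actual code), exercised on a control-flow diamond:&lt;&#x2F;p&gt;

```python
# A generic forward dataflow fixpoint over a CFG (an illustrative
# sketch). `transfer` applies a block's instruction semantics;
# `meet` merges states at control-flow joins.
def fixpoint(succs, entry, entry_state, transfer, meet):
    state_in = {entry: entry_state}
    worklist = [entry]
    while worklist:
        b = worklist.pop()
        out = transfer(b, state_in[b])
        for s in succs[b]:
            # Meet with any state already seen at this block entry.
            new = out if s not in state_in else meet(state_in[s], out)
            if state_in.get(s) != new:
                state_in[s] = new
                worklist.append(s)
    return state_in

# A control-flow diamond: A branches to B and C, which rejoin at D.
succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
transfer = lambda b, s: {"B": "v1", "C": "v2"}.get(b, s)
meet = lambda x, y: x if x == y else "conflicted"
result = fixpoint(succs, "A", "unknown", transfer, meet)
# The two distinct definitions meet at D as "conflicted".
```

&lt;p&gt;Termination relies on the state lattice having no infinite descending
chains: each block&#x27;s entry state can only move &quot;down&quot; the lattice a bounded
number of times, so the worklist eventually drains.&lt;&#x2F;p&gt;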
&lt;p&gt;In this way, we find a solution to the dataflow problem that satisfies
all of the instruction semantics for &lt;em&gt;any&lt;&#x2F;em&gt; path through the
program. It may not be fully precise (i.e., it may not answer every
question) -- because it is often impossible to capture a fully precise
answer for executions that include loops, and impractical for programs
with significant control-flow -- but it is &lt;em&gt;sound&lt;&#x2F;em&gt;, in the sense that
any claims we make from the analysis result will be &lt;em&gt;correct&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-register-checker-as-a-dataflow-analysis-problem&quot;&gt;A Register Checker as a Dataflow Analysis Problem&lt;&#x2F;h2&gt;
&lt;p&gt;We now have all of the pieces that we need in order to check the
register-allocator output for any program. We saw above that we could
model the machine state symbolically for any straight-line code, which
allows us to detect register allocator errors exactly (no false
negatives and no false positives) as long as there is no control
flow. We then discussed the usual static analysis approach to control
flow. How can we combine the two?&lt;&#x2F;p&gt;
&lt;p&gt;The answer is that we define a &lt;em&gt;lattice&lt;&#x2F;em&gt; of &lt;em&gt;symbolic register state&lt;&#x2F;em&gt;,
and then walk through the same per-instruction semantics as above in a
fixpoint dataflow analysis. Put simply, for each storage location (CPU
register or spill slot), we have a lattice:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-12-23-lattice.svg&quot; alt=&quot;Figure: Abstract value lattice&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The &quot;unknown&quot; state is the &quot;top&quot; lattice value. This means simply that
we don&#x27;t know what is in the register because the analysis hasn&#x27;t
converged yet (or no write has occurred).&lt;&#x2F;p&gt;
&lt;p&gt;The &quot;conflicted&quot; state is the &quot;bottom&quot; lattice value. This means that
two or more symbolic definitions have merged. Rather than try to
represent a superposition of them with some sort of predication or
loop summary, we simply give up and move to a state that indicates
&quot;bad value&quot;. This is not a checker error &lt;em&gt;as long as it is never
used&lt;&#x2F;em&gt;, and it can be overwritten with a good value at any time; but if
the value is used as an instruction source, then we flag an error.&lt;&#x2F;p&gt;
&lt;p&gt;The meet function, then, is &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;blob&#x2F;109455ce4cea07a6e8d87e06d200c1318605c0ea&#x2F;lib&#x2F;src&#x2F;checker.rs#L144-L155&quot;&gt;very
simple&lt;&#x2F;a&gt;:
two registers meet to &quot;conflicted&quot; unless they are the same register;
&quot;unknown&quot; meets with anything to produce that anything; and
&quot;conflicted&quot; is contagious, in the sense that meeting any other state
with &quot;conflicted&quot; remains &quot;conflicted&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;Note that we said above that our analysis state is a &lt;em&gt;map&lt;&#x2F;em&gt; from
registers and spill slots to symbolic states; not just a single
symbolic state. So our lattice is actually a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Product_order&quot;&gt;product&lt;&#x2F;a&gt; of each
individual storage location&#x27;s state, and we meet symbols
piecewise. (The resulting map contains entries only for keys that
appear in all meet-inputs; i.e. we take the intersection of the
domains.)&lt;&#x2F;p&gt;
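&lt;p&gt;Both levels of the meet -- the per-location symbol lattice and the
piecewise product over storage locations -- can be sketched as follows
(illustrative Python; the real implementation is the Rust linked above):&lt;&#x2F;p&gt;

```python
# The checker's two-level meet, sketched with symbols as plain strings:
# "unknown", "conflicted", or a virtual-register name.
def meet_symbol(a, b):
    if a == "unknown":
        return b
    if b == "unknown":
        return a
    # Equal registers survive; any disagreement (or an existing
    # "conflicted" on either side) collapses to "conflicted".
    return a if a == b else "conflicted"

def meet_state(m1, m2):
    # Product lattice: meet piecewise, keeping only storage locations
    # present in both maps (intersection of the domains).
    return {loc: meet_symbol(m1[loc], m2[loc]) for loc in m1 if loc in m2}
```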
&lt;p&gt;With the analysis state and its meet-function defined, we run a
dataflow analysis loop, allow it to converge, and look for errors; and
we&#x27;re done!&lt;&#x2F;p&gt;
&lt;p&gt;And that&#x27;s it!&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#9&quot;&gt;9&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;effectiveness-can-it-find-bugs&quot;&gt;Effectiveness: Can it Find Bugs?&lt;&#x2F;h2&gt;
&lt;p&gt;The short answer is that yes, it can find some &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;regalloc.rs&#x2F;pull&#x2F;86&quot;&gt;pretty subtle
bugs&lt;&#x2F;a&gt;!&lt;&#x2F;p&gt;
&lt;p&gt;The benefit of the regalloc.rs checker is twofold:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;It has found real bugs. In the above example, there was a conceptual
error in the reference-types (precise GC rooting) support: in
certain cases where a spillslot was allocated for a pointer-typed
value but never used, it could be added to the stackmap (list of
pointer-typed spillslots) provided to the GC. This bug needs a
specific set of circumstances to happen: we have to have enough
register pressure that we decide to allocate a spillslot for a
virtual register, but then hit the (rare) code-path in which we
don&#x27;t actually need to do the spill because a register became
available. We never hit this in our other, hand-written tests of GC
(Wasm reference types), despite some
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;jit-test&#x2F;tests&#x2F;wasm&#x2F;ref-types&#x2F;stackmaps1.js&quot;&gt;pretty&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;jit-test&#x2F;tests&#x2F;wasm&#x2F;ref-types&#x2F;stackmaps2.js&quot;&gt;extensive&lt;&#x2F;a&gt;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;searchfox.org&#x2F;mozilla-central&#x2F;source&#x2F;js&#x2F;src&#x2F;jit-test&#x2F;tests&#x2F;wasm&#x2F;ref-types&#x2F;stackmaps3.js&quot;&gt;tests&lt;&#x2F;a&gt;
at least in SpiderMonkey&#x27;s WebAssembly test suite driving the
Cranelift backend. The fuzzer was able to drive toward full
coverage, hit this rare code-path, and then allow the checker to
discover the error.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;It serves as a gold-standard test while developing &lt;em&gt;new&lt;&#x2F;em&gt; register
allocators. Feedback while developing the linear-scan allocator
(whose reference-type &#x2F; precise GC rooting support came a bit later
than the backtracking allocator&#x27;s) indicated that the checker found
many real issues and allowed for faster and more confident progress.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h2 id=&quot;related-work&quot;&gt;Related Work&lt;&#x2F;h2&gt;
&lt;p&gt;It&#x27;s surprisingly difficult to find prior work on checkers that
validate individual runs of a register allocator.  There are several
fully-verified compilers in existence;
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;compcert.org&#x2F;&quot;&gt;CompCert&lt;&#x2F;a&gt; and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cakeml.org&#x2F;&quot;&gt;CakeML&lt;&#x2F;a&gt;
are two that can compile realistic languages (C and ML,
respectively). These compilers have fully verified register allocators
in the sense that the algorithm itself is proven correct; there is no
need to run a checker on an individual compilation result. The
engineering effort to achieve this is much higher than to write a
checker, however (in the latter case, ~700 lines of Rust).&lt;&#x2F;p&gt;
&lt;p&gt;CakeML&#x27;s approach to proving the register allocator correct is
described by Tan et al. in &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;kar.kent.ac.uk&#x2F;71304&#x2F;1&#x2F;paper.pdf&quot;&gt;The Verified CakeML Compiler
Backend&lt;&#x2F;a&gt;&quot; (J. Func Prog 29,
2019). They appear to have nicely factored the problem so that the
compilation is correct as long as a valid graph coloring or
&quot;permutation&quot; (mapping of program values to storage slots) is
provided. This allows reasoning about the core issue (dataflow
equivalence before and after allocation) separately from the details
of the allocator (graph coloring algorithm).&lt;&#x2F;p&gt;
&lt;p&gt;Proof-producing compilers exist as well: for example,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;abs&#x2F;10.1145&#x2F;3192366.3192377&quot;&gt;Crellvm&lt;&#x2F;a&gt; is a
recent extension of several LLVM passes that generates a
(machine-checkable) correctness proof alongside the transformed
program. This approach is conceptually at the same level as our
register-allocator checker: it results in the validation of a single
compiler run, but is much easier to build than a full a-priori
correctness proof. This effort does not yet appear to address register
allocation, however.&lt;&#x2F;p&gt;
&lt;p&gt;Rideau and Leroy in &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;xavierleroy.org&#x2F;publi&#x2F;validation-regalloc.pdf&quot;&gt;Validating Register Allocation and
Spilling&lt;&#x2F;a&gt;&quot; (CC
2010) describe a similar taxonomy to ours, separating &quot;once and for
all&quot; correctness proofs from &quot;translation validation checks&quot; and
providing the latter. Their validator, however, defines a fairly
complex transfer function that builds a set of equality constraints
that must be solved. It appears that the validator does not leverage
hints from the allocator, specifically w.r.t. spills, reloads and
inserted moves as distinguished from stores, loads and moves in the
original program; without these hints, a much more general and complex
dataflow-equivalence scheme is needed.&lt;&#x2F;p&gt;
&lt;p&gt;Nandivada et al. in &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;homepages.dcc.ufmg.br&#x2F;~fernando&#x2F;publications&#x2F;papers&#x2F;SAS07.pdf&quot;&gt;A Framework for End-to-End Verification and Evaluation of
Register
Allocators&lt;&#x2F;a&gt;&quot;
(SAS 2007) describe a system very similar to our checker in which physical
register contents (as virtual-register or &quot;pseudo&quot; symbols) are encoded into a
post-regalloc IR that is then typechecked. Their typechecker can uncover the
same sorts of regalloc errors that our checker can. Thus, their approach is
largely equivalent to ours; the main difference is that we do not encode the
problem as typechecking on a dedicated IR but rather as a standalone static
analysis.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;This post concludes the three-post series
(&lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&#x2F;&quot;&gt;one&lt;&#x2F;a&gt;,
&lt;a href=&quot;&#x2F;blog&#x2F;2021&#x2F;01&#x2F;22&#x2F;cranelift-isel-2&#x2F;&quot;&gt;two&lt;&#x2F;a&gt;) describing the work we&#x27;ve
done to develop all the pieces of Cranelift&#x27;s new backend over the
past year!  It has been a very interesting and educational ride for me
personally; I discovered an entirely new world of interesting problems
to solve in the compiler backend, as distinct from the &quot;middle end&quot;
(IR-level optimizations) that is more commonly taught and
studied. Additionally, the focus on &lt;em&gt;fast&lt;&#x2F;em&gt; compilation is an
interesting twist, and one that I believe is not studied enough. It is
easy to justify higher analysis precision and better generated code
through ever-more-complex techniques; the benefit to be found in
design tradeoffs for fast compilation is more subjective and more
dependent on workload.&lt;&#x2F;p&gt;
&lt;p&gt;It is my hope that these writeups have illuminated some of the
thinking that went into our design decisions. Our work is by no means
done, however! The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;rfcs&#x2F;pull&#x2F;8&quot;&gt;roadmap for Cranelift work in
2021&lt;&#x2F;a&gt; lists a number
of ideas that we&#x27;ve discussed to achieve higher compiler performance
and better code quality. I am excited to explore these more in the
coming year; they may even result in more blog posts. Until then,
happy compiling!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;For discussions about this post, please feel free to join us on our Zulip
instance in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;#narrow&#x2F;stream&#x2F;217117-cranelift&#x2F;topic&#x2F;blog.20post.20on.20regalloc.20checker&quot;&gt;this
thread&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;u&#x2F;po8&quot;&gt;&#x2F;u&#x2F;po8&lt;&#x2F;a&gt; on Reddit for &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;rust&#x2F;comments&#x2F;m5w3y4&#x2F;cranelift_part_3_correctness_in_register&#x2F;gr2w3i5&#x2F;&quot;&gt;several
suggestions&lt;&#x2F;a&gt;
which I have incorporated. Thanks also to bjorn3 for several suggestions.
Finally, thanks to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;homepages.dcc.ufmg.br&#x2F;~fernando&#x2F;&quot;&gt;Fernando M Q
Pereira&lt;&#x2F;a&gt; for bringing my attention to
his
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;homepages.dcc.ufmg.br&#x2F;~fernando&#x2F;publications&#x2F;papers&#x2F;SAS07.pdf&quot;&gt;paper&lt;&#x2F;a&gt;
in SAS 2007 that proposes a very similar idea, which I&#x27;ve added to the related
work section. Any and all feedback is welcome!&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;Why do CPUs have a limited number of registers? The bound is
mostly due to &lt;em&gt;ISA encoding limitations&lt;&#x2F;em&gt;: there are only so many
bits in an instruction to name a particular register source or
destination. When the CPU designer chooses how many registers to
define, providing more will improve performance (up to a point)
because the CPU can hold more state at one time, but will also
impose an increasing cost in code size and CPU
complexity.&lt;&#x2F;p&gt;
&lt;p&gt;Computer architect&#x27;s tangent: due to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Register_renaming&quot;&gt;register
renaming&lt;&#x2F;a&gt;, a
modern high-performance out-of-order CPU will have many more
&lt;em&gt;physical&lt;&#x2F;em&gt; registers, with architectural register names mapped
to physical registers at any given program point by the
register-renaming hardware (in common parlance, the register
allocation table or RAT), but the ISA encoding restrictions
limit the number that have architectural names at any time. The
existence of register renaming sometimes causes confusion in
discussions of register allocation -- why rename onto so few
registers when we have so many? -- well, we could do much better
if we had more bits to refer to them all! Architectural
standardization is another reason for this: we would not want to
recompile code every time the PRF (physical register file)
became larger. Simpler to say &quot;x86-64 has 16 integer registers&quot;
and be done with it.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#10&quot;&gt;10&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;We don&#x27;t know if exponential time is the &lt;em&gt;best&lt;&#x2F;em&gt; we can do in the
worst case, though most computer scientists suspect so. This is
the famous &lt;code&gt;P=NP&lt;&#x2F;code&gt; problem, and if you can solve it, you &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Millennium_Prize_Problems#P_versus_NP&quot;&gt;win a
million
dollars&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;A slight correction from &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.reddit.com&#x2F;r&#x2F;rust&#x2F;comments&#x2F;m5w3y4&#x2F;cranelift_part_3_correctness_in_register&#x2F;gr2w3i5&#x2F;&quot;&gt;&#x2F;u&#x2F;po8&#x27;s
comment&lt;&#x2F;a&gt;:
register allocation on &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Structured_programming&quot;&gt;structured
programs&lt;&#x2F;a&gt; &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;link.springer.com&#x2F;chapter&#x2F;10.1007&#x2F;978-3-642-37051-9_1&quot;&gt;can be
done&lt;&#x2F;a&gt; in
polynomial time, i.e., better than an exponential brute-force search.
However, the problem remains quite complex!&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;4&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;4&lt;&#x2F;sup&gt;
&lt;p&gt;Abstract interpretation was introduced by Radhia and Patrick
Cousot in their seminal 1977 POPL paper &quot;Abstract
interpretation: A Unified Lattice Model for Static Analysis of
Programs by Construction or Approximation of Fixpoints&quot;
(&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.di.ens.fr&#x2F;~cousot&#x2F;publications.www&#x2F;CousotCousot-POPL-77-ACM-p238--252-1977.pdf&quot;&gt;pdf&lt;&#x2F;a&gt;).&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;5&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;5&lt;&#x2F;sup&gt;
&lt;p&gt;Except for move elimination, but we can ignore that for now --
it is possible to adapt the abstract interpretation rules to
account for it later.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;6&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;6&lt;&#x2F;sup&gt;
&lt;p&gt;In regalloc.rs we also have a notion of an instruction that
&quot;modifies&quot; a register, which is like a combined read and write
except that the value must be mapped to the &lt;em&gt;same&lt;&#x2F;em&gt; register for
both. This isn&#x27;t fundamental to the point we&#x27;re illustrating so
we&#x27;ll skip over it for now.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;7&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;7&lt;&#x2F;sup&gt;
&lt;p&gt;This dataflow analysis approach was proposed by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Gary_Kildall&quot;&gt;Gary
Kildall&lt;&#x2F;a&gt; in the POPL
1973 paper &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;dl.acm.org&#x2F;doi&#x2F;10.1145&#x2F;512927.512945&quot;&gt;A unified approach to global program
optimization&lt;&#x2F;a&gt;&quot;. (He
is perhaps better-known for writing the microcomputer OS
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;CP&#x2F;M&quot;&gt;CP&#x2F;M&lt;&#x2F;a&gt;, a predecessor to
DOS.) Kildall&#x27;s dataflow analysis builds on the control-flow
graph ideas invented several years prior by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Frances_Allen&quot;&gt;Fran
Allen&lt;&#x2F;a&gt;; in her 1970
paper &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.cs.columbia.edu&#x2F;~suman&#x2F;secure_sw_devel&#x2F;p1-allen.pdf&quot;&gt;Control Flow
Analysis&lt;&#x2F;a&gt;,
she proposes interval-based dataflow analysis, which is the
other main approach known and used today.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;8&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;8&lt;&#x2F;sup&gt;
&lt;p&gt;Note that we have been somewhat vague here about directionality.
What does &quot;more constrained&quot; or &quot;more refined&quot; mean? There are actually
two directions an analysis may work, and these have to do with how it
handles imprecision. A &quot;may-analysis&quot;, or &quot;widening analysis&quot;, computes
what the program may do. It generally begins with an &quot;empty set&quot; of sorts
-- a variable has no possible values, a statement has no side-effects, a
register contains nothing -- and then uses a &lt;em&gt;union&lt;&#x2F;em&gt;-like meet operator
to aggregate all &lt;em&gt;possibilities&lt;&#x2F;em&gt;. The real program behavior will be some
subset of these possibilities. In contrast, a &quot;must-analysis&quot;, or
&quot;narrowing analysis&quot;, computes only what we know the program &lt;em&gt;must&lt;&#x2F;em&gt; do.
It generally begins with the &quot;universe set&quot; and then uses
&lt;em&gt;intersection&lt;&#x2F;em&gt;-like meet operators. The real program&#x27;s behavior is a
superset of this analysis&#x27;s description. We can&#x27;t have both, usually,
because an analysis cannot generally be fully precise.&lt;&#x2F;p&gt;
&lt;p&gt;By convention, we always start analysis values at &quot;top&quot; and use
&quot;meet&quot; to move down the lattice as the analysis converges, though
we could just as well start at &quot;bottom&quot; and move up with &quot;join&quot;,
since flipping the lattice&#x27;s order relation and swapping meet and
join produces another lattice.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;9&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;9&lt;&#x2F;sup&gt;
&lt;p&gt;Well, not quite, as you might have guessed. One significant
detail I&#x27;ve omitted is how we handle &lt;em&gt;reference types&lt;&#x2F;em&gt; and
&lt;em&gt;precise garbage collection&lt;&#x2F;em&gt;. Precise GC rooting entails tracking
a specific kind of &lt;em&gt;type information&lt;&#x2F;em&gt; for each register and
spillslot: specifically, whether each storage location contains a
&lt;em&gt;pointer&lt;&#x2F;em&gt; that the GC should observe when it performs a garbage
collection. It is important in many applications for this to be
&quot;precise&quot;, which means that we can only say that a register
contains a pointer if it &lt;em&gt;actually&lt;&#x2F;em&gt; does, and we &lt;em&gt;must&lt;&#x2F;em&gt; include
all registers that contain pointers. Precision is important
because the GC will assume any root pointer it traces points to a
valid object (so false positives are bad); and must know about
every pointer in case it is a moving GC and relocates an object
(so false negatives are bad).&lt;&#x2F;p&gt;
&lt;p&gt;In our particular variant of the problem, we need this information at
&lt;em&gt;safepoints&lt;&#x2F;em&gt;: these are points at which the GC could be invoked. (It
would be too expensive to plan for a GC invocation at every point in
the program.) Furthermore, we needed to support GCs that could only
trace pointers on the stack (hence, spillslots), not in registers. So
we needed to induce &lt;em&gt;additional spills&lt;&#x2F;em&gt; around safepoints to ensure
pointers were only live on the stack, not in registers.&lt;&#x2F;p&gt;
&lt;p&gt;To check this, we extended the abstract value lattice to note whether
each virtual register is a pointer-typed value or not. Then, at every
safepoint, we (i) ensure that every actual pointer-typed value in a
spillslot is listed in the stackmap provided to the GC, and (ii)
&lt;em&gt;clear&lt;&#x2F;em&gt; any other pointer-typed stack location not listed in the
stack map to an &lt;code&gt;Unknown&lt;&#x2F;code&gt; state. Why the latter? Because an actual
pointer-typed value in a stack slot might be &quot;dead&quot; (not used again),
and so is legal to omit from the stackmap; instead of immediately
flagging an error when one is excluded, we simply ensure that a
later use of it is invalid.&lt;&#x2F;p&gt;
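&lt;p&gt;The rule at each safepoint can be sketched roughly as follows (a Python
sketch with invented names and representation; this is not the actual checker
code):&lt;&#x2F;p&gt;

```python
# A rough sketch of the safepoint rule above. `slots` maps each
# spillslot to its abstract value: "ptr", "non_ptr", or "unknown".
def check_safepoint(slots, stackmap):
    for slot, value in slots.items():
        if value != "ptr":
            continue
        if slot in stackmap:
            # (i) A pointer listed in the stackmap: the GC will trace
            # (and possibly relocate) it, so the slot stays valid.
            continue
        # (ii) A pointer *not* listed may simply be dead; rather than
        # flag an error now, poison the slot so any later use of it
        # is flagged as invalid.
        slots[slot] = "unknown"
    return slots
```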
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;10&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;10&lt;&#x2F;sup&gt;
&lt;p&gt;Note that some computer architectures &lt;em&gt;do&lt;&#x2F;em&gt; task the compiler
with some form of register renaming. For example, the Intel
Itanium (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;IA-64&quot;&gt;IA-64&lt;&#x2F;a&gt;) had a
novel sort of &quot;rotating register reference&quot; feature for loops,
and trusted the compiler with managing a full 128 integer and
128 floating-point registers. Modern GPUs also have thousands
of &quot;registers&quot; managed by the compiler.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>Cranelift, Part 2: Compiler Efficiency, CFGs, and a Branch Peephole Optimizer</title>
        <published>2021-01-22T00:00:00+00:00</published>
        <updated>2021-01-22T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2021/01/22/cranelift-isel-2/"/>
        <id>https://cfallin.org/blog/2021/01/22/cranelift-isel-2/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2021/01/22/cranelift-isel-2/">&lt;p&gt;This post is the second in a three-part series about
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&quot;&gt;Cranelift&lt;&#x2F;a&gt;.
In the &lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&#x2F;&quot;&gt;first post&lt;&#x2F;a&gt;, I described
the context around Cranelift and our project to replace its backend
code-generation infrastructure, and detailed the instruction-selection
problem and how we solve it. The remaining two posts will be
deep-dives into some interesting engineering problems.&lt;&#x2F;p&gt;
&lt;p&gt;In this post, I want to dive into the &lt;em&gt;compiler performance&lt;&#x2F;em&gt; aspect of
our work more deeply. (In the next post we&#x27;ll explore correctness.)
There are many interesting aspects of compilation speed I could talk
about, but one particularly difficult problem is the handling of
&lt;em&gt;control flow&lt;&#x2F;em&gt;: how do we translate structured control flow at the
Wasm level into control-flow graphs at the IR level, and finally to
branches in a linear stream of instructions at the machine-code level?&lt;&#x2F;p&gt;
&lt;p&gt;Doing this translation efficiently requires careful attention to the
overall pass structure, with the largest wins coming when one can
completely eliminate a category of work. We&#x27;ll see this in how we
combine several passes in a traditional lowering design (critical-edge
splitting, block ordering, redundant-block elimination, branch
relaxation, branch target resolution) into &lt;em&gt;inline transforms&lt;&#x2F;em&gt; that
happen during other passes (lowering of the CLIF, or Cranelift IR,
into machine-specific IR; and later, binary emission).&lt;&#x2F;p&gt;
&lt;p&gt;This post basically describes the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;buffer.rs&quot;&gt;&lt;code&gt;MachBuffer&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
a &quot;smart machine-code buffer&quot; that knows about branches and edits them
on-the-fly as we emit them, and the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;blockorder.rs&quot;&gt;&lt;code&gt;BlockLoweringOrder&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which allows us to lower code in final basic-block order, with split
critical edges inserted implicitly, by traversing a never-materialized
implicit graph. The work was done mostly in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;1718&quot;&gt;Cranelift PR
#1718&lt;&#x2F;a&gt;, which
resulted in a ~10% compile-time improvement and a ~25%
compile+run-time improvement on a CPU-intensive benchmark (&lt;code&gt;bz2&lt;&#x2F;code&gt;).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;control-flow-graphs&quot;&gt;Control-Flow Graphs&lt;&#x2F;h2&gt;
&lt;p&gt;Before we discuss any of that, we need to review control-flow graphs
(CFGs)! The CFG is a fundamental data structure used in almost all
modern compilers. In brief, it represents how execution (i.e., program
control) may flow through instructions, using graph nodes to represent
linear sequences of instructions and graph edges to represent all
possible control-flow transfers at branch instructions.&lt;&#x2F;p&gt;
&lt;p&gt;At the end of the instruction selection process, which we learned
about in the &lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&#x2F;&quot;&gt;previous post&lt;&#x2F;a&gt;, we
have a function body lowered into VCode that consists of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Basic_block&quot;&gt;&lt;em&gt;basic
blocks&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;. A basic block is
a contiguous sequence of instructions that has no outbound branches
except at the end, and has no inbound branches except at the
beginning. In other words, it is &quot;straight-line&quot; code: execution
always starts at the top and proceeds to the end. An example
control-flow graph (CFG) consisting of four basic blocks is shown
below:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-10-08-cfg-web.svg&quot; alt=&quot;Figure: Control-flow graph with four basic blocks in a diamond&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
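&lt;p&gt;The diamond CFG in the figure can be written down as a minimal adjacency
structure (a Python sketch for illustration, not Cranelift&#x27;s actual
representation):&lt;&#x2F;p&gt;

```python
# The diamond CFG from the figure as a successor map: each block name
# maps to the list of blocks its final branch may transfer to.
cfg = {
    "L0": ["L1", "L2"],   # two-way branch at the end of L0
    "L1": ["L3"],
    "L2": ["L3"],
    "L3": [],             # exit block
}

def predecessors(cfg, block):
    # Dataflow analyses often need the reverse edges as well.
    return [b for b, succs in cfg.items() if block in succs]
```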
&lt;p&gt;Control-flow graphs are excellent data structures for compilers to
use. By making the flow of execution explicit as graph edges, rather
than reasoning about instructions in order in memory as the processor
sees them, many analyses can be performed more easily. For example,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Data-flow_analysis&quot;&gt;dataflow analysis&lt;&#x2F;a&gt;
problems can be solved easily because the CFG makes traversal of
possible control-flow transfers easy. Graph-based representations of
the program also allow easier &lt;em&gt;moving and insertion of code&lt;&#x2F;em&gt;: it is
less error-prone to manipulate an explicit graph than to reason about
implicit control-flow (e.g. fallthrough from a not-taken conditional
branch). Finally, the graph representation factors out the question of
&lt;em&gt;block ordering&lt;&#x2F;em&gt;, which can be important for performance; we can
address this problem separately by choosing how we serialize the graph
nodes (blocks). For these reasons, most compiler IRs, including
Cranelift&#x27;s CLIF and &lt;code&gt;VCode&lt;&#x2F;code&gt;, are CFG-based.&lt;&#x2F;p&gt;
&lt;p&gt;(Historical note: control-flow graphs were invented by the late
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Frances_Allen&quot;&gt;Frances Allen&lt;&#x2F;a&gt;, who
largely established the algorithmic foundations that modern compilers
use. Her paper &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.clear.rice.edu&#x2F;comp512&#x2F;Lectures&#x2F;Papers&#x2F;1971-allen-catalog.pdf&quot;&gt;A catalogue of optimizing
transformations&lt;&#x2F;a&gt;&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;
covers essentially all of the important optimizations used today and
is well worth a read.)&lt;&#x2F;p&gt;
&lt;h2 id=&quot;cpus-and-branch-instructions&quot;&gt;CPUs and Branch Instructions&lt;&#x2F;h2&gt;
&lt;p&gt;To represent a CFG&#x27;s end-of-block branches at the instruction level,
we can use &lt;em&gt;two-way branches&lt;&#x2F;em&gt;: these are instructions that branch
either to one basic-block target if some condition is true, or another
if the condition is false. (Basic blocks can also end in simple
unconditional single-target branches.) We wrote such a branch as &lt;code&gt;if r0, L1, L2&lt;&#x2F;code&gt; above; this means that the block &lt;code&gt;L0&lt;&#x2F;code&gt; will be followed in
execution either by &lt;code&gt;L1&lt;&#x2F;code&gt; or &lt;code&gt;L2&lt;&#x2F;code&gt;, depending on the value in &lt;code&gt;r0&lt;&#x2F;code&gt;.&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;branches-with-fallthrough&quot;&gt;Branches with Fallthrough&lt;&#x2F;h3&gt;
&lt;p&gt;However, CPUs rarely have such two-way branch instructions. Instead,
conditional control-flow in common ISAs is almost always provided with
a &lt;em&gt;conditional branch with fallthrough&lt;&#x2F;em&gt;. This is an instruction that,
if some condition is true, branches to another location; otherwise,
does nothing, and allows execution to continue sequentially. This is a
better fit for a hardware implementation for a number of reasons: it&#x27;s
easier to encode one target than two (the destination of the jump
might be quite far away for some branches, and instructions have
limited bits available), and it&#x27;s usually the case that the compiler
can place one of the successor blocks immediately afterward anyway.&lt;&#x2F;p&gt;
&lt;p&gt;Now, this isn&#x27;t much of a problem if we just want a working compiler;
instead of a two-way branch&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    if r0, L1, L2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We can write a sequence of branches&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    br_if r0, L1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;where &lt;code&gt;br_if&lt;&#x2F;code&gt; branches to &lt;code&gt;L1&lt;&#x2F;code&gt; or falls through to the unconditional
&lt;code&gt;goto&lt;&#x2F;code&gt;. But this is not so efficient in many cases. Consider what
would happen if we laid out basic blocks in the order &lt;code&gt;L0&lt;&#x2F;code&gt;, &lt;code&gt;L2&lt;&#x2F;code&gt;,
&lt;code&gt;L1&lt;&#x2F;code&gt;, &lt;code&gt;L3&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      br_if r0, L1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      goto L2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      goto L3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      goto L3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L3:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      return&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There are two redundant unconditional branches (&lt;code&gt;goto&lt;&#x2F;code&gt; instructions),
each of which uselessly branches to the following instruction. We can
remove both of them with no ill effects, taking advantage instead of
&lt;em&gt;fallthrough&lt;&#x2F;em&gt;, or allowing execution to proceed directly from the end
of one block to the start of the next one:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      br_if r0, L1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &#x2F;&#x2F; ** Otherwise, fall through to L2 **&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      goto L3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      &#x2F;&#x2F; ** Always fall through to L3 **&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    L3:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;      return&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This seems like an easy enough problem to solve: we just need to
recognize when a branch is redundant and remove it, right? Well, yes,
but we can do much better than that in some cases; we&#x27;ll dig into this
problem in significantly more depth below!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;machine-code-encoding-branch-offsets&quot;&gt;Machine-code Encoding: Branch Offsets&lt;&#x2F;h3&gt;
&lt;p&gt;So far, we&#x27;ve written our machine instructions in a way that humans
can read, using &lt;em&gt;labels&lt;&#x2F;em&gt; to refer to locations in the instruction
stream. At the hardware level, however, these labels do not exist;
instead, the machine code branches contain target &lt;em&gt;addresses&lt;&#x2F;em&gt; (usually
encoded as relative &lt;em&gt;offsets&lt;&#x2F;em&gt; from the branch instruction). In other
words, we do not see &lt;code&gt;goto L3&lt;&#x2F;code&gt;, but rather &lt;code&gt;goto +32&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This gives rise to several complications when emitting machine code
from a list of instruction &lt;code&gt;struct&lt;&#x2F;code&gt;s.  At the most basic level, we
have to resolve labels to offsets and then patch the branches
appropriately. This is analogous to (but at a lower level than) the
job of a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Linker_(computing)&quot;&gt;linker&lt;&#x2F;a&gt;:
we resolve symbols to concrete values after deciding placement, and
then edit the code according to &lt;em&gt;relocations&lt;&#x2F;em&gt; to refer to those
symbols. In other words, whenever we emit a branch, we make a note (a
relocation, or &quot;label use&quot; in our &lt;code&gt;MachBackend&lt;&#x2F;code&gt;) to go back later and
patch it with the resolved label offset.&lt;&#x2F;p&gt;
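&lt;p&gt;A toy version of this record-and-patch scheme might look as follows (a
sketch with invented names; the real &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; is considerably
more sophisticated, editing branches on-the-fly rather than only at the
end):&lt;&#x2F;p&gt;

```python
import struct

# A toy "record a label use, patch it later" buffer: every branch is
# emitted with a placeholder offset and a note (a relocation) to come
# back and fill in the PC-relative offset once labels are resolved.
class ToyBuffer:
    def __init__(self):
        self.code = bytearray()
        self.label_offsets = {}   # label -> resolved code offset
        self.fixups = []          # (offset of 4-byte field, label)

    def bind_label(self, label):
        self.label_offsets[label] = len(self.code)

    def emit_branch(self, opcode, label):
        self.code.append(opcode)
        self.fixups.append((len(self.code), label))
        self.code += b"\x00\x00\x00\x00"  # placeholder offset

    def finish(self):
        # Like a linker applying relocations: patch each branch with
        # the offset of its (now resolved) target, relative to the end
        # of the 4-byte offset field.
        for at, label in self.fixups:
            rel = self.label_offsets[label] - (at + 4)
            self.code[at:at + 4] = struct.pack("<i", rel)
        return bytes(self.code)
```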
&lt;p&gt;The second, and more interesting, problem arises because not all
branch instructions can necessarily refer to all possible labels! As a
concrete example, on AArch64, conditional branches have a ±1 MB range,
and unconditional branches have a ±128 MB range. This arises out of
instruction-encoding considerations: particularly in
fixed-instruction-size ISAs (such as ARM, MIPS, and RISC-V), less than
a full machine word of bits are available for the immediate jump
offset that is embedded in the instruction word. (The instruction
itself is always one machine word wide, and we need some bits for the
opcode and condition code too!) On x86, we have limits for a different
reason: the variable-width encoding allows either a one-byte offset
(allowing a ±128 byte range) or four-byte offset (allowing a ±2 GB
range).&lt;&#x2F;p&gt;
&lt;p&gt;To make a branch to a far-off label, then, on some machines we need to
either use a different sort of branch than the default choice for the
instruction selector, or we need to use a form of &lt;em&gt;indirection&lt;&#x2F;em&gt;, by
targeting the original branch at &lt;em&gt;another branch&lt;&#x2F;em&gt; that is itself in a
special form. The former is tricky because we do not know whether a
target will be in-range until all code is lowered and placement is
computed; so we need to either optimistically or pessimistically lower
branches to the shortest or longest form (respectively) and possibly
switch later. To make matters worse, as we edit branches to use a
shorter or longer form, their length may change, moving &lt;em&gt;other&lt;&#x2F;em&gt;
targets into or out of range; in the most general solution, this is a
&quot;fixpoint problem&quot;, where we iterate until no more changes occur.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;challenges-in-lowering-cfgs-to-machine-code&quot;&gt;Challenges in Lowering CFGs to Machine Code&lt;&#x2F;h2&gt;
&lt;p&gt;So far, we have a way to produce &lt;em&gt;correct&lt;&#x2F;em&gt; machine code. To emit the
final code for a two-target branch, we can emit a conditional branch
followed by an unconditional branch. To resolve
branch targets correctly, we can assume that any target could be
anywhere in memory, and always use the long form of a branch; then we
just need to come back in one final pass and fill in the offsets when
we know them.&lt;&#x2F;p&gt;
&lt;p&gt;We can do much better than this, though! Below I&#x27;ll describe four
problems and the ways that they are traditionally solved.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;problem-1-efficient-use-of-fallthroughs&quot;&gt;Problem 1: Efficient use of Fallthroughs&lt;&#x2F;h3&gt;
&lt;p&gt;We described above how &lt;em&gt;branch fallthroughs&lt;&#x2F;em&gt; allow us to omit
some unconditional branches once we know for certain the order in which
basic blocks will appear in the final binary. In particular, the simple
lowering of a two-way branch &lt;code&gt;if r0, label_if_true, label_if_false&lt;&#x2F;code&gt; to
two one-way branches&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    br_if r0, label_if_true&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto label_if_false&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;label_if_false:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;label_if_true:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;has a completely redundant and useless &lt;code&gt;goto&lt;&#x2F;code&gt;! In general, if a branch
target is the very next instruction, we can delete that branch.&lt;&#x2F;p&gt;
&lt;p&gt;However, there are slightly more complex cases where we can also find
some improvements. Consider the inverted version of the above:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    br_if r0, label_if_false&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto label_if_true&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;label_if_false:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;label_if_true:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;No branch here branches to its fallthrough, so one might think that
both branches are necessary. But in practice, on most CPU
architectures, all conditional branches have &lt;em&gt;inverted forms&lt;&#x2F;em&gt;. For
example, the x86 instruction &lt;code&gt;JE&lt;&#x2F;code&gt; (jump if equal) can be inverted to
&lt;code&gt;JNE&lt;&#x2F;code&gt; (jump if not equal). If we are allowed to edit branch conditions
as well, then we can rewrite the above as:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    br_if_not r0, label_if_true&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;label_if_false:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;label_if_true:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This turns out to remove many additional branches in practice.&lt;&#x2F;p&gt;
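&lt;p&gt;Both simplifications can be sketched as a small rewrite over a block&#x27;s
final branch group (the tuple representation here is invented for
illustration):&lt;&#x2F;p&gt;

```python
# Branches at the end of a block, as ("goto", target) or
# ("br_if", cond, target); `next_block` is the block laid out next.
def simplify(branches, next_block):
    if not branches:
        return []
    last = branches[-1]
    if last[0] == "goto" and last[1] == next_block:
        # Rule 1: an unconditional branch to the fall-through block is
        # redundant; drop it and re-check what remains.
        return simplify(branches[:-1], next_block)
    if (len(branches) == 2 and branches[0][0] == "br_if"
            and branches[0][2] == next_block and last[0] == "goto"):
        # Rule 2: `br_if c, next; goto T` becomes the inverted
        # `br_if_not c, T`; the not-taken path now falls through.
        return [("br_if_not", branches[0][1], last[1])]
    return branches
```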
&lt;h3 id=&quot;problem-2-empty-blocks&quot;&gt;Problem 2: Empty Blocks&lt;&#x2F;h3&gt;
&lt;p&gt;It is sometimes the case that after optimizations, a basic block is
completely &lt;em&gt;empty&lt;&#x2F;em&gt; aside from a final unconditional branch. This can
occur when all of the code in an if- or else-block is optimized away
or moved elsewhere in the function body. It can also occur when a
block was inserted to &lt;em&gt;split a critical edge&lt;&#x2F;em&gt; (see below).&lt;&#x2F;p&gt;
&lt;p&gt;Thus, a common optimization is &lt;em&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Jump_threading&quot;&gt;jump
threading&lt;&#x2F;a&gt;&lt;&#x2F;em&gt;: when one
branch points directly to another, we can just edit the first branch
to point to the final target. Generalized, we can &quot;chase through&quot; any
number of branches to eliminate intermediate steps. For example:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;L1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;L2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;L3:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;can become:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L3   &#x2F;&#x2F; &amp;lt;--- edited branch&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;L1:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;L2:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    goto L3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;L3:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;    ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Note that the intermediate branches &lt;em&gt;were not removed&lt;&#x2F;em&gt;: they may still
be the targets of &lt;em&gt;other branches&lt;&#x2F;em&gt;. We skip over them when starting
from the first branch. However, if we know some other way that these
branches are unused, we can then delete them, reducing code size.&lt;&#x2F;p&gt;
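&lt;p&gt;The &quot;chase through&quot; step can be sketched as follows (invented
representation: &lt;code&gt;empty_goto&lt;&#x2F;code&gt; maps each goto-only block to its
target). Note the cycle guard: a chain of empty blocks can form a loop, and
the compiler should not spin on it:&lt;&#x2F;p&gt;

```python
# Follow goto-only blocks until we reach one with real code, guarding
# against a cycle of empty blocks.
def thread(target, empty_goto):
    seen = set()
    while target in empty_goto and target not in seen:
        seen.add(target)
        target = empty_goto[target]
    return target
```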
&lt;h3 id=&quot;problem-3-branch-relaxation&quot;&gt;Problem 3: Branch Relaxation&lt;&#x2F;h3&gt;
&lt;p&gt;As we noted above, the &quot;branch relaxation&quot; problem is that we must
choose one of &lt;em&gt;multiple forms&lt;&#x2F;em&gt; for each branch instruction, each of
which may have a different range (maximal distance from current
program-counter location). This is complex because the needed range
depends on the final locations of the branch and its target, which in
turn depends on the size of instructions in the machine code; but some
of those instructions are themselves branches. We thus have a circular
dependency.&lt;&#x2F;p&gt;
&lt;p&gt;There will always be &lt;em&gt;some&lt;&#x2F;em&gt; way to branch to an arbitrary location in
the processor&#x27;s address space, so there is always the trivial but
inefficient solution of using worst-case branch forms. However, we can
usually do much better, because the majority of branches will be to
relatively small offsets.&lt;&#x2F;p&gt;
&lt;p&gt;The usual approach to solving this problem involves a &quot;fixpoint
computation&quot;: an iterative loop that continues to make improvements
until none are left. This is where the &quot;relaxation&quot; of branch
relaxation comes from: we modify branch instructions to have more
optimal forms as we discover that targets are within range; and as we
do this, we recompute code offsets and see if this enables any other
relaxations. As long as the relationship between branch range and
branch instruction size is monotonic (smaller required range allows
for shorter instruction), this will always converge to a unique
fixpoint; but it is potentially expensive, and involves sticky
data-structure design questions if we want the code editing and&#x2F;or
offset recomputation to be fast.&lt;&#x2F;p&gt;
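&lt;p&gt;The fixpoint loop can be sketched as follows (sizes and ranges are
invented for illustration; a real implementation must also solve the
data-structure problems just mentioned):&lt;&#x2F;p&gt;

```python
# Each branch starts in its worst-case long form and is shrunk once its
# target is provably within short range; offsets are then recomputed,
# possibly enabling further shrinking. Because branches only shrink
# (monotonicity), the loop must terminate.
SHORT, LONG = 2, 6       # branch encodings, in bytes
SHORT_RANGE = 128        # max |offset| reachable by the short form

def relax(insns):
    # insns: list of ("branch", target_insn_index) or ("other", size).
    sizes = [LONG if kind == "branch" else arg for kind, arg in insns]
    changed = True
    while changed:       # iterate to a fixpoint
        changed = False
        offsets = [0]
        for size in sizes:
            offsets.append(offsets[-1] + size)
        for i, (kind, arg) in enumerate(insns):
            if kind == "branch" and sizes[i] == LONG:
                # Offset is relative to the end of the branch itself.
                dist = offsets[arg] - offsets[i + 1]
                if abs(dist) <= SHORT_RANGE:
                    sizes[i] = SHORT
                    changed = True
    return sizes
```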
&lt;h3 id=&quot;problem-4-critical-edges&quot;&gt;Problem 4: Critical Edges&lt;&#x2F;h3&gt;
&lt;p&gt;For a number of reasons, we usually want to &lt;em&gt;split &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_graph#Special_edges&quot;&gt;critical
edges&lt;&#x2F;a&gt;&lt;&#x2F;em&gt;
in the control-flow graph. A critical edge is any control-flow
transfer edge that comes &lt;em&gt;from&lt;&#x2F;em&gt; a block with multiple out-edges, and
goes &lt;em&gt;to&lt;&#x2F;em&gt; a block with multiple in-edges. We sometimes need to insert
some code to run whenever the program follows a critical edge: e.g.,
the register allocator may need to &quot;fix up&quot; the machine state, moving
values around in registers as expected by the target block. Consider
where we might insert such code: we can&#x27;t insert it prior to the jump,
because this would execute no matter what out-edge is
taken. Similarly, we can&#x27;t insert it at the target of the jump,
because this would execute for any entry into the target block, not
just transfers over the particular edge.&lt;&#x2F;p&gt;
&lt;p&gt;The solution is to &quot;split&quot; the critical edge: that is, create a new
basic block, edit the branch to point to this block, and then create
an unconditional branch in the block to the original target. This
basic block is a place where we can insert whatever fixup code we
need, and it will execute &lt;em&gt;only&lt;&#x2F;em&gt; when control flow transfers from the
one specific block to the other. A critical-edge split is illustrated
in the following figure:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-12-19-critical-edges-web.svg&quot; alt=&quot;Figure: Splitting a critical edge&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;There are multiple ways in which we could handle this problem: we
could preemptively split every critical edge; or we could split them
on demand, only when we need to insert code. The latter would require
editing the CFG in place, and for various reasons, we would prefer to
avoid doing this: it invalidates many analysis results, and
complicates data structures. It is also much simpler to reason about
many algorithms if we can assume that edges are already
split. However, splitting every edge will leave many empty blocks,
because we &lt;em&gt;usually&lt;&#x2F;em&gt; do not need to insert any fixup code on an edge.
In addition, splitting an edge raises the question of &lt;em&gt;where&lt;&#x2F;em&gt; to
insert the split-block. If we take the simplest approach and append it
to the end of the function, we forfeit many of the branch-fallthrough
simplifications we could otherwise make; a smarter
heuristic that placed the block near its predecessor or successor
would be better.&lt;&#x2F;p&gt;
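&lt;p&gt;Identifying and splitting critical edges can be sketched over a
successor-map CFG (a hypothetical Python sketch, not the actual Cranelift
implementation):&lt;&#x2F;p&gt;

```python
# An edge b -> s is critical when b has multiple out-edges and s has
# multiple in-edges; route each such edge through a fresh block that
# holds only `goto s` (and, later, any fixup code).
def split_critical_edges(cfg):
    preds = {b: 0 for b in cfg}
    for succs in cfg.values():
        for s in succs:
            preds[s] += 1
    out = {b: list(succs) for b, succs in cfg.items()}
    counter = 0
    for b, succs in cfg.items():
        if len(succs) < 2:
            continue                  # source has a single out-edge
        for i, s in enumerate(succs):
            if preds[s] < 2:
                continue              # target has a single in-edge
            split = f"split{counter}"
            counter += 1
            out[split] = [s]
            out[b][i] = split
    return out
```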
&lt;h2 id=&quot;traditional-approach-in-place-edits&quot;&gt;Traditional Approach: In-Place Edits&lt;&#x2F;h2&gt;
&lt;p&gt;The traditional approach to all of these problems is to decompose the
task into a number of &lt;em&gt;passes&lt;&#x2F;em&gt; and perform &lt;em&gt;in-place edits&lt;&#x2F;em&gt; with those
passes. For example, in LLVM, IR is lowered into a machine-specific
form (&lt;code&gt;MachineFunction&lt;&#x2F;code&gt; of &lt;code&gt;MachineBasicBlock&lt;&#x2F;code&gt;s) with an explicit
notion of layout and with machine-level branch instructions; then
edits are made, taking care to update branches when the layout
changes.&lt;&#x2F;p&gt;
&lt;p&gt;For example, the following sequence of passes should handle most of
the above issues:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Split all critical edges, placing the split-block after the
predecessor. (In LLVM, the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-project&#x2F;blob&#x2F;3fa2d37eb3f8acddcfde749ca822f2cc7d900cbb&#x2F;llvm&#x2F;lib&#x2F;Transforms&#x2F;Utils&#x2F;BreakCriticalEdges.cpp&quot;&gt;&lt;code&gt;SplitCriticalEdges&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
pass.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Perform other optimizations, and register allocation; these may use
the split-blocks.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Perform jump-threading transform; this will remove control-flow
transfers through empty blocks. (In LLVM, the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-project&#x2F;blob&#x2F;3fa2d37eb3f8acddcfde749ca822f2cc7d900cbb&#x2F;llvm&#x2F;lib&#x2F;CodeGen&#x2F;BranchFolding.cpp&quot;&gt;&lt;code&gt;BranchFolding&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
pass.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Compute reachability, and delete &quot;dead blocks&quot; (blocks that are no
longer reachable). (Also done by &lt;code&gt;BranchFolding&lt;&#x2F;code&gt; in LLVM.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Compute a block order that tries to minimize jump distances and
places at least one successor directly after every block when
possible. (In LLVM, the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-project&#x2F;blob&#x2F;3fa2d37eb3f8acddcfde749ca822f2cc7d900cbb&#x2F;llvm&#x2F;lib&#x2F;CodeGen&#x2F;MachineBlockPlacement.cpp&quot;&gt;&lt;code&gt;MachineBlockPlacement&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
pass.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Linearize the code from the CFG nodes into a single stream of
machine instructions using this block order. (In LLVM, blocks are
initially lowered into the &lt;code&gt;MachineFunction&lt;&#x2F;code&gt; and then reordered by
&lt;code&gt;MachineBlockPlacement&lt;&#x2F;code&gt;.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Remove branches to fallthrough blocks, and invert conditionals that
create additional fallthroughs.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Compute block offsets based on machine-code size of current
instruction sequence, assuming worst-case size for every branch.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Scan over branches, checking whether block locations allow for
shorter forms due to nearer targets. Update branches and recompute
block offsets if so. Continue until fixpoint. (In LLVM, the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;llvm&#x2F;llvm-project&#x2F;blob&#x2F;3fa2d37eb3f8acddcfde749ca822f2cc7d900cbb&#x2F;llvm&#x2F;lib&#x2F;CodeGen&#x2F;BranchRelaxation.cpp&quot;&gt;&lt;code&gt;BranchRelaxation&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
does this.)&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Fill in branch targets using final offsets. Branches are now in a
form ready for machine-code emission.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
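&lt;p&gt;As a rough sketch of the final two steps above (worst-case offset
computation, then shrinking branches to a fixpoint), with made-up
instruction sizes and field names rather than LLVM&#x27;s actual data
structures:&lt;&#x2F;p&gt;

```python
# Sketch of the offset-computation and branch-relaxation fixpoint; all
# sizes and field names here are assumptions, not LLVM data structures.
SHORT_BRANCH, LONG_BRANCH, RANGE = 2, 5, 128  # assumed byte sizes / reach

def relax_branches(blocks):
    """blocks: list of {'size': non-branch bytes, 'branch': optional
    {'target': block index, 'short': bool}} in layout order."""
    for b in blocks:                      # start with worst-case forms
        if b.get('branch'):
            b['branch']['short'] = False
    changed = True
    while changed:
        changed = False
        offsets, pos = [], 0              # block offsets at current sizes
        for b in blocks:
            offsets.append(pos)
            pos += b['size']
            if b.get('branch'):
                pos += SHORT_BRANCH if b['branch']['short'] else LONG_BRANCH
        for i, b in enumerate(blocks):    # shrink any newly-near branch
            br = b.get('branch')
            if br and not br['short']:
                if RANGE > abs(offsets[br['target']] - offsets[i]):
                    br['short'] = True
                    changed = True        # sizes changed; recompute offsets
    return offsets
```

&lt;p&gt;Termination is guaranteed because branches only ever shrink, so the
loop reaches a fixpoint after at most one size change per branch.&lt;&#x2F;p&gt;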
&lt;p&gt;Clearly this will work, and with some care (especially in the
block-placement heuristics), it will produce very good code. But the
above steps require &lt;em&gt;many&lt;&#x2F;em&gt; in-place edits. This is both slow (we are
re-doing some work every time we edit the code) and forces us to use
data structures that allow for such edits (e.g., linked lists), which
impose a tax on every other operation on the IR. Is there a better
way?&lt;&#x2F;p&gt;
&lt;h2 id=&quot;cranelift-s-new-approach-streaming-edits&quot;&gt;Cranelift&#x27;s New Approach: Streaming Edits&lt;&#x2F;h2&gt;
&lt;p&gt;It would be ideal if we could avoid some of the code-transform passes
described above; can we? It turns out that one can actually do the
functional equivalent of &lt;em&gt;all&lt;&#x2F;em&gt; of the above as part of other,
pre-existing work:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;We can decide the final block order ahead of time, and do our
CLIF-to-VCode lowering in this order, so VCode never needs to be
reordered; it is already linearized. We can also insert
critical-edge splits as part of this lowering.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We can do &lt;em&gt;all&lt;&#x2F;em&gt; of the other work -- inverting conditionals,
threading jumps, removing dead blocks, and handling various branch
sizes -- in a streaming approach during machine-code emission! The
key insight is that we can do a sort of &quot;peephole optimization&quot;: we
can immediately delete and re-emit branches at the &lt;em&gt;tail&lt;&#x2F;em&gt; of the
emission buffer. By tracking some auxiliary state during the single
emission scan, such as reachability, labels at current emission
point, a list of unresolved label-refs earlier in code, and a
&quot;deadline&quot; for short-range branches, we can do everything we need
to do without ever backing up more than a few contiguous branches
at the end of the buffer.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Let&#x27;s go into each of those in more detail!&lt;&#x2F;p&gt;
&lt;h3 id=&quot;step-1-decide-final-order-and-split-edges-while-lowering&quot;&gt;Step 1: Decide Final Order and Split Edges while Lowering&lt;&#x2F;h3&gt;
&lt;p&gt;As part of the instruction-selection pipeline described in the
&lt;a href=&quot;&#x2F;blog&#x2F;2020&#x2F;09&#x2F;18&#x2F;cranelift-isel-1&quot;&gt;previous post&lt;&#x2F;a&gt;, we need
to iterate over the basic blocks in the CLIF and, for each block,
lower its code to VCode instructions. We would like to do this
iteration in the same order as our final machine code layout so that
we don&#x27;t need to reorder the VCode later.&lt;&#x2F;p&gt;
&lt;p&gt;The only constraint that the lowering algorithm imposes is that we
examine &lt;em&gt;value uses&lt;&#x2F;em&gt; before &lt;em&gt;value defs&lt;&#x2F;em&gt;, which we can ensure by
visiting a block before any of its dominators. That leaves a lot of
freedom in how we do the lowering.&lt;&#x2F;p&gt;
&lt;p&gt;If that were the whole problem, we could just do a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Depth-first_search#Vertex_orderings&quot;&gt;postorder&lt;&#x2F;a&gt;
traversal and be done with it. In fact, the problem is complicated by
one other factor: critical-edge splitting!&lt;&#x2F;p&gt;
&lt;p&gt;Recall from above that we must either preemptively split all
critical edges or else find a way to edit in place later. To avoid the
complexities of in-place editing, we choose to split them all. Note
that this is cheap as far as our CFG lowering is concerned, because
our later branch optimizations will remove empty blocks almost for
free. (The register allocator&#x27;s analyses may become more expensive
with a higher block count, but in practice we have not found this to
be much of a problem.)&lt;&#x2F;p&gt;
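&lt;p&gt;For concreteness, here is a small sketch of the standard
critical-edge test on a CFG given as a successor map; the data shapes
are illustrative, not Cranelift&#x27;s actual representation:&lt;&#x2F;p&gt;

```python
# A minimal sketch of critical-edge detection; illustrative only.

def critical_edges(succs):
    """succs: dict mapping block to its list of successor blocks.
    An edge a to b is critical iff a has more than one successor and
    b has more than one predecessor: fixup code for that edge can go
    in neither a nor b, so the edge must get its own block."""
    preds = {}
    for a, ss in succs.items():
        for b in ss:
            preds.setdefault(b, []).append(a)
    out = []
    for a, ss in succs.items():
        for b in ss:
            if len(ss) > 1 and len(preds[b]) > 1:
                out.append((a, b))
    return out
```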
&lt;p&gt;The challenge is in generating these blocks in the correct place &lt;em&gt;on
the fly&lt;&#x2F;em&gt;. To generate the lowering order, we define a &lt;em&gt;virtual graph&lt;&#x2F;em&gt;
that is never actually materialized, whose nodes are implied by the
CLIF blocks and edges (every CLIF edge becomes a split-edge block) and
whose edges are defined only by a successor function. To generate the
lowering order, we perform a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Depth-first_search&quot;&gt;depth-first
search&lt;&#x2F;a&gt; over the
virtual graph, recording the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Depth-first_search#Vertex_orderings&quot;&gt;postorder&lt;&#x2F;a&gt;. This
postorder is guaranteed to see uses before defs, as required. The DFS
itself is a pretty good heuristic for block placement: it will tend to
group structured-control-flow code together into its hierarchical
units.&lt;&#x2F;p&gt;
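&lt;p&gt;A minimal sketch of this postorder computation over the virtual
graph, with the successor function as the graph&#x27;s only definition; the
node shapes and names here are illustrative, not the real
implementation:&lt;&#x2F;p&gt;

```python
# Sketch of the lowering-order computation: a DFS postorder over a
# virtual graph whose nodes are CLIF blocks and (pred, succ) edge
# pairs, defined only by a successor function and never materialized.

def lowering_order(entry, cfg_succs):
    """cfg_succs: dict mapping a CLIF block to its successor list."""
    def virtual_succs(node):
        if node[0] == 'block':
            # Every CFG edge out of a block becomes a virtual edge node.
            return [('edge', node[1], s) for s in cfg_succs.get(node[1], [])]
        # An edge node has exactly one successor: its target block.
        return [('block', node[2])]

    postorder, seen = [], set()
    def dfs(node):
        if node in seen:
            return
        seen.add(node)
        for s in virtual_succs(node):
            dfs(s)
        postorder.append(node)   # record on the way out: postorder

    dfs(('block', entry))
    # A block's dominators are DFS-tree ancestors, which finish after
    # it, so every block appears before its dominators in this order.
    return postorder
```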
&lt;p&gt;There are additional details in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;blockorder.rs&quot;&gt;the
implementation&lt;&#x2F;a&gt;
that ensure we split only critical edges rather than all edges and
that record block-successor information directly as we produce lowered
blocks (so that subsequent backend stages do not need to recompute
it), along with some other small optimizations.&lt;&#x2F;p&gt;
&lt;p&gt;This is illustrated in the following figure, which shows, at a
conceptual level, a CLIF-level CFG transformed with split edges and
merged edge-blocks and then linearized, alongside the successor
function that is actually defined to drive the DFS. Note that the
naïve lowering of the split-edge CFG
would result in 14 branches (due to 14 CFG edges); the final lowered
machine code contains only 4 branches, while providing a slot for any
needed fixup instructions on any CFG edge.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-12-19-cfg-lowering-web.svg&quot; alt=&quot;Figure: CFG lowering with edge-splitting and merging using implicit DFS&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h3 id=&quot;step-2-edit-branches-while-emitting&quot;&gt;Step 2: Edit Branches while Emitting&lt;&#x2F;h3&gt;
&lt;p&gt;Once we have lowered &lt;code&gt;VCode&lt;&#x2F;code&gt;, we need to emit machine code! In a
conventional design, this would require linearization, a bunch of
branch optimizations, and branch-target resolution before we ever
produced a byte of machine code. But we can do much better.&lt;&#x2F;p&gt;
&lt;p&gt;In Cranelift&#x27;s design, a machine backend just &lt;em&gt;emits every conditional
branch naïvely as a two-way combination&lt;&#x2F;em&gt; into a machine-code buffer we
call the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt;. Critically, however, this &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; is not
merely a &lt;code&gt;Vec&amp;lt;u8&amp;gt;&lt;&#x2F;code&gt;: it knows (many) things about its content,
including where its branches are, what the branches&#x27; targets are, and
how to invert the branches if necessary.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; will perform &lt;em&gt;streaming edits&lt;&#x2F;em&gt; on the code as it is
emitted, editing only a &lt;em&gt;suffix&lt;&#x2F;em&gt; of the buffer (contiguous bytes up to
the end, or current emission point), in order to convert two-way
branch combos when possible into simpler forms.&lt;&#x2F;p&gt;
&lt;p&gt;The abstraction that the machine backend sees is:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;The &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; allows us to emit machine-code bytes.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We can tell the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; that a certain range of machine-code
bytes that we just emitted are a &lt;em&gt;branch&lt;&#x2F;em&gt;, either conditional or
unconditional, how to invert it if conditional, and a &lt;em&gt;label&lt;&#x2F;em&gt; as the
branch target.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We can &lt;em&gt;bind&lt;&#x2F;em&gt; a label to the current emission point.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;We parameterize the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; on a &lt;code&gt;LabelUse&lt;&#x2F;code&gt; trait
implementation which defines all the different kinds of
branch-target references, how to patch the machine code with a
resolved offset, and how to emit a &lt;em&gt;veneer&lt;&#x2F;em&gt;, i.e., a longer-form
branch that the original branch can indirect through in order to
reach further.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
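&lt;p&gt;A toy model of this abstraction might look as follows; the method
names are hypothetical, and the real Rust &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; API
differs in its details:&lt;&#x2F;p&gt;

```python
# A toy sketch of the emission-buffer interface described above;
# method names are illustrative, not the real MachBuffer API.

class EmitBuffer:
    def __init__(self):
        self.data = bytearray()
        self.label_offsets = {}  # label -> resolved offset, once bound
        self.fixups = []         # (offset, label, kind) awaiting patching

    def cur_offset(self):
        return len(self.data)

    def put_bytes(self, bs):
        self.data.extend(bs)

    def bind_label(self, label):
        # Bind a label to the current emission point. Once resolved,
        # an offset never changes, so fixups can be applied once.
        self.label_offsets[label] = self.cur_offset()

    def use_label(self, offset, label, kind):
        # Record that the bytes at `offset` reference `label` with a
        # given reference kind (cf. the LabelUse trait); patched when
        # the label is bound, or immediately if it already is.
        self.fixups.append((offset, label, kind))
```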
&lt;p&gt;And that&#x27;s it! The &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; does all of the work behind the
scenes: when we emit a branch, it sometimes chomps some bytes; and
when we bind a label, it sometimes scans through a list of deferred
fixups to patch earlier machine code.&lt;&#x2F;p&gt;
&lt;p&gt;A (simplified) illustrated example is shown below. The machine backend
emits two-way branches naïvely by always emitting a conditional and
unconditional branch (e.g. at the end of basic block &lt;code&gt;L0&lt;&#x2F;code&gt;). It also
provides metadata to the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; that describes where the labels
are, where the branches are, where the branches are targeted (as
labels), and how to invert conditional branches. The &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; is
able to perform the listed streaming edits &lt;em&gt;as code is emitted&lt;&#x2F;em&gt;,
producing the final machine code at the right with no intermediate
buffering or additional passes. We&#x27;ll describe how this editing occurs
in more detail below.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-12-19-machbuffer-web.svg&quot; alt=&quot;Figure: MachBuffer emission and edits&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h4 id=&quot;branch-peephole-optimizations&quot;&gt;Branch Peephole Optimizations&lt;&#x2F;h4&gt;
&lt;p&gt;The key insight of the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; design is that we can edit
branches &lt;em&gt;at the tail of the buffer&lt;&#x2F;em&gt; as code is emitted by tracking
the &quot;most recent&quot; branches: specifically, the branches that are
&lt;em&gt;contiguous to the tail of the buffer&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The first optimization that we do is &lt;em&gt;branch inversion&lt;&#x2F;em&gt;, which can
sometimes eliminate unconditional branches. In the example above, when
the backend has emitted the machine-code bytes for all of the &lt;code&gt;L0&lt;&#x2F;code&gt;
basic block, the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; will know that the last two branches,
contiguous to the tail, are a conditional branch to &lt;code&gt;L1&lt;&#x2F;code&gt; and an
unconditional branch to &lt;code&gt;L2&lt;&#x2F;code&gt;. When we then see the label &lt;code&gt;L1&lt;&#x2F;code&gt; (which
is the first branch&#x27;s target) bound to this offset, we can apply a
simple rule: a conditional that jumps over an immediately following
unconditional can be inverted, and the unconditional branch
removed. Note that, critically, because these branches are &lt;em&gt;contiguous
to the tail&lt;&#x2F;em&gt;, the edit will not affect any offsets that have already
been resolved; we are free to contract the code-size here, and offsets
of subsequently-emitted code will be correct without further fixups.&lt;&#x2F;p&gt;
&lt;p&gt;Said another way, our approach &lt;em&gt;never moves code&lt;&#x2F;em&gt;. Rather, it only
sometimes &lt;em&gt;chomps or adjusts a just-emitted branch&lt;&#x2F;em&gt;, right away,
before code-emission carries on past the branch.&lt;&#x2F;p&gt;
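&lt;p&gt;The inversion rule can be sketched on a symbolic list of tail
branches as follows; the data shapes are purely illustrative, not the
real byte-level implementation:&lt;&#x2F;p&gt;

```python
# Sketch of the branch-inversion rule: a conditional that jumps over
# an immediately following unconditional, to a label bound right at
# the tail, becomes the inverted conditional alone. Illustrative only.

def invert_at_tail(tail_branches, bound_label):
    """tail_branches: branches contiguous to the buffer tail, oldest
    first, e.g. [('cond', 'eq', 'L1'), ('uncond', 'L2')].
    bound_label: the label just bound at the current tail offset."""
    if len(tail_branches) >= 2:
        a, b = tail_branches[-2], tail_branches[-1]
        if a[0] == 'cond' and b[0] == 'uncond' and a[2] == bound_label:
            inverse = {'eq': 'ne', 'ne': 'eq', 'lt': 'ge', 'ge': 'lt'}
            # Chomp the unconditional; the inverted conditional now
            # targets the unconditional's old target.
            return tail_branches[:-2] + [('cond', inverse[a[1]], b[1])]
    return tail_branches
```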
&lt;p&gt;The next optimizations we do are &lt;em&gt;jump threading&lt;&#x2F;em&gt; and &lt;em&gt;dead-block
removal&lt;&#x2F;em&gt;. Recall from above that jump threading means that
intermediate steps in a chain of jumps can be removed: a jump to a
jump to X becomes just a jump to X. We resolve this by keeping an
&lt;em&gt;up-to-date alias table&lt;&#x2F;em&gt; that tracks label-to-label aliases. The table
is updated whenever the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; is informed that an unconditional
jump was emitted and a label was bound to its address. We then
indirect through the alias table when resolving labels to final
offsets. The second task, dead-block removal, occurs as a side-effect
of tracking &lt;em&gt;reachability&lt;&#x2F;em&gt; of the current buffer tail. Any offset that
(i) immediately follows an unconditional jump, &lt;em&gt;and&lt;&#x2F;em&gt; (ii) has no
labels bound to it, is unreachable; an unconditional jump at an
unreachable offset can be elided. (Actually, any code at an
unreachable offset can be removed, but for simplicity and to make it
easier to reason about correctness, we restrict the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt;&#x27;s
edits to code explicitly marked as branch instructions only.)&lt;&#x2F;p&gt;
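&lt;p&gt;A sketch of label resolution through the alias table: the real
implementation keeps the table acyclic by construction, and the chase
limit here mirrors its defense-in-depth bound. Names are illustrative
only:&lt;&#x2F;p&gt;

```python
# Sketch of jump threading via the label-to-label alias table;
# a jump to a jump to X resolves to X's offset. Illustrative only.

CHASE_LIMIT = 1_000_000  # defense-in-depth; the table should be acyclic

def resolve(label, aliases, offsets):
    """aliases: label -> label it was redirected to (jump threading).
    offsets: label -> final byte offset, for fully resolved labels."""
    for _ in range(CHASE_LIMIT):
        if label not in aliases:
            return offsets[label]
        label = aliases[label]
    raise RuntimeError('alias chain too long; table should be acyclic')
```

&lt;p&gt;For example, a use of &lt;code&gt;L1&lt;&#x2F;code&gt;, where &lt;code&gt;L1&lt;&#x2F;code&gt; was
bound at an unconditional jump to &lt;code&gt;L2&lt;&#x2F;code&gt;, resolves directly
to &lt;code&gt;L2&lt;&#x2F;code&gt;&#x27;s offset.&lt;&#x2F;p&gt;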
&lt;p&gt;In order for this to work correctly, we need to track all labels that
have been bound to the current buffer tail and adjust them if we chomp
(truncate) the buffer or redirect a label. For this reason, the
label-binding, label-use resolution, and branch-chomping are all
tightly integrated into a set of interacting data structures:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-12-19-machbuffer-structures-rewrites-web.svg&quot; alt=&quot;Figure: MachBuffer data structures and rewrites&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;To summarize, we track:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Emitted bytes;&lt;&#x2F;li&gt;
&lt;li&gt;All labels bound to the current offset;&lt;&#x2F;li&gt;
&lt;li&gt;A table mapping each label to a bound offset, or &quot;unbound&quot;;&lt;&#x2F;li&gt;
&lt;li&gt;A table mapping each label to another label it aliases, or &quot;unaliased&quot;;&lt;&#x2F;li&gt;
&lt;li&gt;A list of the &lt;em&gt;last contiguous branches&lt;&#x2F;em&gt;, each recording
whether it is conditional (and, if so, its inverted form), its
label-use, and any labels that are bound &lt;em&gt;to&lt;&#x2F;em&gt; this branch
instruction;&lt;&#x2F;li&gt;
&lt;li&gt;A list of other label-uses for fixup.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;As we emit code, we append to the emitted-bytes buffer. We lazily
invalidate the &quot;labels at current offset&quot; set by tracking the offset
for which that set is valid; appending new code implicitly clears it.&lt;&#x2F;p&gt;
&lt;p&gt;As the machine backend tells the &lt;code&gt;MachBuffer&lt;&#x2F;code&gt; about branches, we
append to the list of the last contiguous branches. This, too, is
invalidated when code is emitted that is not a branch.&lt;&#x2F;p&gt;
&lt;p&gt;When a label-use is noted and the label is already resolved, we fix up
the buffer right away. Note that once a label resolves to an offset,
that offset cannot change; so this fixup can be done once and the
metadata discarded.&lt;&#x2F;p&gt;
&lt;p&gt;All branch simplification happens when a label is &lt;em&gt;bound&lt;&#x2F;em&gt;: this is
when new actions become possible. We perform the following algorithm
(see
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;ce6e967eebc7e293950394ba212689190e2cf0ed&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;buffer.rs#L753&quot;&gt;&lt;code&gt;MachBuffer::optimize_branches()&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
for details):&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Loop as long as there are some branches in the latest-branches list:
&lt;ul&gt;
&lt;li&gt;If the current buffer tail is beyond the end of the latest branch,
we are done (clear the list).&lt;&#x2F;li&gt;
&lt;li&gt;If the latest branch (which ends at current tail) has a target
that resolves to current tail, chomp it and restart loop.&lt;&#x2F;li&gt;
&lt;li&gt;If the latest branch is unconditional &lt;em&gt;and does not branch to
itself&lt;&#x2F;em&gt;:
&lt;ul&gt;
&lt;li&gt;Update any labels pointing &lt;em&gt;at&lt;&#x2F;em&gt; this branch to point &lt;em&gt;at its
target&lt;&#x2F;em&gt; instead.&lt;&#x2F;li&gt;
&lt;li&gt;Restart loop if any labels were moved.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;If latest branch is unconditional, follows another unconditional
branch, and no labels are bound at this branch, then chomp it
(unreachable) and restart loop.&lt;&#x2F;li&gt;
&lt;li&gt;If latest branch is unconditional, follows a conditional branch,
and conditional branch target is current tail, then invert
conditional and chomp the unconditional, and restart loop.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;When loop is done, clear latest-branches list; no more can be
simplified.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
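&lt;p&gt;The loop above can be sketched symbolically as follows. Branches
here carry only offsets and targets rather than real machine bytes,
and all field names are illustrative; the real byte-level logic lives
in &lt;code&gt;MachBuffer::optimize_branches()&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;

```python
# Symbolic sketch of the simplification loop run at each label-bind.
# st: {'branches': latest contiguous branches (oldest first), each
# {'start','end','kind','target'} plus 'cc' if conditional; 'tail':
# buffer length; 'offsets': label->offset; 'aliases': label->label;
# 'bound': offset->list of labels bound there}. Illustrative only.

INVERT = {'eq': 'ne', 'ne': 'eq'}  # assumed condition-code inverses

def optimize_branches(st):
    def resolve(lbl):                  # real impl proves this acyclic
        while lbl in st['aliases']:
            lbl = st['aliases'][lbl]
        return st['offsets'].get(lbl)

    def chomp():                       # truncate the latest branch
        b = st['branches'].pop()
        st['tail'] = b['start']
        for lbl in st['bound'].pop(b['end'], []):
            st['offsets'][lbl] = b['start']   # labels move to new tail
            st['bound'].setdefault(b['start'], []).append(lbl)

    while st['branches']:
        b = st['branches'][-1]
        if st['tail'] > b['end']:
            st['branches'] = []        # tail moved past: list is stale
            break
        if resolve(b['target']) == st['tail']:
            chomp()                    # branch to fallthrough: chomp
            continue
        if b['kind'] == 'uncond':
            here = st['bound'].get(b['start'], [])
            if here and b['target'] not in here:  # not a self-branch
                for lbl in here:       # thread labels past this jump
                    st['aliases'][lbl] = b['target']
                    st['offsets'].pop(lbl, None)
                st['bound'][b['start']] = []
                continue
            prev = st['branches'][-2] if len(st['branches']) > 1 else None
            if prev and prev['kind'] == 'uncond' and not here:
                chomp()                # unreachable unconditional
                continue
            if prev and prev['kind'] == 'cond' and \
                    resolve(prev['target']) == st['tail']:
                uncond_target = b['target']
                chomp()                # cond-over-uncond: invert cond
                prev['cc'] = INVERT[prev['cc']]
                prev['target'] = uncond_target
                continue
        break
    return st
```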
&lt;p&gt;This may look as though it has undesirable algorithmic complexity,
but in fact it is tightly bounded: we can make a forward-progress
argument, since labels only move down alias chains and a fixed amount
of work is done per branch (each is examined only once and then acted
upon or purged). Overall, the algorithm runs in linear time.&lt;&#x2F;p&gt;
&lt;p&gt;This linear-time algorithm, which edits locally, avoids any code
movement, and streams into a buffer in final form, is far better than
the multi-pass edit-in-place design of a traditional backend, both
asymptotically and in practice (CPUs love streaming algorithms and
minimized data movement). It seems to produce code nearly as good as a
much more complex branch simplifier, at a much lower cost.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;correctness&quot;&gt;Correctness&lt;&#x2F;h3&gt;
&lt;p&gt;The algorithm for simplifying branches is one of the most critical to
correctness in the (post-optimizer) compiler backend, probably second
only to the register allocator. It is very subtle, and bugs can be
disastrous: incorrect control flow could cause &lt;em&gt;anything&lt;&#x2F;em&gt; to happen,
from impossible-to-debug incorrect program results (ask me &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;1729&quot;&gt;how I
know&lt;&#x2F;a&gt;! And
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;2083&quot;&gt;here too&lt;&#x2F;a&gt;!)
to serious security vulnerabilities.&lt;&#x2F;p&gt;
&lt;p&gt;Because of this, we have taken &lt;em&gt;extensive&lt;&#x2F;em&gt; care to ensure
correctness. In fact, more than a third of the lines in the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;ce6e967eebc7e293950394ba212689190e2cf0ed&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;buffer.rs&quot;&gt;&lt;code&gt;MachBuffer&lt;&#x2F;code&gt;
implementation&lt;&#x2F;a&gt;
are a proof of correctness, based on several core invariants
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;ce6e967eebc7e293950394ba212689190e2cf0ed&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;buffer.rs#L109-L141&quot;&gt;described
here&lt;&#x2F;a&gt;. At
each data-structure mutation, we show that (i) the invariants still
hold, and (ii) the code mutation did not alter execution semantics.&lt;&#x2F;p&gt;
&lt;p&gt;Because there is significant wisdom in the Knuth quote &quot;I have only
proved it correct, not tried it&quot; (there are always gaps between
specification and reality, and unless one generates an implementation
from a machine-checked proof, then the English prose or its
translation to code may have bugs too&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;), all invariants are also
fully checked on each label-bind event in debug builds of
Cranelift. The various fuzzing harnesses that hammer on the new
backend will thus be driving these checks continuously.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;other-concerns&quot;&gt;Other Concerns&lt;&#x2F;h3&gt;
&lt;p&gt;There are many subtleties to the branch-simplification and code layout
problems that were not discussed here! Most prominently, we have not
covered &lt;em&gt;branch veneers&lt;&#x2F;em&gt; or the topic of branch ranges at all, though
we saw the &lt;em&gt;problem&lt;&#x2F;em&gt; of &quot;branch relaxation&quot; above. The &lt;code&gt;MachBuffer&lt;&#x2F;code&gt;
handles out-of-range branches by tracking a &quot;deadline&quot; (the last point
at which any currently outstanding label may be bound without causing
a branch to go out of range); if we hit the deadline, we emit an
&lt;em&gt;island&lt;&#x2F;em&gt; of &lt;em&gt;branch veneers&lt;&#x2F;em&gt;, which are commonly just long-range
unconditional branches, for each unresolved label and resolve the
labels to those branches. This extends the deadlines. In practice,
island emission will almost never occur, so it is acceptable to
pessimize this case (add an extra indirection) to avoid the need to go
back and edit the original branch into a longer-range form.&lt;&#x2F;p&gt;
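&lt;p&gt;A sketch of this deadline check and island emission, with an
assumed branch range and veneer size rather than any real
ISA&#x27;s:&lt;&#x2F;p&gt;

```python
# Sketch of deadline-driven island emission for short-range branches;
# the range and veneer size are assumptions, not a real ISA's.

SHORT_RANGE = 32768  # assumed maximum forward reach of a short branch
VENEER_SIZE = 4      # assumed size of a long-range unconditional branch

def maybe_emit_island(tail, pending):
    """pending: list of {'from': offset of a short-range branch,
    'label': its unresolved target}. When the earliest deadline is
    about to pass, emit an island of long-range veneers and retarget
    the pending branches at them, extending every deadline."""
    deadline = min((p['from'] + SHORT_RANGE for p in pending), default=None)
    if deadline is None or deadline > tail + VENEER_SIZE * len(pending):
        return tail, pending, []   # still in range: no island needed
    veneers = []
    for p in pending:
        # The short branch at p['from'] is patched to hit this veneer,
        # which in turn jumps (with long range) to the real label.
        veneers.append({'at': tail, 'label': p['label']})
        tail += VENEER_SIZE
    return tail, [], veneers
```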
&lt;p&gt;We also haven&#x27;t covered constant pools; these are handled with the
same &quot;island&quot; mechanism, allowing emitted machine code to refer to
nearby constant data.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;conclusion-and-next-time&quot;&gt;Conclusion, and Next Time&lt;&#x2F;h2&gt;
&lt;p&gt;This has been a deep dive into the world of branch simplification,
with an emphasis on how we engineered Cranelift&#x27;s new backend to
provide very good compilation speed, taking control-flow handling and
branch lowering&#x2F;simplification as an example. We believe that there
may be other significant opportunities to rethink, and carefully
engineer, core algorithms in the compiler backend with specific
attention to &lt;em&gt;maximizing streaming behavior&lt;&#x2F;em&gt;, &lt;em&gt;minimizing
indirection&lt;&#x2F;em&gt;, and &lt;em&gt;minimizing passes over data&lt;&#x2F;em&gt;. This is an
interesting and exciting engineering pursuit largely because it goes
beyond the world of &quot;theoretical standard compiler-book algorithms&quot;
and calls on problem solving to find clever new design tricks.&lt;&#x2F;p&gt;
&lt;p&gt;As we described near the end of this post, &lt;em&gt;correctness&lt;&#x2F;em&gt; is also an
important focus -- perhaps &lt;em&gt;the&lt;&#x2F;em&gt; most important focus -- of any
compiler engineering effort. Given that, I plan to write the next (and
final) post in this series about how we engineered for correctness by
taking a deep dive into the &lt;em&gt;register allocator checker&lt;&#x2F;em&gt;, which is a
novel symbolic checker (which can be seen as an application of
abstract interpretation) that allows us to &lt;em&gt;prove&lt;&#x2F;em&gt; that any particular
register-allocator execution gave a correct allocation result.  I&#x27;ll
talk about how this checker, driven by a fuzzing frontend, found some
&lt;em&gt;really subtle and interesting bugs&lt;&#x2F;em&gt; that we likely never would have
found in production otherwise. With that, until next time, happy
compiling!&lt;&#x2F;p&gt;
&lt;p&gt;&lt;em&gt;For discussions about this post, please feel free to join us on our Zulip
instance in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;#narrow&#x2F;stream&#x2F;217117-cranelift&#x2F;topic&#x2F;blog.20post.20on.20branch.20optimizations.20in.20new.20backend&quot;&gt;this
thread&lt;&#x2F;a&gt;.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to Benjamin Bouvier for reviewing this post and providing very helpful
feedback! Thanks also to bjorn3 for correcting a typo in a figure.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;Frances E. Allen and John Cocke. &lt;em&gt;A catalogue of optimizing
transformations.&lt;&#x2F;em&gt; In &lt;em&gt;Design and Optimization of Compilers&lt;&#x2F;em&gt;
(Prentice-Hall, 1972), pp. 1--30.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;This is a bit of a simplification of branches in the IR: in CLIF
(and in most other CFG-based compiler IRs), there are several
branch types beyond the two-target conditional form. One is the
&quot;switch&quot; or &quot;branch table&quot; branch
that chooses between N possible targets with an integer
index. There are also simple single-target unconditional
branches; and a return instruction is also a &quot;branch&quot; of sorts
in that it ends a basic block, though it has no successors. The
important takeaway is that IR-level branches are an abstraction
level above machine-code control flow, allowing for a direct
choice between several or many targets as one operation.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;See &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;pull&#x2F;2083&quot;&gt;PR
#2083&lt;&#x2F;a&gt;
above, which is a bug that arose &lt;em&gt;after&lt;&#x2F;em&gt; I wrote the correctness
proof, because the proof assumed target-aliasing supported
arbitrarily-long branch chains but it actually followed only one
level. This was a deliberate earlier implementation choice to
avoid infinite loops on branch cycles. It turns out that it&#x27;s
possible to just avoid cycles in the alias table by
construction; we carefully prove that this is so and then allow
redirect-chasing through chains of branches. For extra paranoia,
because a non-terminating compiler is bad and we are merely
human, we &lt;em&gt;still&lt;&#x2F;em&gt; limit redirect-chasing to 1 million branches
and panic beyond that (because one can never be too careful);
this is a limit that will never be hit when using the Wasm
frontend (due to limits on function size) and is extremely
unlikely to be hit elsewhere.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>A New Backend for Cranelift, Part 1: Instruction Selection</title>
        <published>2020-09-18T00:00:00+00:00</published>
        <updated>2020-09-18T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/2020/09/18/cranelift-isel-1/"/>
        <id>https://cfallin.org/blog/2020/09/18/cranelift-isel-1/</id>
        <content type="html" xml:base="https://cfallin.org/blog/2020/09/18/cranelift-isel-1/">&lt;p&gt;This post is the first in a three-part series about my recent work on
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&quot;&gt;Cranelift&lt;&#x2F;a&gt;
as part of my day job at Mozilla. In this first post, I will set some context
and describe the instruction selection problem. In particular, I&#x27;ll talk about
a revamp to the instruction selector and backend framework in general that
we&#x27;ve been working on for the last nine months or so. This work has been
co-developed with my brilliant colleagues Julian Seward and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;benj.me&quot;&gt;Benjamin
Bouvier&lt;&#x2F;a&gt;, with significant early input from &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;sunfishcode&quot;&gt;Dan
Gohman&lt;&#x2F;a&gt; as well, and help from all of the
wonderful Cranelift hackers.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;background-cranelift&quot;&gt;Background: Cranelift&lt;&#x2F;h2&gt;
&lt;p&gt;So what is Cranelift? The project is a compiler framework written in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.rust-lang.org&#x2F;&quot;&gt;Rust&lt;&#x2F;a&gt; that is designed especially (but not
exclusively) for &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Just-in-time_compilation&quot;&gt;just-in-time
compilation&lt;&#x2F;a&gt;. It&#x27;s a
general-purpose compiler: its most popular use-case is to compile
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.webassembly.org&#x2F;&quot;&gt;WebAssembly&lt;&#x2F;a&gt;, though several other frontends
exist, for example,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bjorn3&#x2F;rustc_codegen_cranelift&quot;&gt;cg_clif&lt;&#x2F;a&gt;, which adapts the
Rust compiler itself to use Cranelift. Folks at Mozilla and several other
places have been developing the compiler for a few years now.  It is the
default compiler backend for
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&quot;&gt;wasmtime&lt;&#x2F;a&gt;, a runtime for
WebAssembly outside the browser, and is used in production in several other
places as well. We recently flipped the switch to turn on Cranelift-based
WebAssembly support in nightly Firefox on &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;AArch64&quot;&gt;ARM64
(AArch64)&lt;&#x2F;a&gt; machines, including most
smartphones, and if all goes well, it will eventually go out in a stable
Firefox release. Cranelift is developed under the umbrella of the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.org&#x2F;&quot;&gt;Bytecode
Alliance&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;In the past nine months, we have built a new framework in Cranelift for the
&quot;machine backends&quot;, or the parts of the compiler that support particular CPU
instruction sets. We also added a new backend for AArch64, mentioned above, and
filled out features as needed until Cranelift was ready for production use in
Firefox. This blog post sets some context and describes the design process that
went into the backend-framework revamp.&lt;&#x2F;p&gt;
&lt;p&gt;It can be a bit confusing to keep all of the moving parts straight. Here&#x27;s a
visual overview of Cranelift&#x27;s place among various other components, focusing
on two of the major Rust crates (the Wasm frontend and the codegen backend) and
several of the other programs that make use of Cranelift:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-09-10-cranelift-components.svg&quot; alt=&quot;Figure: Cranelift and other components&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;old-backend-design-instruction-legalizations&quot;&gt;Old Backend Design: Instruction Legalizations&lt;&#x2F;h2&gt;
&lt;p&gt;To understand the work that we&#x27;ve done recently on Cranelift, we&#x27;ll need to
zoom into the &lt;code&gt;cranelift_codegen&lt;&#x2F;code&gt; crate above and talk about how it &lt;em&gt;used to&lt;&#x2F;em&gt;
work. What is this &quot;CLIF&quot; input, and how does the compiler translate it to
machine code that the CPU can execute?&lt;&#x2F;p&gt;
&lt;p&gt;Cranelift makes use of
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;docs&#x2F;ir.md&quot;&gt;CLIF&lt;&#x2F;a&gt;,
or the Cranelift IR (Intermediate Representation) Format, to represent the code
that it is compiling. Every compiler that performs program optimizations uses
some form of an &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Intermediate_representation&quot;&gt;Intermediate Representation
(IR)&lt;&#x2F;a&gt;: you can think
of this like a virtual instruction set that can represent all the operations a
program is allowed to do. The IR is typically simpler than real instruction
sets, designed to use a small set of well-defined instructions so that the
compiler can easily reason about what a program means. The IR is also
independent of the CPU architecture that the compiler eventually targets; this
lets much of the compiler (such as the part that generates IR from the input
programming language, and the parts that optimize the IR) be reused whenever
the compiler is adapted to target a new CPU architecture.  CLIF is in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Static_single_assignment_form&quot;&gt;Static
Single Assignment
(SSA)&lt;&#x2F;a&gt; form, and
uses a conventional &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Control-flow_graph&quot;&gt;control-flow
graph&lt;&#x2F;a&gt; with basic blocks
(it previously allowed extended basic blocks, but these have been phased
out). Unlike many SSA IRs, it represents φ-nodes with block parameters
rather than explicit φ-instructions.&lt;&#x2F;p&gt;
&lt;p&gt;Within &lt;code&gt;cranelift_codegen&lt;&#x2F;code&gt;, before we revamped the backend design, the program
remained in CLIF throughout compilation and up until the compiler emitted the
final machine code. This might seem to contradict what we just said: how can
the IR be machine-independent, but also be the final form from which we emit
machine code?&lt;&#x2F;p&gt;
&lt;p&gt;The answer is that the old backends were built around the concept of
&quot;legalization&quot; and &quot;encodings&quot;. At a high level, the idea is that every
&lt;em&gt;Cranelift&lt;&#x2F;em&gt; instruction either corresponds to one &lt;em&gt;machine&lt;&#x2F;em&gt; instruction, or can
be replaced by a sequence of other &lt;em&gt;Cranelift&lt;&#x2F;em&gt; instructions. Given such a
mapping, we can refine the CLIF in steps, starting from arbitrary
machine-independent instructions from earlier compiler stages, performing edits
until the CLIF corresponds 1-to-1 with machine code. Let&#x27;s visualize this
process:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-09-10-cranelift-legalization.svg&quot; alt=&quot;Figure: legalization by repeated instruction expansion&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;A very simple example of a CLIF instruction that has a direct &quot;encoding&quot; to a
machine instruction is &lt;code&gt;iadd&lt;&#x2F;code&gt;, which just adds two integers. On essentially any
modern architecture, this should map to a simple ALU instruction that adds two
registers.&lt;&#x2F;p&gt;
&lt;p&gt;On the other hand, many CLIF instructions do not map cleanly. Some arithmetic
instructions fall into this category: for example, there is a CLIF instruction
to count the number of set bits in an integer&#x27;s binary representation
(&lt;code&gt;popcount&lt;&#x2F;code&gt;); not every CPU has a single instruction for this, so it might be
expanded into a longer series of bit manipulations. There are operations that
are defined at a higher semantic level, as well, that will necessarily be
lowered with expansions: for example, accesses to Wasm memories are lowered
into operations that fetch the linear memory base and its size, bounds-check
the Wasm address against the limit, compute the real address for the Wasm
address, and perform the access.&lt;&#x2F;p&gt;
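&lt;p&gt;As a concrete illustration, here is the kind of branch-free bit-manipulation
sequence that a &lt;code&gt;popcount&lt;&#x2F;code&gt; expansion might produce on a target with no native
population-count instruction. This is a sketch in Rust of the classic SWAR
technique, not Cranelift&#x27;s actual legalization rule:&lt;&#x2F;p&gt;

```rust
/// Branch-free popcount for a 64-bit value: the kind of bit-manipulation
/// sequence a `popcount` legalization might expand into on a CPU without
/// a dedicated instruction. (Illustrative only, not Cranelift's actual
/// expansion rule.)
pub fn popcount_expansion(x: u64) -> u64 {
    // Count bits in each 2-bit pair.
    let x = x - ((x >> 1) & 0x5555_5555_5555_5555);
    // Sum pairs into 4-bit nibbles.
    let x = (x & 0x3333_3333_3333_3333) + ((x >> 2) & 0x3333_3333_3333_3333);
    // Sum nibbles into bytes.
    let x = (x + (x >> 4)) & 0x0f0f_0f0f_0f0f_0f0f;
    // Multiply sums all bytes into the top byte; shift it down.
    x.wrapping_mul(0x0101_0101_0101_0101) >> 56
}
```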
&lt;p&gt;To compile a function, then, we iterate over the CLIF and find instructions
with no direct machine encodings; for each, we simply expand it into its
legalized sequence, and then recursively consider the instructions in that sequence. We
loop until all instructions have machine encodings. At that point, we can emit
the bytes corresponding to each instruction&#x27;s encoding&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#1&quot;&gt;1&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;.&lt;&#x2F;p&gt;
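&lt;p&gt;In outline, this expansion loop looked something like the following toy
sketch; the instruction names and expansion rule here are hypothetical, not the
real implementation:&lt;&#x2F;p&gt;

```rust
/// Toy sketch of fixpoint-style legalization: keep expanding instructions
/// that have no machine encoding until none remain. The instructions and
/// the expansion rule are hypothetical.
#[derive(Clone, Debug, PartialEq)]
pub enum ToyInst {
    Popcount,      // no direct encoding on our toy target
    Shr, And, Add, // directly encodable
}

pub fn has_encoding(inst: &ToyInst) -> bool {
    !matches!(inst, ToyInst::Popcount)
}

/// Expand one unencodable instruction into encodable ones.
pub fn expand(inst: &ToyInst) -> Vec<ToyInst> {
    match inst {
        ToyInst::Popcount => vec![ToyInst::Shr, ToyInst::And, ToyInst::Add],
        other => vec![other.clone()],
    }
}

/// Loop until a pass makes no change (the fixpoint).
pub fn legalize(mut insts: Vec<ToyInst>) -> Vec<ToyInst> {
    while !insts.iter().all(has_encoding) {
        insts = insts.iter().flat_map(expand).collect();
    }
    insts
}
```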
&lt;h2 id=&quot;growing-pains-and-a-new-backend-framework&quot;&gt;Growing Pains, and a New Backend Framework?&lt;&#x2F;h2&gt;
&lt;p&gt;There are a number of advantages to the legacy Cranelift backend design, which
performs expansion-based legalization with a single IR throughout. As one might
expect, though, there are also a number of drawbacks. Let&#x27;s discuss a few of
each.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;single-ir-and-legalization-pros&quot;&gt;Single IR and Legalization: Pros&lt;&#x2F;h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;By operating on a single IR all the way to machine-code emission, the same
optimizations can be applied at multiple stages. For example, consider a
legalization expansion that turns a high-level &quot;access Wasm memory&quot;
instruction into a sequence of loads, adds and bounds-checks. If many such
sequences occur in one function, we might be able to factor out common
portions (e.g.: computing the base of the Wasm memory).  Thus the
legalization scheme exposes as much code as possible, at as many stages as
possible, to opportunities for optimization. The legacy Cranelift pipeline
in fact works in this way: it runs &quot;pre-opt&quot; and &quot;post-opt&quot; optimization
passes, before and after legalization respectively.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;If &lt;em&gt;most&lt;&#x2F;em&gt; Cranelift instructions map to a single machine instruction, and
few legalizations are necessary, then this scheme can be very fast: compilation
becomes little more than a single traversal to fill in &quot;encodings&quot;, which were
represented by small indices into a table.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;h3 id=&quot;single-ir-and-legalization-cons&quot;&gt;Single IR and Legalization: Cons&lt;&#x2F;h3&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Expansion-based legalization may not always result in
optimal code. So far we&#x27;ve seen that legalization can convert from CLIF to
machine instructions with one-to-one or one-to-many mappings. However, there
are sometimes also &lt;em&gt;single&lt;&#x2F;em&gt; machine instructions that implement the behavior of
&lt;em&gt;multiple&lt;&#x2F;em&gt; CLIF instructions, i.e. a many-to-one mapping. In order to generate
efficient code, we want to be able to make use of these instructions.&lt;&#x2F;p&gt;
&lt;p&gt;For example, on x86, an instruction that references memory can compute an
address like &lt;code&gt;base + scale * index&lt;&#x2F;code&gt;, where &lt;code&gt;base&lt;&#x2F;code&gt; and &lt;code&gt;index&lt;&#x2F;code&gt; are registers
and &lt;code&gt;scale&lt;&#x2F;code&gt; is 1, 2, 4, or 8. There is no notion of such an address mode in
CLIF, so we would want to pattern-match the raw &lt;code&gt;iadd&lt;&#x2F;code&gt; (add) and &lt;code&gt;ishl&lt;&#x2F;code&gt;
(shift) or &lt;code&gt;imul&lt;&#x2F;code&gt; (multiply) operations when they occur in the address
computation. Then, we would want to somehow select the encoding on the
&lt;code&gt;load&lt;&#x2F;code&gt; instruction based on the fact that its input is some specific
combination of adds and shifts&#x2F;multiplies.  This seems to break the
abstraction that the encoding represents only that instruction&#x27;s operation.&lt;&#x2F;p&gt;
&lt;p&gt;In principle, we could implement more general pattern matching for legalization
rules to allow many-to-one mappings. However, this would be a significant
refactor; and as long as we were reconsidering the design in whole, there were
other reasons to avoid patching the problem in this way.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;There is a conceptual difficulty with the single-IR approach: there is
no static representation of which instructions expand into which others, so it
is difficult to reason about the correctness and termination properties
of legalization as a whole.&lt;&#x2F;p&gt;
&lt;p&gt;Specifically, the expansion-based legalization rules must obey a partial
order among instructions: if A expands into a sequence including B, then B
cannot later expand into A. In practice, mappings were mostly one-to-one,
and for those that weren&#x27;t, there was a clear domain separation between the
&quot;input&quot; high-level instructions and the &quot;machine-level&quot; instructions.
However, for more complex machines, or more complex matching schemes that
attempt to make better use of the target instruction set, this could become
a real difficulty for the machine-backend author to keep straight.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;There are efficiency concerns with expansion-based legalization. At
an algorithmic level, we prefer to avoid fixpoint loops (in this case,
&quot;continue expanding until no more expansions exist&quot;) whenever possible. The
runtime is bounded, but the bound is somewhat difficult to reason about,
because it depends on the maximum depth of chained expansions.&lt;&#x2F;p&gt;
&lt;p&gt;The data structures that enable in-place editing are also much slower than
we would like. Typically, compilers store IR instructions in linked lists to
allow for in-place editing. While this is asymptotically as fast as an
array-based solution (we never need to perform random access), it is much
less cache-friendly or
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Instruction-level_parallelism&quot;&gt;ILP&lt;&#x2F;a&gt;-friendly
on modern CPUs. We&#x27;d prefer instead to store arrays of instructions and
perform single passes over them whenever possible.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;Our particular implementation of the legalization scheme grew to be
somewhat unwieldy over time. Witness this GitHub issue, in which my eloquent
colleague &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;benj.me&#x2F;&quot;&gt;Benjamin Bouvier&lt;&#x2F;a&gt; describes all the reasons
we&#x27;d like to fix the design: &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;issues&#x2F;1141&quot;&gt;#1141: Kill Recipes With
Fire&lt;&#x2F;a&gt;. This is no
slight to the engineers who built it; the complexity was managed as well as
could be, with a very nice DSL-based code generation step to produce the
legalizer from high-level rule specifications.  However, reasoning through
legalizations and encodings became more cumbersome than we would prefer, and
the compiler backends were not very accessible to contributors. Adding a new
instruction required learning about &quot;recipes&quot;, &quot;encodings&quot;, and
&quot;legalizations&quot; as well as mere instructions and opcodes, and finding one&#x27;s
way through the DSL to put the pieces together properly. A more conventional
code-lowering approach would avoid much of this complexity.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;A single-level IR has a fundamental tension: for analyses and optimizations
to work well, an IR should have only one way to represent any particular
operation, i.e. should consist of a small set of canonical instructions. On
the other hand, a machine-level representation should represent all of the
relevant details of the target ISA. For example, an address computation
might occur in many different ways (with different addressing modes) on the
machine, but we would prefer not to have to analyze a special
address-computation opcode in all of our analyses. An implicit rule at
emission time (&quot;a load with an add instruction as input always becomes this
addressing mode&quot;) is not ideal, either.&lt;&#x2F;p&gt;
&lt;p&gt;A single IR simply cannot serve both ends of this spectrum properly, and
difficulties arose as CLIF strayed toward either end. To resolve this
conflict, it is best to have a two-level representation, connected by an
explicit instruction selector: this allows CLIF itself to be as simple and as
normalized as possible, while still capturing all the details we need in the
machine-specific instructions.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
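&lt;p&gt;To make the many-to-1 addressing-mode case above concrete, here is a minimal
sketch in Rust of matching &lt;code&gt;base + index * scale&lt;&#x2F;code&gt; over a toy expression tree.
The types and names are hypothetical, not Cranelift&#x27;s:&lt;&#x2F;p&gt;

```rust
/// Toy expression tree for illustrating a many-to-one pattern match.
#[derive(Debug, PartialEq)]
pub enum Expr {
    Reg(u8),
    Const(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// An x86-style `base + index * scale` addressing mode.
#[derive(Debug, PartialEq)]
pub struct AddrMode {
    pub base: u8,
    pub index: u8,
    pub scale: i64,
}

/// Recognize `(Reg base) + ((Reg index) * (Const scale))` with a legal
/// scale, folding several tree nodes into one addressing mode.
pub fn match_addr_mode(e: &Expr) -> Option<AddrMode> {
    if let Expr::Add(lhs, rhs) = e {
        if let (Expr::Reg(base), Expr::Mul(a, b)) = (&**lhs, &**rhs) {
            if let (Expr::Reg(index), Expr::Const(scale)) = (&**a, &**b) {
                if matches!(*scale, 1 | 2 | 4 | 8) {
                    return Some(AddrMode {
                        base: *base,
                        index: *index,
                        scale: *scale,
                    });
                }
            }
        }
    }
    None
}
```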
&lt;p&gt;For all of these reasons, as part of our revamp of Cranelift and a prerequisite
to our new AArch64 backend, we built a new framework for machine backends and
instruction selection. The framework allows machine backends to define their
own instructions, separately from CLIF; rather than legalizing with expansions
and running until a fixpoint, we define a single lowering pass; and everything
is built around more efficient data-structures, carefully optimizing passes
over data and avoiding linked lists entirely. We now describe this new design!&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-new-ir-vcode&quot;&gt;A New IR: VCode&lt;&#x2F;h2&gt;
&lt;p&gt;The main idea of the new Cranelift backend is to &lt;em&gt;add a machine-specific IR&lt;&#x2F;em&gt;,
with several properties that are chosen specifically to represent machine-code
well (i.e., the IR is very close to machine code). We call this &lt;code&gt;VCode&lt;&#x2F;code&gt;, which
comes from &quot;virtual-register code&quot;, and the VCode contains &lt;code&gt;MachInst&lt;&#x2F;code&gt;s, or
machine instructions. The key design choices we made are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;VCode is a linear sequence of instructions. There is control-flow
information that allows traversal over basic blocks, but the data structures
are not designed to easily allow inserting or removing instructions or
reordering code. Instead, we lower into VCode with a single pass,
generating instructions in their final (or near-final) order. I&#x27;ll write more
about how we make this efficient in a follow-up post.&lt;&#x2F;p&gt;
&lt;p&gt;This design aspect avoids the inefficiencies of linked-list data structures,
allowing fast passes over arrays of instructions instead. We&#x27;ve kept the
&lt;code&gt;MachInst&lt;&#x2F;code&gt; size relatively small (16 bytes per instruction for AArch64)
which aids code generation and iteration speed as well.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;VCode is &lt;em&gt;not&lt;&#x2F;em&gt; SSA-based; instead, its instructions operate on registers.
While lowering, we allocate virtual registers. After the VCode is generated,
the register allocator computes appropriate register assignments and edits
the instructions in-place, replacing virtual registers with real registers.
(Both are packed into a 32-bit representation space, using the high bit to
distinguish virtual from real.)&lt;&#x2F;p&gt;
&lt;p&gt;Eschewing SSA at this level allows us to avoid the overhead of maintaining
its invariants, and maps more closely to the real machine. Lowerings for
instructions are allowed to, e.g., use a destination register as a temporary
before performing a final write into it. If we required SSA form, we would
have to allocate a temporary in this case and rely on the register allocator
to coalesce it back to the same register, which adds compile-time overhead.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;li&gt;
&lt;p&gt;VCode is a container for &lt;code&gt;MachInst&lt;&#x2F;code&gt;s, but there is a separate &lt;code&gt;MachInst&lt;&#x2F;code&gt;
type for each machine backend. The machine-independent part is parameterized
on &lt;code&gt;MachInst&lt;&#x2F;code&gt; (which is a trait in Rust) and is statically monomorphized to
the particular target for which the compiler is built.&lt;&#x2F;p&gt;
&lt;p&gt;Modeling a machine instruction with Rust&#x27;s excellent facilities for
strongly-typed data structures, such as &lt;code&gt;enum&lt;&#x2F;code&gt;s, avoids the issue of muddled
instruction domain (is a CLIF instruction machine-independent,
machine-dependent, or both?) and allows each backend to store the appropriate
information for its encoding.&lt;&#x2F;p&gt;
&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
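&lt;p&gt;The packed virtual&#x2F;real register representation mentioned above can be
sketched as follows. This is a minimal sketch assuming one particular bit
layout; Cranelift&#x27;s actual register types live in its register-allocator
code:&lt;&#x2F;p&gt;

```rust
/// Sketch of packing virtual and real registers into one 32-bit space,
/// using the high bit to distinguish them (a hypothetical layout).
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Reg(u32);

const VIRTUAL_BIT: u32 = 1 << 31;

impl Reg {
    /// A virtual register, allocated freely during lowering.
    pub fn virtual_reg(index: u32) -> Reg {
        assert!(index < VIRTUAL_BIT);
        Reg(index | VIRTUAL_BIT)
    }
    /// A real machine register, pinned by an instruction or the ABI.
    pub fn real_reg(hw_enc: u32) -> Reg {
        assert!(hw_enc < VIRTUAL_BIT);
        Reg(hw_enc)
    }
    pub fn is_virtual(self) -> bool {
        self.0 & VIRTUAL_BIT != 0
    }
    /// The register number with the discriminator bit stripped.
    pub fn index(self) -> u32 {
        self.0 & !VIRTUAL_BIT
    }
}
```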
&lt;p&gt;One can visualize a VCode function body as consisting of the following
information (simplified; a real example is further below):&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-09-10-vcode.svg&quot; alt=&quot;Figure: VCode is an array of instructions with block information&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Note that the instructions are simply stored in an array, and the basic blocks
are recorded separately as ranges of array (instruction) indices. As we
described above, we designed this data structure for fast iteration, but not
for editing. We always ensure that the first block (&lt;code&gt;b0&lt;&#x2F;code&gt;) is the entry block,
and that consecutive block indices have contiguous instruction-index ranges
(i.e., are placed next to each other).&lt;&#x2F;p&gt;
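&lt;p&gt;A minimal sketch of this layout, with hypothetical field names, shows why
iteration is cheap: each block is just a contiguous slice of the instruction
array, with no pointer chasing:&lt;&#x2F;p&gt;

```rust
/// Sketch of the VCode layout described above: a flat instruction array
/// plus basic blocks stored as half-open ranges of instruction indices.
/// (Simplified; field names are hypothetical.)
pub struct VCode<I> {
    pub insts: Vec<I>,
    /// For each block, the `[start, end)` range of indices into `insts`.
    pub block_ranges: Vec<(u32, u32)>,
}

impl<I> VCode<I> {
    /// The instructions of one block, as a contiguous slice of the array.
    pub fn block_insts(&self, block: usize) -> &[I] {
        let (start, end) = self.block_ranges[block];
        &self.insts[start as usize..end as usize]
    }
}
```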
&lt;p&gt;Each instruction is mostly opaque from the point of view of the VCode
container, with a few exceptions: every instruction exposes its (i) register
references, and (ii) basic-block targets, if a branch. Register references are
categorized into the usual &quot;uses&quot; and &quot;defs&quot; (reads and writes).&lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#2&quot;&gt;2&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Note as well that the instructions can refer to &lt;em&gt;either&lt;&#x2F;em&gt; virtual registers
(here denoted &lt;code&gt;v0&lt;&#x2F;code&gt;..&lt;code&gt;vN&lt;&#x2F;code&gt;) &lt;em&gt;or&lt;&#x2F;em&gt; real machine registers (here denoted
&lt;code&gt;r0&lt;&#x2F;code&gt;..&lt;code&gt;rN&lt;&#x2F;code&gt;). This design choice allows the machine backend to make use of
specific registers where required by particular instructions, or by the ABI
(parameter-passing conventions). The semantics of VCode are such that the
register allocator recognizes &lt;em&gt;live ranges&lt;&#x2F;em&gt; of the real registers, from defs to
uses, and avoids allocating virtual registers to those particular real
registers for their live ranges. After allocation, all machine instructions are
edited in place to refer only to real registers.&lt;&#x2F;p&gt;
&lt;p&gt;Aside from registers and branch targets, an instruction contained in the VCode
may contain whatever other information is necessary to emit machine code. Each
machine backend defines its own type to store this information. For example, on
AArch64, here are several of the instruction formats, simplified:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword z-storage&quot;&gt;pub enum&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Inst&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; An ALU operation with two register sources and a register destination.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;    AluRRR&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        alu_op&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; ALUOp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        rd&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Writable&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        rn&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        rm&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;    },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; An ALU operation with a register source and an immediate-12 source, and a register&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; destination.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;    AluRRImm12&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        alu_op&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; ALUOp&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        rd&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Writable&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        rn&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        imm12&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Imm12&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;    },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; A MOVZ with a 16-bit immediate.&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;    MovZ&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        rd&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; Writable&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;lt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;&amp;gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        imm&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; MoveWideConst&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        size&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; OperandSize&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;    },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; A two-way conditional branch. Contains two targets; at emission time, a conditional&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; branch instruction followed by an unconditional branch instruction is emitted, but&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; the emission buffer will usually collapse this to just one branch. See a follow-up&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F;&#x2F; blog post for more!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;    CondBr&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        taken&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; BranchTarget&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        not_taken&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; BranchTarget&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        kind&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;:&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt; CondBrKind&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;    },&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These enum arms could be considered similar to &quot;encodings&quot; in the old backend,
except that they are defined in a much more straightforward way. Whereas old
Cranelift backends had to define instruction encodings using a DSL, and these
encodings were assigned a numeric index and a special bit-packed encoding for
additional instruction parameters, here the instructions are simply stored in
type-safe and easy-to-use Rust data structures.&lt;&#x2F;p&gt;
&lt;p&gt;We will not discuss the VCode data-structure design or instruction interface
much further, except to note that the relevant instruction-emission
functionality for a new machine backend can be implemented by providing
a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;cb306fd514f34e7dd818bb17658b93fba98e2567&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;mod.rs#L140&quot;&gt;&lt;code&gt;MachInst&lt;&#x2F;code&gt; trait
implementation&lt;&#x2F;a&gt;
for one&#x27;s instruction type (and then lowering into it; see below). We believe,
and early experience seems to indicate, that this is a much easier task
than what was required to develop a backend in Cranelift&#x27;s old DSL-based
framework.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;lowering-from-clif-to-vcode&quot;&gt;Lowering from CLIF to VCode&lt;&#x2F;h2&gt;
&lt;p&gt;We&#x27;ve now come to the most interesting design question: how do we lower from
CLIF instructions, which are machine-independent, into VCode with the
appropriate type of CPU instructions? In other words, what have we replaced the
expansion-based legalization and encoding scheme with?&lt;&#x2F;p&gt;
&lt;p&gt;In short, the scheme is a &lt;em&gt;single pass&lt;&#x2F;em&gt; over the CLIF instructions, and at each
instruction, we invoke a function provided by the machine backend to lower the
CLIF instruction into VCode instruction(s). The backend is given a
&quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;cb306fd514f34e7dd818bb17658b93fba98e2567&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;lower.rs#L58&quot;&gt;lowering
context&lt;&#x2F;a&gt;&quot;
by which it can examine the instruction and the values that flow into it,
performing &quot;tree matching&quot; as desired (see below). This naturally allows
1-to-1, 1-to-many, or many-to-1 translations. We incorporate a
reference-counting scheme into this pass to ensure that instructions are only
generated if their values are actually used; this is necessary to eliminate
dead code when many-to-1 matches occur.&lt;&#x2F;p&gt;
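&lt;p&gt;The reference-counting idea can be sketched with a toy backward pass: an
instruction is emitted only if its result is demanded by something already
emitted (or it has a side effect), so values swallowed entirely by a
many-to-1 match produce no code. This is an illustrative sketch over a
hypothetical toy IR, not Cranelift&#x27;s actual lowering code:&lt;&#x2F;p&gt;

```rust
/// A toy SSA-like instruction: one result value, some argument values,
/// and a flag for side effects (e.g. stores). Hypothetical, for
/// illustrating use-count-driven lowering.
#[derive(Clone)]
pub struct ToyInst {
    pub result: u32,
    pub args: Vec<u32>,
    pub has_side_effect: bool,
}

/// Walk the instruction list backward; lower an instruction only if its
/// result is referenced by an already-lowered instruction or it has a
/// side effect. Returns the results of the instructions actually lowered,
/// in program order.
pub fn lower_live(insts: &[ToyInst]) -> Vec<u32> {
    let mut use_count: std::collections::HashMap<u32, u32> =
        std::collections::HashMap::new();
    let mut lowered = Vec::new();
    for inst in insts.iter().rev() {
        let used = inst.has_side_effect
            || use_count.get(&inst.result).copied().unwrap_or(0) > 0;
        if used {
            // Lowering this instruction makes its arguments live.
            for &arg in &inst.args {
                *use_count.entry(arg).or_insert(0) += 1;
            }
            lowered.push(inst.result);
        }
    }
    lowered.reverse();
    lowered
}
```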
&lt;h3 id=&quot;tree-matching&quot;&gt;Tree Matching&lt;&#x2F;h3&gt;
&lt;p&gt;Recall that the old design allowed for 1-to-1 and 1-to-many mappings from CLIF
instructions to machine instructions, but not many-to-1. This is particularly
problematic when it comes to pattern-matching for things like addressing modes,
where we want to recognize particular combinations of operations and choose a
specific instruction that covers all of those operations at once.&lt;&#x2F;p&gt;
&lt;p&gt;Let&#x27;s start by defining a &quot;tree&quot; that is rooted at a particular CLIF
instruction. For each argument to the instruction, we can look &quot;up&quot; the program
to find its producer (def). Because CLIF is in SSA form, either the instruction
argument is an ordinary value, which must have exactly one definition, or it is
a block parameter (φ-node in conventional SSA formulations) that represents
multiple possible definitions. We will say that if we reach a block parameter
(φ-node), we simply end at a tree leaf -- it is perfectly alright to
pattern-match on a tree that is a &lt;em&gt;subset&lt;&#x2F;em&gt; of the true dataflow (we might get
suboptimal code, but it will still be correct). For example, given the CLIF
code:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;block0(v0: i64, v1: i64, v7: b1):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  brnz v7, block1(v0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  jump block1(v1)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;block1(v2: i64):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v3 = iconst.i64 64&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v4 = iadd.i64 v2, v3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v5 = iadd.i64 v4, v0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v6 = load.i64 v5&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return v6&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;let&#x27;s consider the load instruction: &lt;code&gt;v6 = load.i64 v5&lt;&#x2F;code&gt;. A simple code
generator could map this 1-to-1 to the CPU&#x27;s ordinary load instruction, using
the register holding &lt;code&gt;v5&lt;&#x2F;code&gt; as an address. This would certainly be correct.
However, we might be able to do better: for example, on AArch64, the available
addressing modes include a two-register sum &lt;code&gt;ldr x0, [x1, x2]&lt;&#x2F;code&gt; or a register
with a constant offset &lt;code&gt;ldr x0, [x1, #64]&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The &quot;operand tree&quot; might be drawn like this:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;&#x2F;assets&#x2F;2020-09-10-load-operands.svg&quot; alt=&quot;Figure: operand tree for load instruction&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;We stop at &lt;code&gt;v2&lt;&#x2F;code&gt; and &lt;code&gt;v0&lt;&#x2F;code&gt; because they are block parameters; we don&#x27;t know with
certainty which instruction will produce these values. We can replace &lt;code&gt;v3&lt;&#x2F;code&gt; with
the constant &lt;code&gt;64&lt;&#x2F;code&gt;. Given this view, the lowering process for the load
instruction can fairly easily choose an addressing mode. (On AArch64, the code
to make this choice is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;cb306fd514f34e7dd818bb17658b93fba98e2567&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;aarch64&#x2F;lower.rs#L653&quot;&gt;here&lt;&#x2F;a&gt;;
in this case it would choose the register + constant immediate form, generating
a separate add instruction for &lt;code&gt;v0 + v2&lt;&#x2F;code&gt;.)&lt;&#x2F;p&gt;
&lt;p&gt;Note that we do not actually explicitly construct an operand tree during
lowering. Instead, the machine backend can query each instruction input, and
the lowering framework will provide &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;cb306fd514f34e7dd818bb17658b93fba98e2567&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;lower.rs#L176&quot;&gt;a
struct&lt;&#x2F;a&gt;
giving the producing instruction if known, the constant value if known, and the
register that will hold the value if needed. The backend may traverse up the
tree (via the &quot;producing instruction&quot;) as many times as needed. If it cannot
combine the operation of an instruction further up the tree into the root
instruction, it can simply use the value in the register at that point instead;
it is always safe (though possibly suboptimal) to generate machine instructions
for only the root instruction.&lt;&#x2F;p&gt;
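&lt;p&gt;To make this concrete, here is a deliberately tiny sketch, in Rust with invented names (this is &lt;em&gt;not&lt;&#x2F;em&gt; Cranelift&#x27;s actual lowering API), of how a backend might choose an addressing mode after querying one level of the operand tree:&lt;&#x2F;p&gt;

```rust
// Hypothetical sketch: choosing an AArch64-style addressing mode for a
// load after querying one level of the operand tree. Names and shapes
// are illustrative only; the real lowering context is much richer.
#[derive(Debug, PartialEq)]
enum AddrMode {
    RegOnly,     // base register only: the always-correct fallback
    RegReg,      // base plus index register
    RegImm(i64), // base plus small constant offset
}

// producer_is_add: whether the address value's producing instruction is
// visible and is an integer add (false for block parameters).
// rhs_is_const / rhs_val: what we learned about the add's second input.
fn choose_addr_mode(producer_is_add: bool, rhs_is_const: bool, rhs_val: i64) -> AddrMode {
    if !producer_is_add {
        // Unknown producer (e.g. a block parameter): use the register
        // form, which is always safe even if suboptimal.
        return AddrMode::RegOnly;
    }
    if rhs_is_const {
        // Unsigned 12-bit immediate field, as on AArch64 loads.
        if rhs_val >= 0 {
            if 4096 > rhs_val {
                return AddrMode::RegImm(rhs_val);
            }
        }
    }
    // An add of two non-constant values: use the two-register form.
    AddrMode::RegReg
}
```

&lt;p&gt;Note how the plain-register form serves as the universal fallback, mirroring the &quot;subset of the true dataflow&quot; property above.&lt;&#x2F;p&gt;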
&lt;h3 id=&quot;lowering-an-instruction&quot;&gt;Lowering an Instruction&lt;&#x2F;h3&gt;
&lt;p&gt;Given this matching strategy, then, how do we actually do the translation?
Basically, the backend provides a function that is called once per CLIF
instruction, at the &quot;root&quot; of the operand tree, and can produce as many machine
instructions as it likes. This function is essentially just a large &lt;code&gt;match&lt;&#x2F;code&gt;
statement over the opcode of the root CLIF instruction, with the match-arms
looking deeper as needed.&lt;&#x2F;p&gt;
&lt;p&gt;Here is a simplified version of the match-arm for an integer add operation
lowered to AArch64 (the full version is
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;cb306fd514f34e7dd818bb17658b93fba98e2567&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;aarch64&#x2F;lower_inst.rs#L75&quot;&gt;here&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;rust&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-keyword&quot;&gt;match&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; op&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;    Opcode&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;::&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Iadd&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&amp;gt;&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt; {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;        let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; rd&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; get_output_reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; outputs&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;        let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; rn&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; put_input_in_reg&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; inputs&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;0&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;        let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; rm&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; put_input_in_rse_imm12&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;ctx&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; inputs&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;[&lt;&#x2F;span&gt;&lt;span class=&quot;z-constant z-numeric&quot;&gt;1&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;]);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-storage&quot;&gt;        let&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; alu_op&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt; =&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt; choose_32_64&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;ty&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span&gt; ALUOp&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;::&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Add32&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span&gt; ALUOp&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;::&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-type&quot;&gt;Add64&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;);&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;        ctx&lt;&#x2F;span&gt;&lt;span class=&quot;z-keyword z-operator&quot;&gt;.&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;emit&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-entity z-name z-function&quot;&gt;alu_inst_imm12&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;(&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt;alu_op&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; rd&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; rn&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;,&lt;&#x2F;span&gt;&lt;span class=&quot;z-variable z-other&quot;&gt; rm&lt;&#x2F;span&gt;&lt;span class=&quot;z-punctuation&quot;&gt;));&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;    }&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation z-definition z-comment z-comment&quot;&gt;    &#x2F;&#x2F; ...&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span class=&quot;z-punctuation&quot;&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There is some magic that happens in several helper functions here.
&lt;code&gt;put_input_in_reg()&lt;&#x2F;code&gt; invokes the proper methods on the &lt;code&gt;ctx&lt;&#x2F;code&gt; to look up the
register that holds an input value. &lt;code&gt;put_input_in_rse_imm12()&lt;&#x2F;code&gt; is more
interesting: it returns a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;cb306fd514f34e7dd818bb17658b93fba98e2567&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;isa&#x2F;aarch64&#x2F;lower.rs#L63&quot;&gt;&lt;code&gt;ResultRSEImm12&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;,
which is a &quot;register, shifted register, extended register, or 12-bit
immediate&quot;. This set of choices captures all of the options we have for the
second argument of an add instruction on AArch64. The helper looks at the node
in the operand tree and attempts to match either a shift or zero&#x2F;sign-extend
operator, which can be incorporated directly into the add. It also checks
whether the operand is a constant and, if so, whether it fits into a 12-bit
immediate field. If none of these forms applies, it falls back to simply using
the register input.
&lt;code&gt;alu_inst_imm12()&lt;&#x2F;code&gt; then breaks down this enum and chooses the appropriate
&lt;code&gt;Inst&lt;&#x2F;code&gt; arm (&lt;code&gt;AluRRR&lt;&#x2F;code&gt;, &lt;code&gt;AluRRRShift&lt;&#x2F;code&gt;, &lt;code&gt;AluRRRExtend&lt;&#x2F;code&gt;, or &lt;code&gt;AluRRImm12&lt;&#x2F;code&gt;
respectively).&lt;&#x2F;p&gt;
&lt;p&gt;And that&#x27;s it! No need for legalization and repeated code editing to match
several operations and produce a machine instruction. We have found this way of
writing lowering logic to be quite straightforward and easy to understand.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;backward-pass-with-use-counts&quot;&gt;Backward Pass with Use-Counts&lt;&#x2F;h3&gt;
&lt;p&gt;Now that we can lower a single instruction, how do we lower a function body
with many instructions? This is not quite as straightforward as looping over
the instructions and invoking the match-over-opcode function described above
(though that would actually work). In particular, we want to handle the
many-to-1 case more efficiently. Consider what happens when the
add-instruction logic above is able to incorporate, say, a left-shift
operator into the add instruction. The &lt;code&gt;add&lt;&#x2F;code&gt; machine instruction would then use
the &lt;em&gt;shift&lt;&#x2F;em&gt;&#x27;s input register, and completely ignore the shift&#x27;s output. If the
shift operator has no other uses, we should avoid doing the computation
entirely; otherwise, there was no point in merging the operation into the add.&lt;&#x2F;p&gt;
&lt;p&gt;We implement a sort of reference counting to solve this problem. In particular,
we track whether any given SSA value is actually used, and we only generate
code for a CLIF instruction if any of its results are used (or if it has a
side-effect that must occur). This is a form of &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Dead_code_elimination&quot;&gt;dead-code
elimination&lt;&#x2F;a&gt; but
integrated into the single lowering pass.&lt;&#x2F;p&gt;
&lt;p&gt;To know whether a value is used, we simply track a counter per value,
initialized to zero. Whenever the machine backend uses a register input (as
opposed to using a constant value directly, or incorporating the producing
instruction&#x27;s operation), it notifies the lowering driver that this register
has been used.&lt;&#x2F;p&gt;
&lt;p&gt;We must see uses before defs for this to work. Thus, we iterate over
the function body &quot;backward&quot;. Specifically, we iterate in
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Depth-first_search#Vertex_orderings&quot;&gt;postorder&lt;&#x2F;a&gt;;
this way, all instructions are seen before instructions that
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Dominator_(graph_theory)&quot;&gt;dominate&lt;&#x2F;a&gt;
them, so given SSA form, we see uses before defs.&lt;&#x2F;p&gt;
&lt;p&gt;Finally, we have to consider side-effects carefully. This matters in two ways.
First, if an instruction has a side-effect, then we must lower it into VCode
even if its result(s) have no uses. Second, we cannot allow an operation to be
merged into another if this would move a side-effecting operation over another
or alter whether it might execute. We ensure side-effect correctness with a
&quot;coloring&quot; scheme: in a forward pass, we assign a color to every instruction,
updating the color at every side effect and at every new basic block. A
producing instruction is then considered for merging into its consumer only if
it has no side effects (hence it can always be moved) or if it has the same
color as the consumer (hence it would not be moved over another side
effect).&lt;&#x2F;p&gt;
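&lt;p&gt;The coloring pass itself is tiny. Here is an illustrative sketch (invented names and a fixed-size toy block, not the real Cranelift code) of the forward pass and the resulting merge check:&lt;&#x2F;p&gt;

```rust
// Sketch of the side-effect coloring described above. Instruction i
// defines value i; effects[i] says whether instruction i has a side
// effect. Names and the fixed size N are illustrative only.
const N: usize = 6;

// Forward pass: every instruction gets the current color; each side
// effect starts a new color for the instructions after it.
fn compute_colors(effects: [bool; N]) -> [u32; N] {
    let mut colors = [0u32; N];
    let mut color = 0u32;
    for i in 0..N {
        colors[i] = color;
        if effects[i] {
            color += 1;
        }
    }
    colors
}

// A producer may merge into its consumer if it is pure (can always be
// sunk to the use) or if no side effect separates the two (same color).
fn can_merge(effects: [bool; N], colors: [u32; N], producer: usize, consumer: usize) -> bool {
    !effects[producer] || colors[producer] == colors[consumer]
}
```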
&lt;p&gt;The lowering procedure is as follows (&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;lower.rs#L693&quot;&gt;full version
here&lt;&#x2F;a&gt;):&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Compute instruction colors based on side-effects.&lt;&#x2F;li&gt;
&lt;li&gt;Allocate virtual registers to all SSA values. It&#x27;s OK if we don&#x27;t use some;
an unused virtual register will not be allocated any real register.&lt;&#x2F;li&gt;
&lt;li&gt;Iterate in postorder over instructions. If the instruction has a
side-effect, or if any of its results are used, call into the
machine backend to lower it.&lt;&#x2F;li&gt;
&lt;li&gt;Reverse the VCode instructions so that they appear in forward order. &lt;sup class=&quot;footnote-reference&quot;&gt;&lt;a href=&quot;#3&quot;&gt;3&lt;&#x2F;a&gt;&lt;&#x2F;sup&gt;&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;Easy!&lt;&#x2F;p&gt;
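&lt;p&gt;Step 3 is where the use-counting does its work. A minimal sketch of the backward walk, with invented names and a toy four-instruction block rather than Cranelift&#x27;s real data structures:&lt;&#x2F;p&gt;

```rust
// Toy backward pass: instruction i defines value i; inputs[i] lists the
// values it reads; instruction 3 has a side effect (say, a return).
// Instruction 1 is dead: nothing uses its value, so it is never lowered.
fn backward_pass() -> [bool; 4] {
    let inputs = vec![vec![], vec![0], vec![0], vec![2]];
    let side_effect = [false, false, false, true];
    let mut use_count = [0u32; 4];
    let mut lowered = [false; 4];

    // Visit backward: every use is counted before its def is visited.
    for i in (0..4).rev() {
        if side_effect[i] || use_count[i] > 0 {
            lowered[i] = true;
            // Lowering this instruction consumes its register inputs.
            // (A backend that merged an input's producer instead would
            // simply not bump that input's count.)
            for v in inputs[i].clone() {
                use_count[v] += 1;
            }
        }
    }
    lowered
}
```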
&lt;h3 id=&quot;examples&quot;&gt;Examples&lt;&#x2F;h3&gt;
&lt;p&gt;Let&#x27;s see how this works in real life! Consider the following CLIF code:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;function %f25(i32, i32) -&amp;gt; i32 {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;block0(v0: i32, v1: i32):&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v2 = iconst.i32 21&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v3 = ishl.i32 v0, v2&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  v4 = isub.i32 v1, v3&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  return v4&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;We expect that the left-shift (&lt;code&gt;ishl&lt;&#x2F;code&gt;) operation should be merged into the
subtract operation on AArch64, using the reg-reg-shift form of ALU instruction,
and indeed this happens (here I am showing the debug-dump format one can see
with &lt;code&gt;RUST_LOG=debug&lt;&#x2F;code&gt; when running &lt;code&gt;clif-util compile -d --target aarch64&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;VCode {&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Entry block: 0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;Block 0:&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (original IR block: block0)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  (instruction range: 0 .. 6)&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Inst 0:   mov %v0J, x0&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Inst 1:   mov %v1J, x1&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Inst 2:   sub %v4Jw, %v1Jw, %v0Jw, LSL 21&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Inst 3:   mov %v5J, %v4J&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Inst 4:   mov x0, %v5J&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;  Inst 5:   ret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;}&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This then passes through the register allocator, has a prologue and epilogue
attached (we cannot generate these until we know which registers are clobbered),
has redundant moves elided, and becomes:&lt;&#x2F;p&gt;
&lt;pre class=&quot;giallo z-code&quot;&gt;&lt;code data-lang=&quot;plain&quot;&gt;&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;stp fp, lr, [sp, #-16]!&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov fp, sp&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;sub w0, w1, w0, LSL 21&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;mov sp, fp&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ldp fp, lr, [sp], #16&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;
&lt;span class=&quot;giallo-l&quot;&gt;&lt;span&gt;ret&lt;&#x2F;span&gt;&lt;&#x2F;span&gt;&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;which is a perfectly valid function, correct and callable from C, on
AArch64! (We could do better if we knew that this were a leaf function and
avoided the stack-frame setup and teardown! Alas, many optimization
opportunities remain.)&lt;&#x2F;p&gt;
&lt;p&gt;There are many other examples of interesting instruction-selection cases in our
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&#x2F;filetests&#x2F;filetests&#x2F;vcode&#x2F;aarch64&quot;&gt;filetests&lt;&#x2F;a&gt;.
One of our favorite pastimes lately is to stare at disassemblies and find inefficient
translations, improving the pattern-matching as required, so these are slowly
getting better (my brilliant colleague Julian Seward has built a custom tool
that dumps the hottest basic blocks from a given JIT execution and has found
quite a number of improvements in our AArch64 and x86-64 backends).&lt;&#x2F;p&gt;
&lt;h2 id=&quot;next-efficient-code-generation-passes-and-checking-the-register-allocator&quot;&gt;Next: Efficient Code-Generation Passes, and Checking the Register Allocator&lt;&#x2F;h2&gt;
&lt;p&gt;I&#x27;ve covered a lot of ground in this post, but there&#x27;s still a lot more to say
about the new Cranelift backend framework!&lt;&#x2F;p&gt;
&lt;p&gt;In the second post, I&#x27;d like to talk about how we designed the passes &lt;em&gt;after&lt;&#x2F;em&gt;
VCode lowering to be as efficient as possible. In particular this will involve
the way in which we simplify branches, which avoids the more usual step-by-step
process of removing empty basic blocks and flipping branch conditions and
taking advantage of fallthrough paths, instead doing last-minute edits as the
binary code is being emitted (see the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;blob&#x2F;main&#x2F;cranelift&#x2F;codegen&#x2F;src&#x2F;machinst&#x2F;buffer.rs&quot;&gt;&lt;code&gt;MachBuffer&lt;&#x2F;code&gt;&lt;&#x2F;a&gt;
implementation for all the details).&lt;&#x2F;p&gt;
&lt;p&gt;Then, in the third post, I&#x27;ll talk about how I&#x27;ve used abstract interpretation
to build a symbolic checker for our register allocator, which has been
effective at finding several interesting bugs while fuzzing.&lt;&#x2F;p&gt;
&lt;p&gt;Stay tuned!&lt;&#x2F;p&gt;
&lt;p&gt;In the meantime, for any and all discussions about Cranelift, please feel free
to join us on our &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;&quot;&gt;Bytecode Alliance Zulip
chat&lt;&#x2F;a&gt; (here&#x27;s a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;bytecodealliance.zulipchat.com&#x2F;#narrow&#x2F;stream&#x2F;217117-cranelift&#x2F;topic&#x2F;blog.20post.20on.20new.20backend&quot;&gt;topic&lt;&#x2F;a&gt;
for this post)!&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;Thanks to Julian Seward and Benjamin Bouvier for reviewing this post and
suggesting several additions and corrections.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;1&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;1&lt;&#x2F;sup&gt;
&lt;p&gt;Note that this description skips several quite important steps that come
after instructions have encodings. Most importantly, we still must perform
&lt;em&gt;register allocation&lt;&#x2F;em&gt;, which chooses machine registers to hold each value in
the IR. This may involve inserting instructions as well, when values need to
be spilled to or reloaded from the stack or simply moved between registers.
Then, after several other housekeeping tasks (such as resolving branches and
optimizing their forms for the actual machine-code offsets), we can actually
use the encodings to emit machine code.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;2&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;2&lt;&#x2F;sup&gt;
&lt;p&gt;We also support a &quot;mod&quot; (modify) type of register reference that is both
a use and def, while ensuring that the same register is allocated for the
use- and the def-points. This replaces an earlier mechanism known as &quot;tied
operands&quot; that introduced an ad-hoc constraint to the register allocator.
Mods instead are handled by simply extending the live-range through the
instruction.&lt;&#x2F;p&gt;
&lt;&#x2F;div&gt;
&lt;div class=&quot;footnote-definition&quot; id=&quot;3&quot;&gt;&lt;sup class=&quot;footnote-definition-label&quot;&gt;3&lt;&#x2F;sup&gt;
&lt;p&gt;The reversal scheme is actually a bit more subtle than this. We want to
emit instructions in forward order within the lowering for a single CLIF
instruction, but we visit CLIF instructions backward. To make this work, we
keep a buffer of lowered VCode instructions per CLIF instruction in forward
order; at the end of a single CLIF instruction, these are copied in reverse
order to a buffer of lowered VCode instructions for the basic block. Because
we visit instructions within the block backward, this buffer contains the
VCode sequence for the basic block in reverse order. Then, at the end of the
block, we reverse it again onto the tail of the VCode buffer. The end result
is that we see VCode instructions in forward order for each CLIF instruction
in forward order, contained within basic blocks in forward order (phew!).&lt;&#x2F;p&gt;
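&lt;p&gt;In miniature, with strings standing in for VCode instructions (an illustrative sketch, not the real buffering code):&lt;&#x2F;p&gt;

```rust
// The double reversal: per-CLIF-instruction VCode is copied in reverse
// into a per-block buffer; the block buffer is reversed once more at
// the end, restoring forward order overall.
fn lower_block() -> String {
    // VCode for each CLIF instruction, in forward order; the CLIF
    // instructions themselves are visited backward.
    let per_clif = vec![
        vec!["const", "add"], // CLIF inst A lowers to two VCode insts
        vec!["mul"],          // CLIF inst B lowers to one
    ];
    let mut block_buf = vec![];
    for clif in per_clif.iter().rev() {
        for vcode in clif.iter().rev() {
            block_buf.push(*vcode);
        }
    }
    // block_buf is now fully reversed; reverse once more to restore
    // forward order for the whole block.
    block_buf.reverse();
    block_buf.join(" ")
}
```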
&lt;&#x2F;div&gt;
</content>
    </entry>
    <entry xml:lang="en">
        <title>blog.cfallin is live!</title>
        <published>2020-09-17T00:00:00+00:00</published>
        <updated>2020-09-17T00:00:00+00:00</updated>
        <author>
            <name></name>
        </author>
        <link rel="alternate" type="text/html" href="https://cfallin.org/blog/first-post/"/>
        <id>https://cfallin.org/blog/first-post/</id>
        <content type="html" xml:base="https://cfallin.org/blog/first-post/">&lt;p&gt;Hello, and welcome to &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;cfallin.org&#x2F;blog&#x2F;&quot;&gt;blog.cfallin&lt;&#x2F;a&gt;! I&#x27;ve thought for
a while that it might be nice to share, occasionally, some thoughts on
whatever technical tidbits interest me. This blog will likely be home to
assorted ramblings on compilers, runtimes, and the like; you can find a bit
more about my background at &lt;a href=&quot;&#x2F;about&#x2F;&quot;&gt;&#x27;About&#x27;&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;My first post, coming soon, will be about the new compiler backend framework
I&#x27;ve developed (along with extremely capable co-conspirators) for
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;bytecodealliance&#x2F;wasmtime&#x2F;tree&#x2F;main&#x2F;cranelift&quot;&gt;Cranelift&lt;&#x2F;a&gt;,
a compiler in Rust that will soon be used in production in Firefox, among other
places. Stay tuned.&lt;&#x2F;p&gt;
</content>
    </entry>
</feed>
