# Choosing the right defaults for tiered compilation

Tiered compilation (TC) is a runtime feature that is able to control the compilation speed and quality of the JIT to achieve various performance outcomes. It is enabled by default in .NET Core 3.0 builds. We are considering what the default TC configuration should be for the final 3.0 release. We have been investigating the performance impact (positive and/or negative) for a variety of application scenarios, with the goal of selecting a default that is good for all scenarios, and providing configuration switches to enable developers to opt apps into other configurations.

We would like your feedback on this exercise and want to share how we are thinking about TC currently.

## TC Feature Explained (briefly)

TC is based on the underlying [re-jit capability](https://github.com/dotnet/coreclr/blob/master/Documentation/Profiling/davbr-blog-archive/ReJIT%20-%20The%20Basics.md) in the runtime, which enables methods to be compiled more than once (typically with different code). The re-jit capability was initially built to support instrumenting profilers.

The fundamental benefit of TC is that it lets the runtime (re-)jit methods either quickly, producing lower-quality code, or more slowly, producing higher-quality code. This improves the performance of an application as it goes through various stages of execution, from startup through steady state. This contrasts with the non-TC approach, where every method is compiled a single way (the same as the high-quality tier), which biases toward steady-state over startup performance.

TC isn't solely about jitted code. TC is able to re-jit R2R code to higher-quality jitted code. Ahead-of-time compiled [ready-to-run (R2R) images](https://github.com/dotnet/coreclr/blob/master/Documentation/botr/readytorun-overview.md) are biased towards startup performance, and are worse for steady-state performance than high-quality jitted code. This capability of TC can significantly improve steady-state performance for compute-intensive applications like web servers.

Only methods that are called multiple times are re-jitted, once the number of calls to a method reaches a threshold, currently defined as [30 calls](https://github.com/dotnet/coreclr/blob/58d9cf157f54e8fd61eaaf56b3f8045075d171cd/src/inc/clrconfigvalues.h#L649). Many methods are called only a few times and don't warrant optimization.

We call code that is either already available (specifically R2R code) or can be inexpensively produced at startup "tier 0". We call optimized code that is generated after startup "tier 1". Tier 1 code is the code that is generated after a method has been called multiple times, as described above.

At startup, tier 0 code can be one of the following:

* Ahead-of-time compiled R2R code.
* Tier 0 jitted code, produced by "Quick JIT". Quick JIT applies fewer optimizations (similar to "minopts") to compile code faster.
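
The tiering rules described in this section can be sketched as a small model. This is only an illustration of the description above, not the runtime's implementation: the `Method` class and the tier labels are hypothetical names, and the sketch ignores background compilation and any timing behavior in the real runtime.

```python
from dataclasses import dataclass

# Call-count threshold cited above (30 calls), treated here as a plain constant.
TIER1_CALL_THRESHOLD = 30


@dataclass
class Method:
    """Hypothetical model of one method's tiering state (not a runtime type)."""
    name: str
    has_r2r_code: bool  # True if precompiled R2R code is available
    call_count: int = 0
    tier: str = ""      # "tier0-r2r", "tier0-quickjit", or "tier1"

    def __post_init__(self):
        # At startup, use R2R code if present; otherwise Quick JIT the method.
        self.tier = "tier0-r2r" if self.has_r2r_code else "tier0-quickjit"

    def call(self) -> str:
        """Record one call and promote to tier 1 once the threshold is reached."""
        self.call_count += 1
        if self.tier != "tier1" and self.call_count >= TIER1_CALL_THRESHOLD:
            self.tier = "tier1"
        return self.tier


# A hot method is promoted; a method called once stays at tier 0.
hot = Method("Parse", has_r2r_code=False)
for _ in range(TIER1_CALL_THRESHOLD):
    hot.call()
cold = Method("Init", has_r2r_code=True)
cold.call()
print(hot.tier, cold.tier)  # prints "tier1 tier0-r2r"
```

In this model, a short-lived process whose methods never reach the threshold runs entirely on tier 0 code.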

## Context

We [first introduced TC with .NET Core 2.1](https://devblogs.microsoft.com/dotnet/tiered-compilation-preview-in-net-core-2-1/). We intended at that time to enable TC by default. We found regressions with some ASP.NET benchmarks, so opted to leave the feature off by default. We have heard that some users (including Microsoft products) have enabled TC based on observed benefits. That's great, and is part of the information we are collecting to make the decision on how to configure TC for 3.0.

As part of the .NET Core 3.0 release, we have [invested significant effort into improving and optimizing TC](https://github.com/dotnet/coreclr/labels/area-TieredCompilation), again with the goal of enabling TC by default. At this point, we are focussed less on further improvements to TC and more on the final ship configuration.

Recently, we saw a report of [concerning performance with TC and AWS Lambda](https://twitter.com/zaccharles/status/1108182711573905408). We are working with both [Zac Charles](https://twitter.com/zaccharles) and [Norm Johanson](https://twitter.com/socketnorm) to better understand the results and try the same testing with more real-world Lambda applications. Zac and Norm have been excellent to work with. Major kudos to Zac for all the leg-work he's done helping us! Note that the results in the blog post were based on a Lambda application that just calls `ToUpper()` on a string. It doesn't make sense to base our analysis solely on an application that small.

**Updated:** [Benchmarking .NET Core 3.0-preview4 on Lambda](https://medium.com/@zaccharles/benchmarking-net-core-3-0-preview4-on-lambda-24bde8bb3712)

We have started a conversation with the Azure Functions team to see if similar benchmarks produce similar results in that environment. The Functions team told us that they tried TC with .NET Core 2.1 and opted not to enable it because they didn't see a benefit in their testing. However, they are about to start testing .NET Core 3.0. We will work with the Functions team to specifically look at the impact of TC on their performance benchmarks.

We're not making .NET Core product decisions exclusively for the serverless application type. However, the post that Zac wrote and other community feedback ([example](https://stackoverflow.com/questions/54353643/how-to-disable-the-coreclr-tiered-compilation)) made us ask a few questions:

* Is TC a good feature to have enabled by default? Is it generally beneficial or does it only show benefits with certain types of applications?
* Is TC bad for people benchmarking with .NET Core? Will they need to read documentation to benchmark .NET Core correctly, specifically to account for TC?
* Almost all of the TC investigations have been on web apps. What about WPF and Windows Forms client applications, which are new in 3.0? What about more sophisticated console apps like PowerShell? What about constrained devices like Raspberry Pi or Docker containers with <= 1 CPU allocated?

The rest of this doc details our plan for answering these questions, and for using the performance data we generate to define a final configuration for TC for .NET Core 3.0.

## Desired Outcomes

First, we'll start with the characteristics we would want to see in order to make TC default.

* No or limited regressions (<5% due to TC, with 3.0 with TC disabled as the baseline); regressions could be in startup time, steady-state throughput, allocation rates, memory usage, ...
* Significant improvements for some scenarios with a bias to steady state execution (for example, as measured by RPS for web apps)
* Developers benchmarking .NET Core do not need to read documentation to get accurate results
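
As a minimal sketch of how the "<5%" budget above could be checked against the TC-disabled baseline, the helper below computes a regression percentage for a single metric. The function names and the sample numbers are hypothetical; they are not taken from any measurement in this issue.

```python
def regression_percent(baseline: float, candidate: float, higher_is_better: bool) -> float:
    """Return how much worse `candidate` is than `baseline`, in percent.

    Positive values are regressions, negative values are improvements.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    change = (candidate - baseline) / baseline * 100.0
    # For metrics like RPS, a drop is a regression, so flip the sign.
    return -change if higher_is_better else change


def within_budget(baseline: float, candidate: float, higher_is_better: bool,
                  budget_percent: float = 5.0) -> bool:
    """True if the regression stays under the budget (5% by default)."""
    return regression_percent(baseline, candidate, higher_is_better) < budget_percent


# Hypothetical numbers: startup time in ms (lower is better), throughput in RPS.
startup_ok = within_budget(baseline=200.0, candidate=208.0, higher_is_better=False)
rps_ok = within_budget(baseline=1000.0, candidate=940.0, higher_is_better=True)
print(startup_ok, rps_ok)  # prints "True False"
```

Keeping the direction of each metric explicit avoids treating a throughput gain as a regression, or a startup slowdown as an improvement.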

## Define Performance Baselines

* **2.2 Customer default** -- R2R enabled, TC disabled
* **2.2 TechEmpower configuration** -- R2R disabled, TC disabled

## Measurement Modes

* TC enabled (same as Preview 3 default)
* TC enabled, QuickJit disabled (same as Preview 4 default)
* TC disabled (same as 2.2 default)
* R2R disabled, TC disabled (same as 2.2 TechEmpower configuration)
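
One way to drive the four modes from a test harness is to map each one to a set of runtime environment variables. The sketch below uses the `COMPlus_TieredCompilation`, `COMPlus_TC_QuickJit`, and `COMPlus_ReadyToRun` names as we understand them for .NET Core 3.0. Treat the names and values as assumptions and confirm them against the runtime configuration documentation for the build under test. The mode keys and the `env_for_mode` helper are hypothetical.

```python
import os

# Assumed environment variable names and values; verify for your runtime version.
MEASUREMENT_MODES = {
    "tc_enabled": {
        "COMPlus_TieredCompilation": "1",
        "COMPlus_TC_QuickJit": "1",
    },
    "tc_enabled_quickjit_disabled": {
        "COMPlus_TieredCompilation": "1",
        "COMPlus_TC_QuickJit": "0",
    },
    "tc_disabled": {
        "COMPlus_TieredCompilation": "0",
    },
    "r2r_disabled_tc_disabled": {
        "COMPlus_TieredCompilation": "0",
        "COMPlus_ReadyToRun": "0",
    },
}


def env_for_mode(mode, base_env=None):
    """Return a copy of the base environment with the mode's settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(MEASUREMENT_MODES[mode])
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the application under test, so the same benchmark runs once per mode without changing the app itself.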

## Action Plan

We intend to make a decision on the .NET Core default mode for TC in May or June. We will use the following action plan. 

Measure cold startup, warm startup, throughput and working set, in the defined measurement modes, for a broad set of applications:

* UI client apps: WPF, Windows Forms, UWP
* Console apps: Roslyn compiler (compiling Roslyn), and PowerShell (pure startup and long-running script)
* ASP.NET: TechEmpower and Music Store
* Serverless: Azure Functions and AWS Lambda
* PaaS: Azure websites

Note: some performance metrics may not be critical/relevant for all application types.

Execution plan:

* Collect and publish performance data
* Investigate anomalies
* Consider experiments to improve results, then rinse and repeat; some experiments will need to be postponed until a later release
* Make changes in a preview and watch for feedback
* Document the final decision with recommendations, as appropriate

Desired community engagement:

* Provide general feedback, preferably with data justifying viewpoints
* Run performance tests and report results and any associated analysis (file issues on dotnet/coreclr repo)

## Theories and Thoughts

We have developed a few theories. They are not guiding the investigation, but are ideas that we want to prove or disprove.

* The AWS Lambda throughput benchmarks are negatively impacted by Quick JIT. The Lambda environment is very constrained, resulting in poor throughput for an extended time, until tier 1 code can be generated. Some applications running in such an environment may never reach optimal execution because they may not run long enough.
* System.Private.Corelib.dll was moved from being compiled with fragile NGEN to the ready-to-run (R2R) format in .NET Core 3.0. We believe that some startup performance regressions are due to this change.

## Key Resources

* CoreCLR PR: https://github.com/dotnet/coreclr/pull/23599
* SDK PR: https://github.com/dotnet/sdk/pull/3064

