Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@davmason
Copy link

@davmason davmason commented Sep 16, 2019

This is still a work in progress, there are some failing tests and I still need to run perf tests on it, but I want to get eyes on it as early as possible.

This change does three things:

  • Remove nop padding from R2R images
  • Remove jumpstamps from the runtime and switch rejit to use exclusively precodes
  • Merge my changes with Kount's recent changes in code versioning

Remaining work before it's ready to merge:

  • Get all tests passing
  • Hopefully refactor ReJitManager::UpdateActiveILVersions so it no longer requires a runtime suspension (moved to a separte issue #26790)
  • Run perf tests on it to prove there are no startup regressions

@davmason davmason added this to the 5.0 milestone Sep 16, 2019
@davmason davmason self-assigned this Sep 16, 2019
@jkotas
Copy link
Member

jkotas commented Sep 17, 2019

This change needs READYTORUN_MAJOR_VERSION bump since it changes the contract of precompiled code. The existing runtimes are not able to run the new code anymore.

@davmason
Copy link
Author

@jkotas I don't need to bump MINIMUM_READYTORUN_MAJOR_VERSION though, right? Since the runtime can still use the images with nop padding in them, it's just old runtimes can't use the new ones.

@jkotas
Copy link
Member

jkotas commented Sep 17, 2019

I don't need to bump MINIMUM_READYTORUN_MAJOR_VERSION though, right?

Right.

@davmason
Copy link
Author

@jkotas @noahfalk I was running in to failures where we would have an R2R image with methods that had precodes baked in but not native code slots. When tests disabled tiered compilation and rejit (not a lot, but there are a few) initial jitting would fail for these methods because we expect if there is a precode we also have a native code slot. My latest commit e20fac2 changes it so if we generate a precode in a saved MethodDesc we also generate a saved native code slot.

I haven't been able to convince myself that this is always the right thing to do, but it does make the tests pass. I'd appreciate any input on this.

@kouvel
Copy link

kouvel commented Sep 18, 2019

failures where we would have an R2R image with methods that had precodes baked in but not native code slots

I wasn't aware that some MethodDescs are baked into R2R images, I thought that was only done for NGen and was avoided in R2R to have better versionability, though it would be reasonable for CoreLib methods. What type of methods are these?

if we generate a precode in a saved MethodDesc we also generate a saved native code slot

I don't see an issue with adding a native code slot other than some size increase. I guess it would depend on what information is available at the time, if it's possible to check the preconditions that would satisfy versionability excluding run-time components then it could be limited to only those cases. My understanding is that there are many types of methods that get precodes and are not versionable (stubs for instance), so not sure about the extent of the issue.

@jkotas
Copy link
Member

jkotas commented Sep 18, 2019

I wasn't aware that some MethodDescs are baked into R2R images, I thought that was only done for NGen and was avoided in R2R to have better versionability,

Correct. MethodDescs are never baked into R2R images, not even for CoreLib.

davmason and others added 3 commits September 18, 2019 16:48
Co-Authored-By: Koundinya Veluri <[email protected]>
This reverts commit a60bb59.
Co-Authored-By: Koundinya Veluri <[email protected]>
Copy link

@kouvel kouvel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so far

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far, given the draft status I wasn't sure if any significant change was still to come.


#ifdef FEATURE_CODE_VERSIONING
if (IsVersionableWithoutJumpStamp())
if (IsVersionable())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsVersionableWithoutJumpStamp() was previously a single bit check and now IsVersionable() aggregates a variety of different checks. I think @kouvel had the bit cached because the checks were being called frequently enough that the expense of the uncached check became noticeable. When you profile the new code you and @kouvel may want to keep an eye if the expense of IsVersionable() merits any change in the caching strategy.

Copy link

@kouvel kouvel Sep 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for calls to IsVersionableWith*() functions. Now for methods that are not eligible for tiered compilation it would also evaluate IsEligibleForReJIT(). I think ReJitManager::IsReJITEnabled() could be made faster by caching the configured value for CLRConfig::EXTERNAL_ProfAPI_RejitOnAttach in EEConfig as done for other config vars. That would turn IsEligibleForReJIT() into just a few checks on memory up to the point where IsEligibleForReJIT() returns false, which hopefully won't be noticeable.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be possible to turn IsReJITEnabled() into a single check. IsEligibleForTieredCompilation() is called directly in many places, so it may still be beneficial to continue having the cached bit represent that instead of IsVersionable().

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually looks like EXTERNAL_ProfAPI_RejitOnAttach is already cached in a static local variable, so it may be ok already

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of the checks in ReJitManager::IsReJITEnabled are cached in local statics, so I don't believe there should be a perf concern here. Doing the perf testing will prove that right or wrong.

@davmason davmason marked this pull request as ready for review September 19, 2019 19:28
@davmason
Copy link
Author

I just removed the draft status, I was originally intending to do some more work and hopefully remove the runtime suspension during a ReJIT, but I think this change is big enough and complicated enough on its own. Filed #26790 to track the additional work.

// rejitted code, or to reinstate original code on a revert).
else if ((g_pConfig->GenOptimizeType() != OPT_SIZE) ||
pMD->IsVersionableWithJumpStamp())
else if ((g_pConfig->GenOptimizeType() != OPT_SIZE))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change will save 4 bytes of code per method in crossgen images.
And never emitting the variable sized nops in the method prolog will likely save another one or two bytes per method.
It would be interesting to get the codesize of the larger crossgen images before and after your change,

(pMDMethod->GetMethodType() == METHOD_TYPE_NORMAL || pMDMethod->GetMethodType() == METHOD_TYPE_INSTANTIATED))

#ifdef FEATURE_REJIT
||

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of all these compound && and || clauses.

My preference is to write code that checks one or two things then returns true or false.
And repeat.

So:

if (!g_pConfig->TieredCompilation())
{
    return false;
}
 // Policy - If QuickJit is disabled and the module does not have any pregenerated code, the method would       
// be ineligible for tiering currently to avoid some unnecessary overhead
if (!g_pConfig->TieredCompilation_QuickJit() && !GetModule()->HasNativeOrReadyToRunImage())
{
    rerturn false;
}

@davmason
Copy link
Author

davmason commented Sep 27, 2019

I ran startup perf tests and see no regressions (details below). At this point I have responded to all feedback and this is in a state where I feel confident checking it in. If there are any outstanding issues, please let me know so I can address them.

cc @brianrob @billwert

Here is the size on disk change (if you want to see other assemblies, let me know)

System.Private.Corelib size
  Before:     9,296 KB on disk
  After :     9,262 KB on disk

Savings: 34 KB

All perf tests were run warm by running for 2 iterations and discarding the results before collecting the data below. The baseline here is master before my changes.

The high level summary is I ran startup tests for WPF, ASP.Net and Powershell and all results were within the noise - no gains, but no losses either.

Here is the timing for the WPF app Bill provided:

    Before:   
              601 msec
              582 msec
              620 msec
              587 msec
              575 msec
              Average: 593 msec
    After :   
              611 msec
              594 msec
              582 msec
              588 msec
              583 msec
              Average: 592 msec

I tested powershell using instructions from @kouvel, grabbing the latest powershell build and replacing the coreclr binaries with those built from master. The script is as follows (it measures the average time to start up powershell and exit):

$iterationCount = 32
$totalDuration = 0.0

1..$iterationCount | % {
  $duration = (Measure-Command { .\pwsh\pwsh.exe -noprofile -c exit }).TotalMilliseconds
  $duration.ToString("0.000")
  $totalDuration += $duration
}

"Average: " + ($totalDuration / $iterationCount).ToString("0.000")

Here is the timings:

Baseline powershell:
  309.215
  293.471
  295.329
  298.485
  290.840
  307.471
  302.760
  309.857
  302.411
  314.054
  302.502
  295.069
  332.554
  305.581
  291.474
  304.776
  299.794
  317.381
  293.785
  295.340
  302.556
  292.813
  305.382
  293.320
  292.424
  301.300
  293.092
  315.784
  292.458
  297.204
  292.774
  293.598
  Average: 301.089

With my changes:
  300.420
  294.193
  297.597
  290.947
  301.054
  313.396
  292.750
  299.602
  292.826
  305.875
  295.943
  293.836
  293.632
  292.098
  301.508
  293.552
  297.416
  296.621
  314.508
  295.243
  297.459
  325.361
  307.380
  303.980
  302.602
  290.661
  325.002
  294.576
  295.850
  292.906
  316.386
  294.502
  Average: 300.303

To test ASP.NET startup I used MusicStore, although none of the automated ways to run it appear to be working so I had to manually build it.


MusicStore baseline:
  D:\jumpstamp_binaries\musicstore_bin>CoreRun.exe MusicStore.dll
  Server started in 704ms

  Starting request to http://localhost:5000
  Response: OK
  Request took 442ms

  Cold start time (server start + first request time): 1146ms


  Batch 0: running 1001 requests
  Steadystate min response time: 2.00ms
  Steadystate max response time: 237.00ms
  Steadystate average response time: 3.45ms
  Steadystate median response time: 3.04ms
  Batch 1: running 1001 requests
  Steadystate min response time: 2.00ms
  Steadystate max response time: 5.00ms
  Steadystate average response time: 2.89ms
  Steadystate median response time: 2.78ms
  Batch 2: running 1001 requests
  Steadystate min response time: 2.00ms
  Steadystate max response time: 4.00ms
  Steadystate average response time: 2.87ms
  Steadystate median response time: 2.78ms
  Batch 3: running 1001 requests
  Steadystate min response time: 2.00ms
  Steadystate max response time: 144.00ms
  Steadystate average response time: 3.06ms
  Steadystate median response time: 2.83ms
  Batch 4: running 1001 requests
  Steadystate min response time: 2.00ms
  Steadystate max response time: 4.00ms
  Steadystate average response time: 2.86ms
  Steadystate median response time: 2.77ms

MusicStore with Changes:
  D:\jumpstamp_binaries\musicstore_bin>CoreRun.exe MusicStore.dll
  Server started in 707ms

  Starting request to http://localhost:5000
  Response: OK
  Request took 449ms

  Cold start time (server start + first request time): 1156ms


  Batch 0: running 1001 requests
  Steadystate min response time: 2.00ms
  Steadystate max response time: 37.00ms
  Steadystate average response time: 3.34ms
  Steadystate median response time: 3.09ms
  Batch 1: running 1001 requests
  Steadystate min response time: 2.00ms
  Steadystate max response time: 221.00ms
  Steadystate average response time: 3.11ms
  Steadystate median response time: 2.81ms
  Batch 2: running 1001 requests
  Steadystate min response time: 2.00ms
  Steadystate max response time: 4.00ms
  Steadystate average response time: 2.85ms
  Steadystate median response time: 2.74ms
  Batch 3: running 1001 requests
  Steadystate min response time: 2.00ms
  Steadystate max response time: 5.00ms
  Steadystate average response time: 2.83ms
  Steadystate median response time: 2.73ms
  Batch 4: running 1001 requests
  Steadystate min response time: 2.00ms
  Steadystate max response time: 163.00ms
  Steadystate average response time: 3.01ms
  Steadystate median response time: 2.77ms

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (always satisfying to see all the red)

@davmason davmason merged commit 7fd7985 into dotnet:master Sep 30, 2019
@davmason davmason deleted the jumpstamp_merge branch September 30, 2019 22:05
Copy link

@briansull briansull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants