
Conversation

@AndyAyersMS
Member

Stop double-counting some locals when computing how many slots need to be
zeroed in the prolog.

Bump threshold for rep stosb style init from 16 bytes to 32 bytes on x64.

@AndyAyersMS
Member Author

Follow-up to comments at tail end of #23403.

Upshot is that we will now use multiple stores to zero up to 32 bytes on a 64-bit platform, instead of a loop or rep stos.
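
To make the tradeoff concrete, here is a minimal sketch of the new heuristic in C++ (ZeroFrameSlots and its shape are illustrative only, not the JIT's actual code; assumes the region size is a multiple of 8):

    #include <cstdint>
    #include <cstring>

    void ZeroFrameSlots(uint8_t* lo, uint8_t* hi)
    {
        size_t size = static_cast<size_t>(hi - lo);
        if (size <= 32)
        {
            // Below the threshold the JIT now emits individual
            // register-width stores (mov qword ptr [fp+off], zeroReg).
            for (uint8_t* p = lo; p < hi; p += sizeof(uint64_t))
            {
                std::memset(p, 0, sizeof(uint64_t)); // one 8-byte store
            }
        }
        else
        {
            // Larger regions still get a block init (rep stos or a loop).
            std::memset(lo, 0, size);
        }
    }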

Still trying to get a fair evaluation of the perf impact. On a simple test like:

    [MethodImpl(MethodImplOptions.NoInlining)]
    private void Dummy(Span<byte> a, Span<byte> b) { }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void StackZero() { Span<byte> a, b; a = bytes; b = bytes; Dummy(a, b); }

it is substantial, about a 2x improvement. Wider tests (using the performance repo) show a mix of good and bad results (more good than bad); I am working through some of the cases and will summarize when I have something interesting to say.

Code size is generally improved:

Total bytes of diff: -20050 (-0.05% of base)
    diff is an improvement.

Top file regressions by size (bytes):
         122 : CommandLine.dasm (0.03% of base)
          45 : NuGet.Packaging.dasm (0.03% of base)
          19 : System.ComponentModel.Annotations.dasm (0.04% of base)
          11 : Microsoft.DotNet.InternalAbstractions.dasm (0.22% of base)
           6 : NuGet.Packaging.Core.Types.dasm (0.04% of base)

Top file improvements by size (bytes):
       -3849 : System.Private.CoreLib.dasm (-0.09% of base)
       -3071 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.05% of base)
       -2137 : Microsoft.CodeAnalysis.CSharp.dasm (-0.05% of base)
       -1266 : Microsoft.CodeAnalysis.dasm (-0.07% of base)
        -745 : System.Collections.Immutable.dasm (-0.07% of base)

98 total files with size differences (84 improved, 14 regressed), 31 unchanged.

Top method regressions by size (bytes):
          66 ( 2.10% of base) : System.Linq.Parallel.dasm - NullableDecimalAverageAggregationOperatorEnumerator`1:MoveNextCore(byref):bool:this (7 methods)
          66 ( 2.18% of base) : System.Linq.Parallel.dasm - NullableDecimalSumAggregationOperatorEnumerator`1:MoveNextCore(byref):bool:this (7 methods)
          14 ( 0.30% of base) : CommandLine.dasm - Result`2:ToString():ref:this (7 methods)
          14 ( 0.41% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - MethodToClassRewriter`1:VisitMethodSymbol(ref):ref:this (7 methods)
          14 ( 0.23% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SynthesizedWrapperMethod:.ctor(ref,ref,ref,ref):this (7 methods)

Top method improvements by size (bytes):
         -96 (-1.01% of base) : System.Private.CoreLib.dasm - MemoryExtensions:TrimEnd(struct,struct):struct (29 methods)
         -91 (-9.35% of base) : Microsoft.DotNet.Cli.Utils.dasm - Result`1:AsValue(ubyte):struct:this (7 methods)
         -91 (-9.29% of base) : Microsoft.DotNet.Cli.Utils.dasm - Result`1:AsValue(short):struct:this (7 methods)
         -91 (-9.63% of base) : Microsoft.DotNet.Cli.Utils.dasm - Result`1:AsValue(int):struct:this (7 methods)
         -91 (-9.56% of base) : Microsoft.DotNet.Cli.Utils.dasm - Result`1:AsValue(double):struct:this (7 methods)

Top method regressions by size (percentage):
          11 ( 3.90% of base) : System.Collections.dasm - TreeSubSet:get_MinInternal():struct:this
          11 ( 3.90% of base) : System.Collections.dasm - TreeSubSet:get_MaxInternal():struct:this
           6 ( 3.49% of base) : Microsoft.DotNet.Cli.Utils.dasm - DebugHelper:WaitForDebugger()
           5 ( 3.11% of base) : Microsoft.DotNet.InternalAbstractions.dasm - RuntimeEnvironment:GetRIDArch():ref
          11 ( 2.65% of base) : System.Net.Security.dasm - SslStreamPal:AcquireCredentialsHandle(ref,int,int,bool):ref

Top method improvements by size (percentage):
         -17 (-24.29% of base) : System.Security.Cryptography.Algorithms.dasm - AlgorithmIdentifierAsn:HasNullEquivalentParameters():bool:this
         -17 (-24.29% of base) : System.Security.Cryptography.Cng.dasm - AlgorithmIdentifierAsn:HasNullEquivalentParameters():bool:this
         -17 (-23.61% of base) : System.Private.CoreLib.dasm - TimeSpanParse:TryParseTimeSpanConstant(struct,byref):bool
         -12 (-17.65% of base) : System.Data.Common.dasm - SqlString:ToSqlBoolean():struct:this
         -12 (-17.65% of base) : System.Data.Common.dasm - SqlString:ToSqlByte():struct:this

4516 total methods with size differences (3603 improved, 913 regressed), 195607 unchanged.

Regressions are from cases where the individual stores end up using large frame offsets that require longer, multi-byte displacement encodings. Improvements are from not needing to preserve as many registers in some cases (the rep stos sequence has to set up rdi, rcx, and rax).

Not considering xmm init just yet, as that is a more involved change, and post-3.0 we might want to unify (or at least share) the prolog-init strategy with initBlk.

Also not (yet) considering ordering the stores by increasing offset, or being smarter about non-initlocals cases where we have structs with non-GC fields.

cc @dotnet/jit-contrib

    /* If we have more than 4 untracked locals, use block initialization */
    /* TODO-Review: If we have large structs, bias toward not using block initialization since
       we waste all the other slots. Really need to compute the correct
       and compare that against zeroing the slots individually */

I do not quite understand this comment... Could you please improve it with more details if it is still useful?

Member


I agree. I don't understand why slots need to be zeroed individually. This should just be linear stack space, and it is all being zeroed, so you should be able to zero it however is most efficient for that size.

Member Author


Not sure I completely understand it either -- I think the comment is saying we should do a more thorough examination of GC structs and bias towards field-by-field zeroing when there are large GC structs with relatively few GC fields (at least when we're not also zeroing the non-GC fields).


If no one really understands it fully (I don't), then perhaps we should reword the comment to simply say what you just said.


@briansull left a comment


Looks Good

    if (!counted)
    {
        initStkLclCnt += varDsc->lvStructGcCount;
    }


You might want to set counted to true here.


Not really necessary, since the compiler will eliminate it as unused in the remainder of the loop body, but it might be prudent to go ahead and set it, with a comment, in the event that further changes are made at some point (that would cause us to want to query counted downstream).
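
For readers following along, here is a self-contained sketch of the counting pattern this thread discusses (FakeLcl and CountInitSlots are made up for illustration; in the JIT the loop walks lvaTable and the GC count comes from lvStructGcCount). It includes the suggested defensive counted = true:

    struct FakeLcl
    {
        bool     mustInitWholeLocal; // zeroed as an untracked stack local
        unsigned slotCount;          // size in sizeof(int) slots
        unsigned structGcCount;      // GC fields needing zeroing
    };

    unsigned CountInitSlots(const FakeLcl* lcls, unsigned count)
    {
        unsigned initStkLclCnt = 0;
        for (unsigned i = 0; i < count; i++)
        {
            const FakeLcl& lcl     = lcls[i];
            bool           counted = false;

            if (lcl.mustInitWholeLocal)
            {
                initStkLclCnt += lcl.slotCount;
                counted = true; // set even though unused below, per the suggestion
            }

            // The fix: only add the GC-field count if the local was not
            // already counted in full above (previously it was added
            // unconditionally, double-counting those slots).
            if (!counted)
            {
                initStkLclCnt += lcl.structGcCount;
            }
        }
        return initStkLclCnt;
    }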


@CarolEidt left a comment


LGTM


    for (varNum = 0, varDsc = compiler->lvaTable; varNum < compiler->lvaCount; varNum++, varDsc++)
    {
        bool counted = false;


This should have a comment. It's not immediately clear that there are multiple ways that we might count a local as requiring initialization, so we don't want to count it twice.


@AndyAyersMS
Member Author

Most significant improvements (from performance repo CoreCLR tests). Filtered out a few by hand that did not seem credible.

| Benchmark | Faster base/diff | Base Median (ns) | Diff Median (ns) |
| --- | ---: | ---: | ---: |
| System.Memory.MemoryMarshal<Int32>.TryGetArray | 2.78 | 9.63 | 3.47 |
| System.Memory.MemoryMarshal<Byte>.TryGetArray | 2.21 | 9.81 | 4.44 |
| System.Tests.Perf_String.Insert(s1: "Test", i: 2, s2: " Test") | 1.92 | 23.70 | 12.36 |
| System.Collections.IndexerSet.Array(Size: 512) | 1.64 | 278.27 | 169.46 |
| System.Tests.Perf_String.Insert(s1: "dzsdzsDDZSDZSDZSddsz", i: 7, s2: "Test") | 1.63 | 22.64 | 13.90 |
| System.Collections.CopyTo<Int32>.ReadOnlyMemory(Size: 2048) | 1.58 | 157.75 | 99.91 |

Still sorting through potential regressions...

    //
    // Secondary factor is the presence of large structs that
    // potentially only need some fields set to zero. We likely don't
    // model this very well, but have left the logic as is for now.


This is great - much more meaningful and useful!



👍

    //
    // Primary factor is the number of slots that need zeroing. We've
    // been counting by sizeof(int) above. We assume for now we can
    // only zero register width bytes per store.
Member


Is this integer register width bytes per store (i.e. 32-bit or 64-bit) or does it also take into account other registers (such as xmm which is 128-bit or ymm which is 256-bit)?

Member Author


Currently just integer registers.

xmm would likely offer further improvement, especially if there's no need for a cleanup store. But that would be a somewhat more involved change, and we're trying hard to close down jit work for 3.0. If we get to that point, we should also think more about the strategy here and whether (and how) we could share logic or code with the other block inits we do.

For ymm, potential is less clear; the stack is only guaranteed to be 16 byte aligned.
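
For illustration, a hedged sketch of what 16-byte xmm zeroing could look like (ZeroWithXmm is hypothetical; this PR does not emit SIMD stores). The aligned stores rely on exactly the 16-byte stack alignment guarantee mentioned above:

    #include <emmintrin.h>
    #include <cstddef>

    // Hypothetical SSE2 frame zeroing; assumes dst is 16-byte aligned and
    // size is a multiple of 16, matching the stack alignment guarantee.
    void ZeroWithXmm(void* dst, size_t size)
    {
        __m128i zero = _mm_setzero_si128();
        for (size_t i = 0; i < size; i += 16)
        {
            _mm_store_si128(reinterpret_cast<__m128i*>(static_cast<char*>(dst) + i), zero);
        }
    }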

Member


Might be worth logging a tracking issue for future investigation.

I was mostly just wondering since this just says "register width" and the clarification is useful 😄

Member Author


Think we can use #19076 to track further improvements.

Member Author


Also perhaps #13827.

@AndyAyersMS
Member Author

Newly added JIT_HardwareIntrinsics._X86_Regression_GitHub_23438 is failing for me in CI; investigating.

@AndyAyersMS
Member Author

Finished looking at a bunch of potential regressions. None of them seem to hold up under scrutiny -- many look the same or were improvements on rerun, and the remaining few that still looked worse did not have codegen diffs (one seemed to be a 32-bit alignment issue; another had GC kicking in). Will list them here in the interest of full disclosure.

| Benchmark | Slower diff/base | Base Median (ns) | Diff Median (ns) |
| --- | ---: | ---: | ---: |
| Devirtualization.EqualityComparer.ValueTupleCompareWrapped | 3.01 | 4.76 | 14.33 |
| System.Memory.Constructors.ReadOnlySpanFromArray | 2.51 | 0.23 | 0.57 |
| System.Collections.Sort.Array(Size: 512) | 1.84 | 3742.65 | 6902.07 |
| System.Collections.IndexerSet.Span(Size: 512) | 1.58 | 168.02 | 265.51 |
| System.Collections.CopyTo.Span(Size: 2048) | 1.55 | 97.84 | 151.85 |
| System.Collections.CopyTo.ReadOnlySpan(Size: 2048) | 1.53 | 99.31 | 152.07 |
| System.Tests.Perf_String.IndexerCheckLengthHoisting | 1.44 | 25.86 | 37.18 |
| PerfLabTests.LowLevelPerf.GenericGenericMethod | 1.40 | 127311.62 | 178717.48 |
| System.Collections.IndexerSetReverse.Span(Size: 512) | 1.31 | 203.07 | 265.72 |
| System.MathBenchmarks.Double.Atan2 | 1.23 | 89569.64 | 109864.78 |

@AndyAyersMS
Member Author

Codegen is bad in the new test, with or without my change:

; Assembly listing for method GitHub_23438.Program:TestSse2X64StoreNonTemporal_Int64():bool
...
       4C0FC3042500000000   vmovnti  qword ptr [0000H], r8

@tannergooding any ideas? This came in with #23461; maybe some concurrent change regressed it?

@tannergooding
Member

I hadn't inspected the quality of the codegen for the new tests as they are just validating that they are properly covering 64-bit inputs.

The code is just doing the following, so I'm unsure why it would be generating a full addressing instruction...

    private static unsafe bool TestSse2X64StoreNonTemporal_Int64()
    {
        long result;
        Sse2.X64.StoreNonTemporal(&result, long.MaxValue);
        return AreEqual(long.MaxValue, result);
    }

The only thing I can think of, however, would be the addressing-mode changes @CarolEidt recently made, maybe?

@AndyAyersMS
Member Author

Looks like we're just dropping the address in codegen:

Generating: N004 (  3,  2) [000001] -c-----N----         t1 =    LCL_VAR_ADDR byref  V00 loc0          NA REG NA
Generating: N006 (  3, 10) [000003] ------------         t3 =    CNS_INT   long   0x7fffffffffffffff REG r8 $180
IN0001:        mov      r8, 0x7FFFFFFFFFFFFFFF
                                                             /--*  t1     byref  
                                                             +--*  t3     long   
Generating: N008 (  7, 14) [000004] ---XG-------              *  HWIntrinsic void   long StoreNonTemporal REG NA $1c0
IN0002:        vmovnti  qword ptr [0000H], r8

@CarolEidt

Hmmm ... wouldn't be at all surprising if those two commits crossed paths. I'll take a look.

@AndyAyersMS
Member Author

This is "special codegen" intrinsic and it looks like we expect the address to be in a register, but never put it there.

@tannergooding
Member

@CarolEidt, it is still calling emitIns_AR_R directly: https://github.com/dotnet/coreclr/pull/22944/files#diff-847442c7efdf7a78e78711bea30fd4b2R1674

I think you changed this to emitInsStoreInd elsewhere in the PR. We should probably double check all of the special-codegen cases.
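
If that diagnosis is right, the shape of the fix would be to consume the address operand into a register before the direct emitter call -- roughly this fragment (a guess for illustration, not the actual patch; genConsumeReg and emitIns_AR_R are real JIT helpers, but addrNode/dataReg/attr are assumed names):

    // Sketch only: make sure the address lands in a register first.
    regNumber addrReg = genConsumeReg(addrNode); // the step that appears missing
    emit->emitIns_AR_R(ins, attr, dataReg, addrReg, 0);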

@tannergooding
Member

tannergooding commented Mar 28, 2019

@AndyAyersMS
Member Author

I believe the various test failures are unrelated.

Have also tried running some ASP.Net scenarios and results seem to be a wash.

@AndyAyersMS
Member Author

Attempt to rerun some failed legs in coreclr-ci hit this error:

    ##[error]Artifact testbuild_Linux_x64_checked_innerloop_Logs already exists for build 135651.

Guess I will retrigger everything.

@AndyAyersMS AndyAyersMS reopened this Mar 28, 2019
@AndyAyersMS
Member Author

Going to close/reopen once more.

@AndyAyersMS AndyAyersMS closed this Apr 2, 2019
@AndyAyersMS AndyAyersMS reopened this Apr 2, 2019
@AndyAyersMS AndyAyersMS merged commit 900ae5b into dotnet:master Apr 2, 2019
@AndyAyersMS AndyAyersMS deleted the FrameZeroAccounting branch April 2, 2019 16:06
@AndyAyersMS
Member Author

Finally got a clean CI run....

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Stop double-counting some locals when computing how many slots need to be
zeroed in the prolog.

Bump threshold for rep stosb style init from 16 bytes to 32 bytes on x64.


Commit migrated from dotnet/coreclr@900ae5b