
Conversation

@AndyAyersMS
Member

Stop double-counting some locals when computing how many slots need to be
zeroed in the prolog.

Bump threshold for rep stosb style init from 16 bytes to 32 bytes on x64.

@AndyAyersMS
Member Author

Follow-up to comments at tail end of #23403.

Upshot is that we will now use multiple stores to zero up to 32 bytes on a 64-bit platform, instead of a loop or rep stos.
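
To make the tradeoff concrete, here is a minimal sketch of the new heuristic in C++ (ZeroFrameSlots and its shape are illustrative only, not the JIT's actual code; assumes the region size is a multiple of 8):

    #include <cstdint>
    #include <cstring>

    void ZeroFrameSlots(uint8_t* lo, uint8_t* hi)
    {
        size_t size = static_cast<size_t>(hi - lo);
        if (size <= 32)
        {
            // Below the threshold the JIT now emits individual
            // register-width stores (mov qword ptr [fp+off], zeroReg).
            for (uint8_t* p = lo; p < hi; p += sizeof(uint64_t))
            {
                std::memset(p, 0, sizeof(uint64_t)); // one 8-byte store
            }
        }
        else
        {
            // Larger regions still get a block init (rep stos or a loop).
            std::memset(lo, 0, size);
        }
    }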

Still trying to get a fair evaluation of the perf impact. On a simple test like:

    [MethodImpl(MethodImplOptions.NoInlining)]
    private void Dummy(Span<byte> a, Span<byte> b) { }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public void StackZero() { Span<byte> a, b; a = bytes; b = bytes; Dummy(a, b); }

it is substantial, about a 2x improvement. Wider tests (using the performance repo) show a mix of good and bad results (more good than bad); I am working through some of the cases and will summarize when I have something interesting to say.

Code size is generally improved:

Total bytes of diff: -20050 (-0.05% of base)
    diff is an improvement.

Top file regressions by size (bytes):
         122 : CommandLine.dasm (0.03% of base)
          45 : NuGet.Packaging.dasm (0.03% of base)
          19 : System.ComponentModel.Annotations.dasm (0.04% of base)
          11 : Microsoft.DotNet.InternalAbstractions.dasm (0.22% of base)
           6 : NuGet.Packaging.Core.Types.dasm (0.04% of base)

Top file improvements by size (bytes):
       -3849 : System.Private.CoreLib.dasm (-0.09% of base)
       -3071 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.05% of base)
       -2137 : Microsoft.CodeAnalysis.CSharp.dasm (-0.05% of base)
       -1266 : Microsoft.CodeAnalysis.dasm (-0.07% of base)
        -745 : System.Collections.Immutable.dasm (-0.07% of base)

98 total files with size differences (84 improved, 14 regressed), 31 unchanged.

Top method regressions by size (bytes):
          66 ( 2.10% of base) : System.Linq.Parallel.dasm - NullableDecimalAverageAggregationOperatorEnumerator`1:MoveNextCore(byref):bool:this (7 methods)
          66 ( 2.18% of base) : System.Linq.Parallel.dasm - NullableDecimalSumAggregationOperatorEnumerator`1:MoveNextCore(byref):bool:this (7 methods)
          14 ( 0.30% of base) : CommandLine.dasm - Result`2:ToString():ref:this (7 methods)
          14 ( 0.41% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - MethodToClassRewriter`1:VisitMethodSymbol(ref):ref:this (7 methods)
          14 ( 0.23% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - SynthesizedWrapperMethod:.ctor(ref,ref,ref,ref):this (7 methods)

Top method improvements by size (bytes):
         -96 (-1.01% of base) : System.Private.CoreLib.dasm - MemoryExtensions:TrimEnd(struct,struct):struct (29 methods)
         -91 (-9.35% of base) : Microsoft.DotNet.Cli.Utils.dasm - Result`1:AsValue(ubyte):struct:this (7 methods)
         -91 (-9.29% of base) : Microsoft.DotNet.Cli.Utils.dasm - Result`1:AsValue(short):struct:this (7 methods)
         -91 (-9.63% of base) : Microsoft.DotNet.Cli.Utils.dasm - Result`1:AsValue(int):struct:this (7 methods)
         -91 (-9.56% of base) : Microsoft.DotNet.Cli.Utils.dasm - Result`1:AsValue(double):struct:this (7 methods)

Top method regressions by size (percentage):
          11 ( 3.90% of base) : System.Collections.dasm - TreeSubSet:get_MinInternal():struct:this
          11 ( 3.90% of base) : System.Collections.dasm - TreeSubSet:get_MaxInternal():struct:this
           6 ( 3.49% of base) : Microsoft.DotNet.Cli.Utils.dasm - DebugHelper:WaitForDebugger()
           5 ( 3.11% of base) : Microsoft.DotNet.InternalAbstractions.dasm - RuntimeEnvironment:GetRIDArch():ref
          11 ( 2.65% of base) : System.Net.Security.dasm - SslStreamPal:AcquireCredentialsHandle(ref,int,int,bool):ref

Top method improvements by size (percentage):
         -17 (-24.29% of base) : System.Security.Cryptography.Algorithms.dasm - AlgorithmIdentifierAsn:HasNullEquivalentParameters():bool:this
         -17 (-24.29% of base) : System.Security.Cryptography.Cng.dasm - AlgorithmIdentifierAsn:HasNullEquivalentParameters():bool:this
         -17 (-23.61% of base) : System.Private.CoreLib.dasm - TimeSpanParse:TryParseTimeSpanConstant(struct,byref):bool
         -12 (-17.65% of base) : System.Data.Common.dasm - SqlString:ToSqlBoolean():struct:this
         -12 (-17.65% of base) : System.Data.Common.dasm - SqlString:ToSqlByte():struct:this

4516 total methods with size differences (3603 improved, 913 regressed), 195607 unchanged.

Regressions are from cases where the individual stores end up using large frame offsets that require longer, multi-byte displacement encodings. Improvements are from not needing to preserve as many registers in some cases (the rep stos sequence has to set up rdi, rcx, and rax).

Not considering xmm init just yet, as that is a more involved change, and post-3.0 we might want to unify (or at least share) the prolog-init strategy with initBlk.

Also not (yet) considering ordering the stores by increasing offset, or being smarter about non-initlocals cases where we have structs with non-GC fields.

cc @dotnet/jit-contrib

    /* If we have more than 4 untracked locals, use block initialization */
    /* TODO-Review: If we have large structs, bias toward not using block initialization since
       we waste all the other slots. Really need to compute the correct
       and compare that against zeroing the slots individually */

I do not quite understand this comment... Could you please improve it with more details if it is still useful?

Member


I agree. I don't understand why slots need to be zeroed individually. This should just be linear stack space, and it is all being zeroed, so you should be able to zero it however is most efficient for that size.

Member Author


Not sure I completely understand it either -- I think the comment is saying we should do a more thorough examination of GC structs and bias towards field-by-field zeroing when there are large GC structs with relatively few GC fields (at least when we're not also zeroing the non-GC fields).


If no one really understands it fully (I don't), then perhaps we should reword the comment to simply say what you just said.


@briansull left a comment


Looks Good

    if (!counted)
    {
        initStkLclCnt += varDsc->lvStructGcCount;
    }


You might want to set counted to true here.


Not really necessary, since the compiler will eliminate it as unused in the remainder of the loop body, but it might be prudent to go ahead and set it, with a comment, in the event that further changes are made at some point (that would cause us to want to query counted downstream).
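
For readers following along, here is a self-contained sketch of the counting pattern this thread discusses (FakeLcl and CountInitSlots are made up for illustration; in the JIT the loop walks lvaTable and the GC count comes from lvStructGcCount). It includes the suggested defensive counted = true:

    struct FakeLcl
    {
        bool     mustInitWholeLocal; // zeroed as an untracked stack local
        unsigned slotCount;          // size in sizeof(int) slots
        unsigned structGcCount;      // GC fields needing zeroing
    };

    unsigned CountInitSlots(const FakeLcl* lcls, unsigned count)
    {
        unsigned initStkLclCnt = 0;
        for (unsigned i = 0; i < count; i++)
        {
            const FakeLcl& lcl     = lcls[i];
            bool           counted = false;

            if (lcl.mustInitWholeLocal)
            {
                initStkLclCnt += lcl.slotCount;
                counted = true; // set even though unused below, per the suggestion
            }

            // The fix: only add the GC-field count if the local was not
            // already counted in full above (previously it was added
            // unconditionally, double-counting those slots).
            if (!counted)
            {
                initStkLclCnt += lcl.structGcCount;
            }
        }
        return initStkLclCnt;
    }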


@CarolEidt left a comment


LGTM


    for (varNum = 0, varDsc = compiler->lvaTable; varNum < compiler->lvaCount; varNum++, varDsc++)
    {
        bool counted = false;


This should have a comment. It's not immediately clear that there are multiple ways that we might count a local as requiring initialization, so we don't want to count it twice.


@AndyAyersMS
Member Author

Most significant improvements (from performance repo CoreCLR tests). Filtered out a few by hand that did not seem credible.

| Benchmark | Faster base/diff | Base Median (ns) | Diff Median (ns) |
| --- | ---: | ---: | ---: |
| System.Memory.MemoryMarshal<Int32>.TryGetArray | 2.78 | 9.63 | 3.47 |
| System.Memory.MemoryMarshal<Byte>.TryGetArray | 2.21 | 9.81 | 4.44 |
| System.Tests.Perf_String.Insert(s1: "Test", i: 2, s2: " Test") | 1.92 | 23.70 | 12.36 |
| System.Collections.IndexerSet.Array(Size: 512) | 1.64 | 278.27 | 169.46 |
| System.Tests.Perf_String.Insert(s1: "dzsdzsDDZSDZSDZSddsz", i: 7, s2: "Test") | 1.63 | 22.64 | 13.90 |
| System.Collections.CopyTo<Int32>.ReadOnlyMemory(Size: 2048) | 1.58 | 157.75 | 99.91 |

Still sorting through potential regressions...

    //
    // Secondary factor is the presence of large structs that
    // potentially only need some fields set to zero. We likely don't
    // model this very well, but have left the logic as is for now.


This is great - much more meaningful and useful!



👍

    //
    // Primary factor is the number of slots that need zeroing. We've
    // been counting by sizeof(int) above. We assume for now we can
    // only zero register width bytes per store.
Member


Is this integer register width bytes per store (i.e. 32-bit or 64-bit) or does it also take into account other registers (such as xmm which is 128-bit or ymm which is 256-bit)?

Member Author


Currently just integer registers.

xmm would likely offer further improvement, especially if there's no need for a cleanup store. But that would be a somewhat more involved change, and we're trying hard to close down jit work for 3.0. If we get to that point, we should also think more about the strategy here and whether (and how) we could share logic or code with the other block inits we do.

For ymm, potential is less clear; the stack is only guaranteed to be 16 byte aligned.
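
For illustration, a hedged sketch of what 16-byte xmm zeroing could look like (ZeroWithXmm is hypothetical; this PR does not emit SIMD stores). The aligned stores rely on exactly the 16-byte stack alignment guarantee mentioned above:

    #include <emmintrin.h>
    #include <cstddef>

    // Hypothetical SSE2 frame zeroing; assumes dst is 16-byte aligned and
    // size is a multiple of 16, matching the stack alignment guarantee.
    void ZeroWithXmm(void* dst, size_t size)
    {
        __m128i zero = _mm_setzero_si128();
        for (size_t i = 0; i < size; i += 16)
        {
            _mm_store_si128(reinterpret_cast<__m128i*>(static_cast<char*>(dst) + i), zero);
        }
    }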

Member


Might be worth logging a tracking issue for future investigation.

I was mostly just wondering since this just says "register width" and the clarification is useful 😄

Member Author


Think we can use #19076 to track further improvements.

Member Author


Also perhaps #13827.

@AndyAyersMS
Member Author

Newly added JIT_HardwareIntrinsics._X86_Regression_GitHub_23438 is failing for me in CI; investigating.

@AndyAyersMS
Member Author

Finished looking at a bunch of potential regressions. None of them seem to hold up under scrutiny -- many look the same or were improvements on rerun, and the remaining few that still looked worse did not have codegen diffs (one seemed to be a 32-bit alignment issue; another had GC kicking in). Will list them here in the interest of full disclosure.

| Benchmark | Slower diff/base | Base Median (ns) | Diff Median (ns) |
| --- | ---: | ---: | ---: |
| Devirtualization.EqualityComparer.ValueTupleCompareWrapped | 3.01 | 4.76 | 14.33 |
| System.Memory.Constructors.ReadOnlySpanFromArray | 2.51 | 0.23 | 0.57 |
| System.Collections.Sort.Array(Size: 512) | 1.84 | 3742.65 | 6902.07 |
| System.Collections.IndexerSet.Span(Size: 512) | 1.58 | 168.02 | 265.51 |
| System.Collections.CopyTo.Span(Size: 2048) | 1.55 | 97.84 | 151.85 |
| System.Collections.CopyTo.ReadOnlySpan(Size: 2048) | 1.53 | 99.31 | 152.07 |
| System.Tests.Perf_String.IndexerCheckLengthHoisting | 1.44 | 25.86 | 37.18 |
| PerfLabTests.LowLevelPerf.GenericGenericMethod | 1.40 | 127311.62 | 178717.48 |
| System.Collections.IndexerSetReverse.Span(Size: 512) | 1.31 | 203.07 | 265.72 |
| System.MathBenchmarks.Double.Atan2 | 1.23 | 89569.64 | 109864.78 |

@AndyAyersMS
Member Author

Codegen is bad in the new test, with or without my change:

; Assembly listing for method GitHub_23438.Program:TestSse2X64StoreNonTemporal_Int64():bool
...
       4C0FC3042500000000   vmovnti  qword ptr [0000H], r8

@tannergooding any ideas? This came in with #23461; maybe some concurrent change regressed it?

@tannergooding
Member

I hadn't inspected the quality of the codegen for the new tests as they are just validating that they are properly covering 64-bit inputs.

The code is just doing the following, so I'm unsure why it would be generating a full addressing instruction...

    private static unsafe bool TestSse2X64StoreNonTemporal_Int64()
    {
        long result;
        Sse2.X64.StoreNonTemporal(&result, long.MaxValue);
        return AreEqual(long.MaxValue, result);
    }

The only thing I can think of, however, would be the addressing-mode changes @CarolEidt recently made, maybe?

@AndyAyersMS
Member Author

Looks like we're just dropping the address in codegen:

Generating: N004 (  3,  2) [000001] -c-----N----         t1 =    LCL_VAR_ADDR byref  V00 loc0          NA REG NA
Generating: N006 (  3, 10) [000003] ------------         t3 =    CNS_INT   long   0x7fffffffffffffff REG r8 $180
IN0001:        mov      r8, 0x7FFFFFFFFFFFFFFF
                                                             /--*  t1     byref  
                                                             +--*  t3     long   
Generating: N008 (  7, 14) [000004] ---XG-------              *  HWIntrinsic void   long StoreNonTemporal REG NA $1c0
IN0002:        vmovnti  qword ptr [0000H], r8

@CarolEidt

Hmmm ... wouldn't be at all surprising if those two commits crossed paths. I'll take a look.

@AndyAyersMS
Member Author

This is "special codegen" intrinsic and it looks like we expect the address to be in a register, but never put it there.

@tannergooding
Member

@CarolEidt, it is still calling emitIns_AR_R directly: https://github.com/dotnet/coreclr/pull/22944/files#diff-847442c7efdf7a78e78711bea30fd4b2R1674

I think you changed this to emitInsStoreInd elsewhere in the PR. We should probably double check all of the special-codegen cases.
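
If that diagnosis is right, the shape of the fix would be to consume the address operand into a register before the direct emitter call -- roughly this fragment (a guess for illustration, not the actual patch; genConsumeReg and emitIns_AR_R are real JIT helpers, but addrNode/dataReg/attr are assumed names):

    // Sketch only: make sure the address lands in a register first.
    regNumber addrReg = genConsumeReg(addrNode); // the step that appears missing
    emit->emitIns_AR_R(ins, attr, dataReg, addrReg, 0);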

@tannergooding
Member

tannergooding commented Mar 28, 2019

@AndyAyersMS
Member Author

I believe the various test failures are unrelated.

Have also tried running some ASP.Net scenarios and results seem to be a wash.

@AndyAyersMS
Member Author

Attempt to rerun some failed legs in coreclr-ci hit this error:

    ##[error]Artifact testbuild_Linux_x64_checked_innerloop_Logs already exists for build 135651.

Guess I will retrigger everything.

@AndyAyersMS AndyAyersMS reopened this Mar 28, 2019
@AndyAyersMS
Member Author

Going to close/reopen once more.

@AndyAyersMS AndyAyersMS closed this Apr 2, 2019
@AndyAyersMS AndyAyersMS reopened this Apr 2, 2019
@AndyAyersMS AndyAyersMS merged commit 900ae5b into dotnet:master Apr 2, 2019
@AndyAyersMS AndyAyersMS deleted the FrameZeroAccounting branch April 2, 2019 16:06
@AndyAyersMS
Member Author

Finally got a clean CI run....

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Stop double-counting some locals when computing how many slots need to be
zeroed in the prolog.

Bump threshold for rep stosb style init from 16 bytes to 32 bytes on x64.


Commit migrated from dotnet/coreclr@900ae5b