JIT: prolog zeroing accounting changes #23498
Conversation
Stop double-counting some locals when computing how many slots need to be zeroed in the prolog. Bump threshold for rep stosb style init from 16 bytes to 32 bytes on x64.
Follow-up to comments at the tail end of #23403. The upshot is that we will now use multiple stores to zero up to 32 bytes on a 64-bit platform instead of a loop. Still trying to get a fair evaluation of the perf impact. On a simple test like:

```cs
[MethodImpl(MethodImplOptions.NoInlining)]
private void Dummy(Span<byte> a, Span<byte> b) { }

[MethodImpl(MethodImplOptions.NoInlining)]
public void StackZero() { Span<byte> a, b; a = bytes; b = bytes; Dummy(a, b); }
```

it is substantial, about a 2x improvement. Wider tests (using the performance repo) show a lot of potential good and bad results (more good than bad); I am working through some of the cases and will summarize when I have something interesting to say.

Code size is generally improved. Regressions are from cases where we have large offsets that require multibyte encoding. Improvements are from not needing to preserve as many registers in some cases.

Not considering xmm init just yet, as that is a more involved change, and post 3.0 we might want to unify or at least share the prolog init strategy with other block inits we do. Also not (yet) considering trying to order the stores by increasing offset, or being smarter about non-initlocals cases where we have structs with non-gc fields.

cc @dotnet/jit-contrib
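For illustration, the tradeoff being tuned here can be sketched in C++ (hypothetical helper and constant names; the actual JIT emits machine instructions directly rather than calling a function): below the threshold, unrolled register-width stores beat a block init such as a `rep stosb` loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of the prolog-zeroing strategy choice: for small
// frames, use individual register-width (8-byte) stores; past the
// threshold, fall back to a block init (modeled here with memset, which
// stands in for a rep stos-style sequence).
constexpr size_t kBlockInitThreshold = 32; // bumped from 16 by this change

void ZeroFrameSlots(uint8_t* frame, size_t bytes)
{
    if (bytes <= kBlockInitThreshold)
    {
        // Unrolled 8-byte stores (assumes bytes is a multiple of 8,
        // since stack slots are register-width sized).
        for (size_t i = 0; i < bytes; i += sizeof(uint64_t))
        {
            uint64_t zero = 0;
            std::memcpy(frame + i, &zero, sizeof(zero));
        }
    }
    else
    {
        std::memset(frame, 0, bytes); // block init
    }
}
```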
src/jit/codegencommon.cpp
Outdated
```cpp
/* If we have more than 4 untracked locals, use block initialization */
/* TODO-Review: If we have large structs, bias toward not using block initialization since
                we waste all the other slots. Really need to compute the correct
                and compare that against zeroing the slots individually */
```
I do not quite understand this comment... Could you please improve it with more details if it is still useful?
I agree. I don't understand why slots need to be zeroed individually. This should just be linear stack space and it is all being zeroed, so you should be able to just zero-it however is most efficient for that size.
Not sure I completely understand it either -- I think the comment is saying we should do a more thorough examination of GC structs and bias towards field by field zeroing when there are large GC structs with relatively few GC fields (at least, when we're not also zeroing the non-gc fields).
If no one really understands it fully (I don't), then perhaps we should reword the comment to simply say what you just said.
briansull
left a comment
Looks Good
```cpp
if (!counted)
{
    initStkLclCnt += varDsc->lvStructGcCount;
}
```
You might want to set counted to true here.
Not really necessary, since the compiler will eliminate it as unused in the remainder of the loop body, but it might be prudent to go ahead and set it, with a comment, in the event that further changes are made at some point (that would cause us to want to query counted downstream).
CarolEidt
left a comment
LGTM
```cpp
for (varNum = 0, varDsc = compiler->lvaTable; varNum < compiler->lvaCount; varNum++, varDsc++)
{
    bool counted = false;
```
This should have a comment. It's not immediately clear that there are multiple ways that we might count a local as requiring initialization, so we don't want to count it twice.
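A minimal sketch of the pattern under discussion (hypothetical descriptor and field names, not the actual `LclVarDsc` layout): each local can qualify for prolog zeroing via more than one criterion, and the `counted` flag ensures it contributes to the total at most once, which is the double count this PR fixes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical local-variable descriptor; the real JIT's LclVarDsc has
// many more fields (lvMustInit, lvStructGcCount, ...).
struct Local
{
    bool mustInit;      // local explicitly requires zeroing
    int  structGcCount; // GC pointer slots in a struct local
};

// Count slots needing prolog zeroing, counting each local at most once.
int CountInitSlots(const std::vector<Local>& locals)
{
    int initStkLclCnt = 0;
    for (const Local& varDsc : locals)
    {
        // A local may qualify for initialization in multiple ways;
        // track whether it has already been counted.
        bool counted = false;
        if (varDsc.mustInit)
        {
            initStkLclCnt += 1; // simplified: one slot per must-init local
            counted = true;
        }
        // Without the `counted` check, a must-init GC struct would be
        // counted twice -- once above and once here.
        if (!counted)
        {
            initStkLclCnt += varDsc.structGcCount;
        }
    }
    return initStkLclCnt;
}
```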
Most significant improvements (from performance repo CoreCLR tests). Filtered out a few by hand that did not seem credible.

Still sorting through potential regressions...
```cpp
//
// Secondary factor is the presence of large structs that
// potentially only need some fields set to zero. We likely don't
// model this very well, but have left the logic as is for now.
```
This is great - much more meaningful and useful!
👍
```cpp
//
// Primary factor is the number of slots that need zeroing. We've
// been counting by sizeof(int) above. We assume for now we can
// only zero register width bytes per store.
```
Is this integer register width bytes per store (i.e. 32-bit or 64-bit) or does it also take into account other registers (such as xmm which is 128-bit or ymm which is 256-bit)?
Currently just integer registers.
xmm would likely offer further improvement, especially if there's no need for a cleanup store. But that would be a somewhat more involved change, and we're trying hard to close down jit work for 3.0. And if we get to that point, we should also think more about the strategy here and if or how we could share logic or code with other block inits we do.
For ymm, potential is less clear; the stack is only guaranteed to be 16 byte aligned.
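For illustration, a 16-byte SSE2 store zeroes two register-width slots per instruction; the aligned form requires 16-byte alignment, which is also why the ymm (32-byte) case is less clear given the stack's 16-byte alignment guarantee. A hedged sketch using standard intrinsics (the JIT would emit these stores directly, not call a helper):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <emmintrin.h> // SSE2 intrinsics

// Zero `bytes` (a multiple of 16) starting at 16-byte-aligned `p`,
// using 128-bit stores -- two register-width slots per instruction.
void ZeroWithXmm(uint8_t* p, size_t bytes)
{
    __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < bytes; i += 16)
    {
        // _mm_store_si128 requires a 16-byte-aligned address; an
        // unaligned fallback would use _mm_storeu_si128 instead.
        _mm_store_si128(reinterpret_cast<__m128i*>(p + i), zero);
    }
}
```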
Might be worth logging a tracking issue for future investigation.
I was mostly just wondering since this just says "register width" and the clarification is useful 😄
Think we can use #19076 to track further improvements.
Also perhaps #13827.
Newly added
Finished looking at a bunch of potential regressions. None of them seem to hold up under scrutiny -- many look the same or were improvements on rerun, and the remaining few that still looked worse did not have codegen diffs (one seemed to be a 32-bit alignment issue, another had GC kicking in). Will list them here in the interest of full disclosure.
Codegen is bad in the new test, with or without my change: @tannergooding, any ideas? This came in with #23461; maybe some concurrent change regressed it?
I hadn't inspected the quality of the codegen for the new tests, as they are just validating that they properly cover 64-bit inputs. The code is just doing the following, so I'm unsure why it would be generating a full addressing instruction...

```cs
private static unsafe bool TestSse2X64StoreNonTemporal_Int64()
{
    long result;
    Sse2.X64.StoreNonTemporal(&result, long.MaxValue);
    return AreEqual(long.MaxValue, result);
}
```

The only thing I can think of, however, would be the addressing mode changes @CarolEidt recently made, maybe?
Looks like we're just dropping the address in codegen:
Hmmm ... wouldn't be at all surprising if those two commits crossed paths. I'll take a look.
This is a "special codegen" intrinsic, and it looks like we expect the address to be in a register but never put it there.
@CarolEidt, it is still calling I think you changed this to
Ignoring the general-purpose handling (such as that in
I believe the various test failures are unrelated. Have also tried running some ASP.NET scenarios, and results seem to be a wash.
Attempt to rerun some failed legs in coreclr-ci hit this error: Guess I will retrigger everything.
Going to close/reopen once more. |
Finally got a clean CI run.... |
Stop double-counting some locals when computing how many slots need to be zeroed in the prolog. Bump threshold for rep stosb style init from 16 bytes to 32 bytes on x64. Commit migrated from dotnet/coreclr@900ae5b