-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Propagate preferences #19429
Propagate preferences #19429
Conversation
|
@dotnet-bot test Windows_NT x64 perf |
|
@dotnet-bot test Windows_NT x64 throughput |
fef89cb to
03773a6
Compare
|
@dotnet-bot test Windows_NT x64 perf |
|
@dotnet-bot test Windows_NT x64 perf |
|
This would resolve the issue with wrappers in https://github.com/dotnet/coreclr/issues/18542? (The pointer sized ones) |
It improves them; it eliminates the ping-pong copies (five copies removed from the |
|
@dotnet-bot test Windows_NT x64 Checked CoreFX Tests |
|
@dotnet-bot test Windows_NT x64 perf |
58cd07b to
5a42b08
Compare
|
@dotnet-bot test Windows_NT x64 perf |
|
I've yet to download your changes and see the diffs but FWIW cases like |
|
test OSX10.12 x64 Checked Innerloop Build and Test |
|
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test |
|
test OSX10.12 x64 Checked Innerloop Build and Test test Windows_NT x64 perf |
|
test Windows_NT x64 perf |
|
@dotnet-bot test Windows_NT x64 perf |
|
Curiosity got the better of me.... Looking at In the largest non-generic - lea ecx, [rdi-8]
- mov edi, ecx
- lea rcx, bword ptr [rsi+8]
- mov rsi, rcx
- mov rcx, rsi
- mov rsi, rcx
+ add edi, -8
+ add rsi, 8Which is similar to the peephole optimize #21959 but for better reasons? /cc @AndyAyersMS Top regression Starts with a reduction, which has a knock on repeated increase from an argument reversal? - xorps xmm1, xmm1
+ xorps xmm7, xmm7
- movaps xmm7, xmm0
- addsd xmm7, xmm1
+ addsd xmm7, xmm0
; repeated change
mulsd xmm0, qword ptr [reloc @RWD112]
- addsd xmm7, xmm0
+ addsd xmm0, xmm7
+ movaps xmm7, xmm0
movaps xmm0, xmm6
mulsd xmm0, qword ptr [reloc @RWD120]More interesting to me are async and span, so I had a look at the regressions there AsyncTaskMethodBuilder`1:GetStateMachineBox(byref):ref:this just seems a change in register for 1 byte (x 26 methods) - mov dword ptr [rax+52], 0xD1FFAB1E
mov r14, rax
+ mov dword ptr [r14+52], 0xD1FFAB1EHad a look at the regression in ReadOnlySpan`1:op_Implicit(struct):struct, which doesn't seem significant lea -> add + mov G_M62951_IG03:
xor r9, r9
- xor edx, edx
+ xor r10d, r10d
jmp SHORT G_M62951_IG06
G_M62951_IG04:
mov r9d, r8d
mov r10d, edx
add r9, r10
mov r10d, dword ptr [rax+8]
mov r10d, r10d
cmp r9, r10
ja SHORT G_M62951_IG09
G_M62951_IG05:
add rax, 16
- movsxd r8, r8d
- lea r9, bword ptr [r8+rax]
+ movsxd r9, r8d
+ add r9, rax
+ mov r10d, edx
G_M62951_IG06:
mov bword ptr [rcx], r9
- mov dword ptr [rcx+8], edx
+ mov dword ptr [rcx+8], r10d
mov rax, rcxThe ReadOnlySpan`1:get_Item(struct):struct:this + push r14
+ pop r14
call [ReadOnlySpan`1:Slice(int,int):struct:this]
mov rax, rdi
- ; Total bytes of code 119, prolog size 13 for method ReadOnlySpan`1:get_Item(struct):struct:this
+ ; Total bytes of code 126, prolog size 15 for method ReadOnlySpan`1:get_Item(struct):struct:thisAnd improvements AsyncMethodBuilderCore:Start(byref) loses a register -; Lcl frame size = 64
+; Lcl frame size = 72
G_M39858_IG01:
push rbp
- push rdi
push rsi
- sub rsp, 64
+ sub rsp, 72
...
G_M39858_IG08:
- lea rsp, [rbp-10H]
+ lea rsp, [rbp-08H]
pop rsi
- pop rdi
pop rbp
ret OTOH MemoryStream:ReadAsync(ref,int,int,struct):ref:this gains a register push rbp
+ push r14
push rdi
push rsi
push rbx
sub rsp, 48ConfiguredValueTaskAwaiter:....AwaitUnsafeOnCompleted(ref):this which was one of the things I was looking at in https://github.com/dotnet/coreclr/issues/18542 improves nicely - test rax, rax
+ mov rcx, rax
+ test rcx, rcx
- mov rcx, rax
- mov rcx, gword ptr [rax+0488H]
+ mov rdx, gword ptr [rax+0488H]
- movsx rdx, word ptr [rsi+8]
+ movsx r9, word ptr [rsi+8]
+ mov rcx, rbx
+ mov r8, rdi
- mov r8, rdi
- mov r9d, edx
- mov gword ptr [rsp+28H], rcx
- mov gword ptr [rsp+28H], rcx
- mov r8, rdi
- mov r9d, ed
- mov rcx, rbx
- mov rdx, gword ptr [rsp+28H]
- ; Total bytes of code 191, prolog size 7 for method ConfiguredValueTaskAwaiter:System.Runtime.CompilerServices.IStateMachineBoxAwareAwaiter.AwaitUnsafeOnCompleted(ref):this
- ; Total bytes of code 167, prolog size 7 for method ConfiguredValueTaskAwaiter:System.Runtime.CompilerServices.IStateMachineBoxAwareAwaiter.AwaitUnsafeOnCompleted(ref):thisAdding "Peephole optimize redundant and shuffle moves" #21959 on top still gives further improvements (only looking at |
|
Any news about this one? It eliminates lot of the ping-pong copies I happen to see quite often. |
|
@CarolEidt I am working on a few optimizations suffering from this. Is there any ETA for this PR to enter into daily build? |
|
Sorry for the delays - my priorities are elsewhere at the moment. The previous perf runs showed some unfortunate impacts on Span benchmarks, which I believe I've addressed but I need to make sure I've got those changes pushed, and re-run the perf tests. I'll attempt to get that done ASAP. |
Instead of selecting a single relatedInterval for preferencing, follow the chain of relatedIntervals (preferenced intervals). Change when preferences are propagated to the relatedInterval; do it in allocateRegisters, so that we can update it when we see a last use. Also tune when and how intervals are preferenced, including allowing multiple uses on an RMW node to have the target as their preference. Fixes #11463 Contributes to #16359
|
test Windows_NT x64 perf |
|
@dotnet/jit-contrib PTAL X86 windows: |
|
Once I've got a couple of approvals on this, I plan to run a bunch of jitStressRegs legs. |
|
@CarolEidt perf failure is being tracked by #22179 |
|
@adiaaida - thanks; I'm proposing we skip the perf tests and go ahead and merge this once reviewed. |
AndyAyersMS
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good so far. I want to go over it once more.
src/jit/lsra.h
Outdated
| { | ||
| return this; | ||
| } | ||
| // if (nextRefPosition->lastUse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this if should be removed or uncommented... ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the if (and the dead code below it) and added a comment.
Once this is merged, I'll file a new issue to further tune this - specifically in this area and also wrt the spills in Span.IndexerBench.
AndyAyersMS
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked things over again... still looks good. Take a look at the one comment from before.
|
test Windows_NT x64 Checked Innerloop Build and Test |
* Propagate preferences Instead of selecting a single relatedInterval for preferencing, follow the chain of relatedIntervals (preferenced intervals). Change when preferences are propagated to the relatedInterval; do it in allocateRegisters, so that we can update it when we see a last use. Also tune when and how intervals are preferenced, including allowing multiple uses on an RMW node to have the target as their preference. Fixes dotnet/coreclr#11463 Contributes to dotnet/coreclr#16359 Commit migrated from dotnet/coreclr@0f8b7e8
Instead of selecting a single relatedInterval for preferencing,
follow the chain of relatedIntervals (preferenced intervals).
Change when preferences are propagated to the relatedInterval;
do it in allocateRegisters, so that we can update it when we see a last use.
Also tune when and how intervals are preferenced, including allowing multiple
uses on an RMW node to have the target as their preference.
Fixes #11463
Contributes to #16359