Optimize Span.Copy and Span.TryCopyTo #15947

Conversation
/cc @KrzysztofCwalina @jkotas @stephentoub

I accidentally changed some whitespace in the original Memmove routine. Ignore it; it won't be committed.
    return;
}

nuint byteCount = (nuint)elementsCount * (nuint)Unsafe.SizeOf<T>();
Are the special cases for single element and Unsafe.AreSame still worth it? I think it may be a good idea to move the AreSame calls into Memmove, drop the special case for single element, and maybe get rid of the CopyTo<T> helper by manually inlining it into the few call sites where it is used.
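For context, a hedged sketch of the special cases being debated here, with the shape inferred from the diff hunks quoted in this thread (`nuint` is corlib's internal native-word alias; the method name and everything not in the quoted diffs is illustrative):

```csharp
// Hypothetical shape of the fast-path checks under discussion (not the committed code).
internal static void CopyTo<T>(ref T destination, ref T source, nuint elementsCount)
{
    if (Unsafe.AreSame(ref destination, ref source))
        return; // perfectly overlapping buffers: nothing to copy

    if (elementsCount == 1)
    {
        destination = source; // single element: plain assignment avoids the Memmove call
        return;
    }

    nuint byteCount = elementsCount * (nuint)Unsafe.SizeOf<T>();
    // ... fall through into the real Memmove with byteCount ...
}
```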
@dotnet/jit-contrib This is using the internal …
src/mscorlib/src/System/Buffer.cs
Outdated
MCPY00:
    // Copy bytes which are multiples of 16 and leave the remainder for MCPY01 to handle.
nit: fix the spacing/indentation here and elsewhere (for the labels)
I accidentally changed some whitespace in the original Memmove routine. Ignore it; it won't be committed.
edit: Ah, ok. Ignore my comment.
ahsonkhan left a comment
lgtm
    Span.CopyTo<T>(ref destination.DangerousGetPinnableReference(), ref _pointer.Value, (nuint)_length);
    retVal = true;
}
return retVal;
This results in better code gen? If so, why can't the JIT just do that rather than requiring it be written one way or the other?
cc: @AndyAyersMS
Both of … However, we haven't seen much (any?) use of … So at a minimum we should also look at the codegen for Windows x86 and for Linux x64.
@stephentoub the jit is quite sensitive to details of the input IL. Having one return site versus two can lead to different decisions about inlining and the use of temps and code layout and so on. All compilers are somewhat sensitive to the shapes of their input, but I think it is fair to say that the jit is perhaps a bit too sensitive. In this particular case, if the goal is to convince the jit to automatically inline …
[Intrinsic]
[NonVersionable]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool IsBelow<T>(ref T left, ref T right)
Could you please update this to match https://github.com/dotnet/corefx/issues/26451#issuecomment-359887092
This PR no longer requires the IsBelow method; I'll remove the dependency.
src/mscorlib/src/System/Buffer.cs
Outdated
// Finally, apply DeMorgan's theorem:
// if ((src < destEnd) && (dest < srcEnd)) { /* buffers overlap */ }

if (Unsafe.IsBelow(ref src.Value, ref destEnd) && Unsafe.IsBelow(ref dest.Value, ref srcEnd))
This line is incorrect and will be reverted to its original implementation (or a higher-performance equivalent).
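For reference, the derivation the quoted comment alludes to, written out as a hedged pointer-based sketch (illustrative only; the PR's actual check operates on refs):

```csharp
// Two half-open buffers [src, srcEnd) and [dest, destEnd) are disjoint iff
//     srcEnd <= dest || destEnd <= src
// Negating and applying De Morgan's laws gives the overlap test:
//     overlap <=> !(srcEnd <= dest || destEnd <= src)
//             <=> srcEnd > dest && destEnd > src
//             <=> (src < destEnd) && (dest < srcEnd)
static unsafe bool Overlaps(byte* src, byte* srcEnd, byte* dest, byte* destEnd)
    => src < destEnd && dest < srcEnd;
```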
Great to see these perf improvements, @GrabYourPitchforks. Thanks.
@AndyAyersMS On Win x64 I'm seeing proper inlining and tail-calling as expected, and there's no stack spillage. However, on Win x86 I'm seeing unexpected stack spillage inside the …
Feel free to post some assembly snippets. x86 doesn't have very many allocatable registers, so it is not too surprising if there are extra uses of the stack compared to x64.
Also introduces an internal ref version of Buffer.Memmove
- Have single return from TryCopyTo
- Move 'nuint' parameters up the call stack for more efficient use of registers
I posted the disassembly of … I'll post the x86 disassembly separately so we can investigate it.
Memmove should handle this correctly
The JIT does the tailcall optimization only when it can prove that none of the addresses pointing to the local frame are used as arguments. I suspect that it is not able to prove that for the …
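A minimal example of the constraint being described (illustrative; not from the PR):

```csharp
static int Callee(ref int x) => x + 1;

static int Caller()
{
    int local = 42;
    // The address of 'local' lives on Caller's frame. If this call were turned
    // into a tail call (a jump), Caller's frame would be gone while Callee
    // still held a pointer into it, so the JIT must prove no frame address
    // escapes into the arguments before it can tail-call.
    return Callee(ref local);
}
```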
It may be a side-effect of …
This breaks the Span dependency and allows access by any internal caller, such as arrays or strings
Moving the multiplication after the (elementCount != 0) check allows codegen to reuse existing registers, saving a few mov operations
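A sketch of the reordering this commit message describes, assuming the shapes from the diff hunks above (the method names are hypothetical; only the before/after placement of the multiply is the point):

```csharp
// Before (illustrative): the multiply executes even when there is nothing to
// copy, and both elementCount and byteCount must stay live across the branch.
static void CopyBefore<T>(ref T dest, ref T src, nuint elementCount)
{
    nuint byteCount = elementCount * (nuint)Unsafe.SizeOf<T>();
    if (elementCount != 0)
    {
        // ... Memmove(ref dest, ref src, byteCount) ...
    }
}

// After: check first. The multiply only runs on the taken path, so codegen can
// overwrite the register holding elementCount instead of emitting extra movs.
static void CopyAfter<T>(ref T dest, ref T src, nuint elementCount)
{
    if (elementCount != 0)
    {
        nuint byteCount = elementCount * (nuint)Unsafe.SizeOf<T>();
        // ... Memmove(ref dest, ref src, byteCount) ...
    }
}
```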
@AndyAyersMS @jkotas And here's the x86 disassembly: original and modified. Ignore the comparisons at the very beginning and very end of the function as I've moved some things around a bit. In particular check out the parts around the … Note how the modified version keeps the two byrefs in stack slots ([ebp+8] and [ebp+0Ch]) and reloads and writes them back on every iteration instead of keeping them in registers.

Original (pointer-based):

53c15feb 8bd8             mov     ebx,eax
53c15fed c1eb06           shr     ebx,6
53c15ff0 f30f6f02         movdqu  xmm0,xmmword ptr [edx]
53c15ff4 f30f7f01         movdqu  xmmword ptr [ecx],xmm0
53c15ff8 f30f6f4210       movdqu  xmm0,xmmword ptr [edx+10h]
53c15ffd f30f7f4110       movdqu  xmmword ptr [ecx+10h],xmm0
53c16002 f30f6f4220       movdqu  xmm0,xmmword ptr [edx+20h]
53c16007 f30f7f4120       movdqu  xmmword ptr [ecx+20h],xmm0
53c1600c f30f6f4230       movdqu  xmm0,xmmword ptr [edx+30h]
53c16011 f30f7f4130       movdqu  xmmword ptr [ecx+30h],xmm0
53c16016 83c140           add     ecx,40h
53c16019 83c240           add     edx,40h

Modified (ref-based):

53b9cbf5 8bf1             mov     esi,ecx
53b9cbf7 c1ee06           shr     esi,6
53b9cbfa 8b7d08           mov     edi,dword ptr [ebp+8]
53b9cbfd 8b5d0c           mov     ebx,dword ptr [ebp+0Ch]
53b9cc00 f30f6f07         movdqu  xmm0,xmmword ptr [edi]
53b9cc04 f30f7f03         movdqu  xmmword ptr [ebx],xmm0
53b9cc08 f30f6f4710       movdqu  xmm0,xmmword ptr [edi+10h]
53b9cc0d f30f7f4310       movdqu  xmmword ptr [ebx+10h],xmm0
53b9cc12 f30f6f4720       movdqu  xmm0,xmmword ptr [edi+20h]
53b9cc17 f30f7f4320       movdqu  xmmword ptr [ebx+20h],xmm0
53b9cc1c f30f6f4730       movdqu  xmm0,xmmword ptr [edi+30h]
53b9cc21 f30f7f4330       movdqu  xmmword ptr [ebx+30h],xmm0
53b9cc26 8b7d0c           mov     edi,dword ptr [ebp+0Ch]
53b9cc29 83c740           add     edi,40h
53b9cc2c 897d0c           mov     dword ptr [ebp+0Ch],edi
53b9cc2f 8b7d08           mov     edi,dword ptr [ebp+8]
53b9cc32 83c740           add     edi,40h
53b9cc35 897d08           mov     dword ptr [ebp+8],edi
Note to reviewers: it's best to look at this commit-by-commit rather than file-by-file.

- 86b05d6 creates the ref-based …
- 77fad61 changes Span.CopyTo's workhorse method to take …
- b5f075f moves the "are these buffers actually the same?" check down deep into the guts of …
- b129535 removes the special-casing of empty and single-element buffers from Span.CopyTo, opting instead to let Memmove handle it. This makes processing longer input buffers faster, but I haven't yet measured the impact on empty and single-element buffers.
- fe2a593 refactors the Span.CopyTo logic into the Buffer class directly so that non-Span types can take advantage of it. The AggressiveInlining is there because the JIT wasn't inlining this routine into its caller, even though all the routine does is shift …
- f527fea optimizes the public Span.CopyTo routine to avoid checking the length twice.
- a9b61af improves the codegen for calling non-blittable memmove by pushing the multiplication / shift operation after the …
- 7aa2dba allows skipping the checked arithmetic that's inside the IntPtr explicit conversion operator and allows us to use unchecked arithmetic instead, saving some comparisons. This only affects x86.
With this PR, here's the final x64 codegen for a call to …

; edi := src.Length
; ebx := dst.Length
; rbp := dst.Pointer
; rsi := src.Pointer
00007ff7`cffc0cb9 3bfb             cmp     edi,ebx
00007ff7`cffc0cbb 0f879d000000     ja      ERROR_DEST_TOO_SMALL
00007ff7`cffc0cc1 4c63c7           movsxd  r8,edi
00007ff7`cffc0cc4 49c1e002         shl     r8,2   ; I'm using a Span<int> for this demo
00007ff7`cffc0cc8 488bcd           mov     rcx,rbp
00007ff7`cffc0ccb 488bd6           mov     rdx,rsi
00007ff7`cffc0cce e87d82345f       call    System_Private_CoreLib!System.Buffer.Memmove(System.ByReference`1<Byte>, ...)

All that happens at the call site is that the length is checked, and if the check passes the registers are lined up correctly and a call is made directly to the new …

For non-blittable types, the codegen is a little longer, but it's not too bad. We still get wins from removing all of the overhead that used to exist.

; rsp+78h := dest span
; rsp+88h := src span
00007ff7`cb750d69 488b4c2478       mov     rcx,qword ptr [rsp+78h]
00007ff7`cb750d6e 448b842480000000 mov     r8d,dword ptr [rsp+80h]
00007ff7`cb750d76 4439842490000000 cmp     dword ptr [rsp+90h],r8d
00007ff7`cb750d7e 0f87b3000000     ja      ERROR_DEST_TOO_SMALL
00007ff7`cb750d84 488b942488000000 mov     rdx,qword ptr [rsp+88h]
00007ff7`cb750d8c 448b842490000000 mov     r8d,dword ptr [rsp+90h]
00007ff7`cb750d94 4d63c0           movsxd  r8,r8d
00007ff7`cb750d97 483bca           cmp     rcx,rdx  ; bail if the buffers are identical
00007ff7`cb750d9a 740e             je      AFTER_CALL
00007ff7`cb750d9c 4d85c0           test    r8,r8    ; bail if src is empty
00007ff7`cb750d9f 7409             je      AFTER_CALL
00007ff7`cb750da1 49c1e003         shl     r8,3     ; I'm using Span<object> where references are 8 bytes
00007ff7`cb750da5 e8562ec25f       call    coreclr!MemoryNative::BulkMoveWithWriteBarrier (00007ff8`2b373c00)

The codegen for x86 and x64 is similar at the call site to Span.CopyTo/TryCopyTo. I've left separate comments detailing the codegen for the Memmove implementation.
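For a concrete picture of the call site the first listing corresponds to, a hedged example (element type and buffer lengths chosen arbitrarily):

```csharp
Span<int> src = stackalloc int[16];
Span<int> dst = stackalloc int[32];

// Per the listing above: one length check, then the registers are lined up
// and control transfers straight into the ref-based Buffer.Memmove.
src.CopyTo(dst);
```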
BTW one thing to consider is that if an exception is thrown (when using normal … Because I have PDBs available the stack trace correctly shows my calling routine, and line 31 is indeed where I have written …
Is there a good reason for that? It should be pretty straightforward to move the special case handling to BulkMoveWithWriteBarrier.
// check, and one for the result of TryCopyTo. Since these checks are equivalent,
// we can optimize by performing the check once ourselves then calling Memmove directly.

if ((uint)_length <= (uint)destination.Length)
You cannot use nuint for the portable span today. Anything that uses nuint should be under #if !FEATURE_PORTABLE_SPAN or in .Fast.cs files. You may want to copy the shared directory over to corefx and fix it up to make it build, so that it is prepared for mirroring.
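A small sketch of the guard being requested (the FEATURE_PORTABLE_SPAN symbol comes from the comment above; the fallback arm is an assumption about how a portable build might avoid nuint):

```csharp
#if !FEATURE_PORTABLE_SPAN
        // fast span: nuint is available, use native-width arithmetic
        nuint byteCount = (nuint)elementsCount * (nuint)Unsafe.SizeOf<T>();
#else
        // portable span: stick to types the downlevel build can compile
        ulong byteCount = (ulong)elementsCount * (ulong)Unsafe.SizeOf<T>();
#endif
```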
Copying the shared directory over still builds correctly on corefx, but that's because corefx has its own implementation of Span separate from the shared directory implementation. I opened https://github.com/dotnet/corefx/issues/26611 to track making changes to corefx to get it to consume the shared implementation.

In the meantime, per our separate conversation, we shouldn't let this block this PR.
src/mscorlib/src/System/Buffer.cs
Outdated
// Non-inlinable wrapper around the QCall that avoids polluting the fast path

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static void Memmove<T>(ref T destination, ref T source, int elementCount)
I would avoid this intermediate layer and move the nuint -> int cast to the caller.
Note that the JIT is not able to inline generic methods called from other generic methods in shared generic code today. This will avoid some of the problem.
Addressed in 39abfd0.
LGTM otherwise. Thanks!
It's not the special-case handling. For some reason my …
This should've set the value at the referenced address, not changed the reference
Take advantage of the fact that we're already computing srcEnd and destEnd to try to short-circuit the main overlap / disjoint check
More commits for reviewers: dac2cbd fixes a logic error where we were sometimes forgetting to copy a single byte from src to dest.
|
| if (!TryCopyTo(destination)) | ||
| // Using "if (!TryCopyTo(...))" results in two branches: one for the length | ||
| // check, and one for the result of TryCopyTo. Since these checks are equivalent, | ||
| // we can optimize by performing the check once ourselves then calling Memmove directly. |
Semi-related to this PR, since you've been looking at the perf a lot, I've had a bunch of cases where I need to use {Try}CopyTo and I know from previous checks that the destination is long enough. Given that, I generally preferred to use TryCopyTo and then either ignore the return value or assert that it's true, but that was when CopyTo called TryCopyTo and I could avoid the same extra branch you mention in the comment. Thoughts on the right approach here? Any sense of how much the length check costs, and whether there's any benefit in exposing an "unsafe" variant that skips the check entirely? Maybe for the uses in corelib I should just call the new memmove directly?
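As a concrete illustration of the pattern described above (a sketch; the spans and the earlier validation are assumed context):

```csharp
// Caller has already validated lengths elsewhere, e.g.:
//   if (destination.Length < source.Length) throw ...;

bool copied = source.TryCopyTo(destination);
Debug.Assert(copied); // assert rather than branch again on a check we know passes
```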
The length checks are cheap. I do not think we want to be avoiding them, even for corelib.
If there are situations where they are obviously redundant, we should make sure that the JIT is optimizing the redundancy out.
Opened https://github.com/dotnet/coreclr/issues/16042 to track removing redundant comparisons.
The proposed replacement logic didn't perform any better against a more realistic test battery
* Introduce a ref-based version of Buffer.Memmove
* Remove pinning logic from Span.CopyTo
* Tweak flow graph of Span.CopyTo / TryCopyTo in order to encourage better codegen
* Push some uncommon logic (one-element buffers, perfectly overlapping buffers) down to Memmove

Signed-off-by: dotnet-bot <[email protected]>

* Introduce a ref-based version of Buffer.Memmove
* Remove pinning logic from Span.CopyTo
* Tweak flow graph of Span.CopyTo / TryCopyTo in order to encourage better codegen
* Push some uncommon logic (one-element buffers, perfectly overlapping buffers) down to Memmove

Signed-off-by: dotnet-bot-corefx-mirror <[email protected]>
See #15076 for perf measurements.
There are four specific optimizations being considered here:

1. Tweaking the flow graph of TryCopyTo (e.g., single return statement) to encourage code gen to inline the method.
2. Removing the pinning logic from the CopyTo core implementation in order to avoid the overhead of setting up pinned locals on the stack. This also allows code gen to perform a tail call rather than a standard call into the core Memmove implementation.
3. Passing nuint as a parameter to the CopyTo core implementation to take advantage of TryCopyTo being inlined, allowing code gen to be more efficient with how it emits mov/movsxd instructions.
4. Introducing a ref-based Memmove implementation which is a sister to the pointer-based implementation. Right now only Span calls into this, but presumably other types like Array could be changed to do so. (A rough sketch of its shape follows the perf notes below.)

The perf measurements in #15076 show how some of these interplay with each other. For example, the test case which copies a Span<object> doesn't benefit from removing fixed (since code gen would've pruned that path), but it does benefit from the flow graph changes, so that can be used as an estimate for how much the flow graph changes alone contribute to the total performance profile of all test cases.

I've done some basic correctness tests but have not performed proper testing of the Memmove implementation. I've also not yet profiled this PR with overlapping buffers.
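As promised in item 4 above, a rough sketch of the ref-based entry point's shape, assembled from the diff hunks quoted earlier in the conversation (illustrative, not the committed code; using RuntimeHelpers.IsReferenceOrContainsReferences for the blittability split is an assumption):

```csharp
internal static void Memmove<T>(ref T destination, ref T source, int elementCount)
{
    if (!RuntimeHelpers.IsReferenceOrContainsReferences<T>())
    {
        // Blittable T: defer to the byte-based ref Memmove.
        Memmove(
            ref Unsafe.As<T, byte>(ref destination),
            ref Unsafe.As<T, byte>(ref source),
            (nuint)elementCount * (nuint)Unsafe.SizeOf<T>());
    }
    else
    {
        // T contains GC references: the copy must go through coreclr's
        // write-barrier-aware path (MemoryNative::BulkMoveWithWriteBarrier,
        // call elided here).
    }
}
```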
Finally, like the earlier Array.Reverse proposal, this leverages ByReference<T> to mimic a mutable ref. Presumably the ref reassignment feature of C# 7.3 will allow us to get rid of this. But the perf gain from using ByReference<T> in this manner might make it worthwhile to do this while waiting for the language feature to come online.
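To illustrate the trick the last paragraph describes, a hedged sketch (ByReference<T> is corlib-internal; `start`, `byteCount`, and the loop body are hypothetical):

```csharp
// Before C# 7.3's ref reassignment, a 'ref' local could not be repointed.
// Wrapping the ref in the internal ByReference<T> struct makes the wrapper
// itself an ordinary mutable value, so overwriting it "advances" the ref:
ByReference<byte> p = new ByReference<byte>(ref start);
for (nuint i = 0; i < byteCount; i += 16)
{
    // ... copy 16 bytes starting at p.Value ...
    p = new ByReference<byte>(ref Unsafe.Add(ref p.Value, 16));
}
```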