This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@GrabYourPitchforks (Member) commented Jan 20, 2018

See #15076 for perf measurements.

There are four specific optimizations being considered here.

  1. Tweak the flow graph of TryCopyTo (e.g., single return statement) to encourage code gen to inline the method.
  2. Remove the fixed statement from the CopyTo core implementation in order to avoid the overhead of setting up pinned locals on the stack. This also allows code gen to perform a tail call rather than a standard call into the core Memmove implementation.
  3. Accept nuint as a parameter to the CopyTo core implementation to take advantage of TryCopyTo being inlined, allowing code gen to be more efficient with how it emits mov / movsxd instructions.
  4. Create a ref-based Memmove implementation which is a sister to the pointer-based implementation. Right now only Span calls into this, but presumably other types like Array could be changed to do so.

The perf measurements in #15076 show how some of these interplay with each other. For example, the test case which copies a Span<object> doesn't benefit from removing fixed (since code gen would've pruned that path), but it does benefit from the flow graph changes, so that can be used as an estimate for how much the flow graph changes alone contribute to the total performance profile of all test cases.

I've done some basic correctness tests but have not performed proper testing of the Memmove implementation. I've also not yet profiled this PR with overlapping buffers.

Finally, like the earlier Array.Reverse proposal, this leverages ByReference<T> to mimic a mutable ref. Presumably the ref reassignment feature of C# 7.3 will allow us to get rid of this. But the perf gain from using ByReference<T> in this manner might make it worthwhile to do this while waiting for the language feature to come online.
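
For context, here is a minimal sketch of the ByReference<T> pattern described above. The CopyBytes helper and its byte-at-a-time body are illustrative only, not the PR's actual code; ByReference<T> is CoreLib's internal intrinsic whose Value property re-exposes the wrapped ref, and nuint is CoreLib's alias for the native-width unsigned integer.

internal static void CopyBytes(ref byte destination, ref byte source, nuint byteCount)
{
    // Wrap the incoming refs; constructing a new wrapper from an advanced ref
    // stands in for the ref reassignment that C# 7.3 will allow directly.
    ByReference<byte> dest = new ByReference<byte>(ref destination);
    ByReference<byte> src = new ByReference<byte>(ref source);

    for (nuint i = 0; i < byteCount; i++)
    {
        dest.Value = src.Value;                                            // copy one byte
        dest = new ByReference<byte>(ref Unsafe.Add(ref dest.Value, 1));   // advance dest
        src = new ByReference<byte>(ref Unsafe.Add(ref src.Value, 1));     // advance src
    }
}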

@GrabYourPitchforks (Member, Author) commented Jan 20, 2018

/cc @KrzysztofCwalina @jkotas @stephentoub

I accidentally changed some whitespace in the original Memmove routine. Ignore it; it won't be committed.

return;
}

nuint byteCount = (nuint)elementsCount * (nuint)Unsafe.SizeOf<T>();
Member

Are the special cases for single element and Unsafe.AreSame still worth it? I think it may be a good idea to move the AreSame calls into Memmove, drop the special case for single element, and maybe get rid of the CopyTo<T> helper by manually inlining it into the few call sites where it is used.
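
A rough sketch of the shape being suggested, with the special cases folded into the ref-based Memmove itself (hypothetical code, not the reviewer's exact proposal; nuint is CoreLib's native-width unsigned alias):

internal static void Memmove(ref byte dest, ref byte src, nuint byteCount)
{
    // Identical buffers (or nothing to copy) can be handled here once, instead of
    // special-casing them at every call site such as Span.CopyTo.
    if (byteCount == 0 || Unsafe.AreSame(ref dest, ref src))
        return;

    // ... bulk copy loops (MCPY00 / MCPY01 / ...) follow ...
}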

@jkotas (Member) commented Jan 20, 2018

@dotnet/jit-contrib This uses the internal ByReference<T> intrinsic type and its methods in a new pattern that they have not been used in before. Is there any additional review or testing to do to ensure that it works well?

MCPY00:
// Copy bytes which are multiples of 16 and leave the remainder for MCPY01 to handle.
@ahsonkhan commented Jan 20, 2018

nit: fix the spacing/indentation here and elsewhere (for the labels)

> I accidentally changed some whitespace in the original Memmove routine. Ignore it; it won't be committed.

edit: Ah, ok. Ignore my comment.

@ahsonkhan left a comment

lgtm

Span.CopyTo<T>(ref destination.DangerousGetPinnableReference(), ref _pointer.Value, (nuint)_length);
retVal = true;
}
return retVal;
Member

This results in better code gen? If so, why can't the JIT just do that rather than requiring it be written one way or the other?
cc: @AndyAyersMS

@AndyAyersMS (Member)

Both of the ByReference<T> methods are already handled as intrinsics, and these paths get heavily exercised by code that uses Span<T>.

However, we haven't seen much (any?) use of ByReference<T> at ABI boundaries, so it would be a good idea to verify that the jit is passing these structs efficiently, especially on non-Windows ABIs. I believe all our supported ABIs allow pointer-sized structs to be passed by value, but I am not 100% sure.

So at a minimum we should also look at the codegen for Windows x86 and for Linux x64.

@AndyAyersMS (Member)

@stephentoub the jit is quite sensitive to details of the input IL. Having one return site versus two can lead to different decisions about inlining and the use of temps and code layout and so on.

All compilers are somewhat sensitive to the shapes of their input, but I think it is fair to say that the jit is perhaps a bit too sensitive.

In this particular case, if the goal is to convince the jit to automatically inline TryCopyTo I would suggest going back to the simpler version and adding the aggressive inline attribute.
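
For illustration, the simpler two-return shape plus the attribute might look roughly like this. It is based on the snippet quoted earlier in this thread; _length, _pointer, DangerousGetPinnableReference, and the Span.CopyTo<T> helper are the Span<T> internals shown in the diff, and the exact body in the PR may differ.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public bool TryCopyTo(Span<T> destination)
{
    // Two return sites; the attribute asks the JIT to inline regardless of the shape.
    if ((uint)_length > (uint)destination.Length)
        return false;

    Span.CopyTo<T>(ref destination.DangerousGetPinnableReference(), ref _pointer.Value, (nuint)_length);
    return true;
}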

[Intrinsic]
[NonVersionable]
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool IsBelow<T>(ref T left, ref T right)
Member Author

This PR no longer requires the IsBelow method; I'll remove the dependency.

// Finally, apply DeMorgan's theorem:
// if ((src < destEnd) && (dest < srcEnd)) { /* buffers overlap */ }

if (Unsafe.IsBelow(ref src.Value, ref destEnd) && Unsafe.IsBelow(ref dest.Value, ref srcEnd))
Member Author

This line is incorrect and will be reverted to its original implementation (or a higher-performance equivalent).
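
For reference, the overlap test that the quoted comment is driving at can be written with plain integer addresses as follows (an illustrative sketch, not CoreLib code):

static bool Overlaps(ulong src, ulong dest, ulong byteCount)
{
    // The two byteCount-long regions overlap exactly when each one starts
    // before the other one ends (De Morgan applied to the disjointness test).
    ulong srcEnd = src + byteCount;    // one past the last source byte
    ulong destEnd = dest + byteCount;  // one past the last destination byte
    return (src < destEnd) && (dest < srcEnd);
}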

@stephentoub (Member)

Great to see these perf improvements, @GrabYourPitchforks. Thanks.

@GrabYourPitchforks (Member, Author)

@AndyAyersMS On Win x64 I'm seeing proper inlining and tail-calling as expected, and there's no stack spillage. However, on Win x86 I'm seeing unexpected stack spillage inside the Buffer.Memmove(ByReference<T> ...) method. The weird thing is that even with the stack spillage I'm seeing better performance with the proposed changes. I may have to enlist some of you for assistance in investigating.

@AndyAyersMS (Member)

Feel free to post some assembly snippets.

x86 doesn't have very many allocatable registers, so it is not too surprising if there are extra uses of the stack compared to x64.

Also introduces an internal ref version of Buffer.Memmove
- Have single return from TryCopyTo
- Move 'nuint' parameters up the call stack for more efficient use of registers
@GrabYourPitchforks (Member, Author) commented Jan 25, 2018

I posted the disassembly of Memmove(void*, ...) and Memmove(ByReference<byte>, ...) for amd64. They're almost completely identical. The only difference is that the ByReference<byte> version performs a regular call instead of a tailcall into _Memmove. I'm investigating why this is. Otherwise this bodes very well for the byref implementation.

I'll post the x86 disassembly separately so we can investigate it.

@jkotas (Member) commented Jan 25, 2018

> The only difference is that the ByReference<byte> version performs a regular call instead of a tailcall

The JIT does the tailcall optimization only when it can prove that none of the addresses pointing into the local frame are used as arguments. I suspect that it is not able to prove that for the ByReference<byte> version.
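
A hypothetical illustration of that constraint (Consume and the callers are made-up names):

static int Consume(ref int value) => value + 1;

static int ForwardIncomingRef(ref int incoming)
{
    // 'incoming' is a caller-owned ref, so this frame holds nothing the callee
    // needs and the call may be turned into a tail call (jmp).
    return Consume(ref incoming);
}

static int PassLocalAddress()
{
    int local = 42;
    // The address of 'local' escapes into the argument list, so this frame must
    // stay alive across the call; the JIT emits a regular call instead.
    return Consume(ref local);
}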

@jkotas (Member) commented Jan 25, 2018

It may be a side effect of ByReference<T> that will get fixed by the eventual switch to a Roslyn that supports ref reassignment.

This breaks the Span dependency and allows access by any internal caller, such as arrays or strings
Moving the multiplication after the (elementCount != 0) check allows codegen to reuse existing registers, saving a few mov operations
@GrabYourPitchforks (Member, Author)

@AndyAyersMS @jkotas And here's the x86 disassembly: original and modified. Ignore the comparisons at the very beginning and very end of the function as I've moved some things around a bit.

In particular check out the parts around the movdqu block, which seem to be less compact in their representation and perform more shuffling of registers.

Original (pointer-based)

53c15feb 8bd8            mov     ebx,eax
53c15fed c1eb06          shr     ebx,6
53c15ff0 f30f6f02        movdqu  xmm0,xmmword ptr [edx]
53c15ff4 f30f7f01        movdqu  xmmword ptr [ecx],xmm0
53c15ff8 f30f6f4210      movdqu  xmm0,xmmword ptr [edx+10h]
53c15ffd f30f7f4110      movdqu  xmmword ptr [ecx+10h],xmm0
53c16002 f30f6f4220      movdqu  xmm0,xmmword ptr [edx+20h]
53c16007 f30f7f4120      movdqu  xmmword ptr [ecx+20h],xmm0
53c1600c f30f6f4230      movdqu  xmm0,xmmword ptr [edx+30h]
53c16011 f30f7f4130      movdqu  xmmword ptr [ecx+30h],xmm0
53c16016 83c140          add     ecx,40h
53c16019 83c240          add     edx,40h

Modified (ref-based)

53b9cbf5 8bf1            mov     esi,ecx
53b9cbf7 c1ee06          shr     esi,6
53b9cbfa 8b7d08          mov     edi,dword ptr [ebp+8]
53b9cbfd 8b5d0c          mov     ebx,dword ptr [ebp+0Ch]
53b9cc00 f30f6f07        movdqu  xmm0,xmmword ptr [edi]
53b9cc04 f30f7f03        movdqu  xmmword ptr [ebx],xmm0
53b9cc08 f30f6f4710      movdqu  xmm0,xmmword ptr [edi+10h]
53b9cc0d f30f7f4310      movdqu  xmmword ptr [ebx+10h],xmm0
53b9cc12 f30f6f4720      movdqu  xmm0,xmmword ptr [edi+20h]
53b9cc17 f30f7f4320      movdqu  xmmword ptr [ebx+20h],xmm0
53b9cc1c f30f6f4730      movdqu  xmm0,xmmword ptr [edi+30h]
53b9cc21 f30f7f4330      movdqu  xmmword ptr [ebx+30h],xmm0
53b9cc26 8b7d0c          mov     edi,dword ptr [ebp+0Ch]
53b9cc29 83c740          add     edi,40h
53b9cc2c 897d0c          mov     dword ptr [ebp+0Ch],edi
53b9cc2f 8b7d08          mov     edi,dword ptr [ebp+8]
53b9cc32 83c740          add     edi,40h
53b9cc35 897d08          mov     dword ptr [ebp+8],edi

@GrabYourPitchforks (Member, Author)

Note to reviewers: it's best to look at this commit-by-commit rather than file-by-file.

86b05d6 creates the ref-based Memmove routine and directs Span.CopyTo's workhorse method to call into it. The perf gain here comes from avoiding pinning.

77fad61 changes Span.CopyTo's workhorse method to take nuint instead of int as a parameter (avoiding some unnecessary mov instructions) and refactors TryCopyTo to have a single point of return, allowing inlining.

b5f075f moves the "are these buffers actually the same?" check down deep into the guts of Memmove. This means that there's less overhead for the common case of non-identical buffers. It additionally speeds up the "do these buffers overlap?" check at the beginning of Memmove by taking advantage of processor pipelining.

b129535 removes the special-casing of empty and single-element buffers from Span.CopyTo, opting instead to let Memmove handle it. This makes processing longer input buffers faster, but I haven't yet measured the impact to empty and single-element buffers.

fe2a593 refactors the Span.CopyTo logic into the Buffer class directly so that non-Span types can take advantage of it. The AggressiveInlining is there because the JIT wasn't inlining this routine into its caller, even though all the routine does is shift r8 (which contains elementCount) and tail-call into Memmove. I suspect the lack of automatic inlining has to do with the ByReference<T> local. Regardless, it saves a frame.

f527fea optimizes the public Span.CopyTo routine to avoid checking the length twice.

a9b61af improves the codegen for calling non-blittable memmove by pushing the multiplication / shift operation after the if (elementCount != 0) check. This allows more efficient usage of the r8 register in this method since it can be used both for the element count and for the total byte count.
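
A hypothetical before/after sketch of that reordering (names are illustrative, Memmove stands in for the internal ref-based routine or the write-barrier path, T is treated as blittable for simplicity, and nuint is CoreLib's native-width unsigned alias):

// Before: byteCount is computed up front, so it wants a register of its own even
// when the early-out is taken.
static void CopyToBefore<T>(ref T dest, ref T src, nuint elementCount)
{
    nuint byteCount = elementCount * (nuint)Unsafe.SizeOf<T>();
    if (elementCount != 0)
        Memmove(ref Unsafe.As<T, byte>(ref dest), ref Unsafe.As<T, byte>(ref src), byteCount);
}

// After: the multiply happens only on the non-empty path, so the register holding
// elementCount can be reused for byteCount.
static void CopyToAfter<T>(ref T dest, ref T src, nuint elementCount)
{
    if (elementCount != 0)
    {
        nuint byteCount = elementCount * (nuint)Unsafe.SizeOf<T>();
        Memmove(ref Unsafe.As<T, byte>(ref dest), ref Unsafe.As<T, byte>(ref src), byteCount);
    }
}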

7aa2dba allows skipping the checked arithmetic that's inside the IntPtr explicit conversion operator and allows us to use unchecked arithmetic instead, saving some comparisons. This only affects x86.
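
To illustrate that last point, a sketch of the two ways the byte count can be computed (hypothetical helper names; BIT64 is the CoreLib build symbol that drives the nuint alias):

#if BIT64
using nuint = System.UInt64;
#else
using nuint = System.UInt32;
#endif
using System;
using System.Runtime.CompilerServices;

static class ByteCountSketch
{
    // The explicit (IntPtr)(long) conversion range-checks on 32-bit platforms and
    // can throw OverflowException, which costs an extra comparison there.
    internal static IntPtr ByteCountChecked<T>(int elementCount)
        => (IntPtr)((long)elementCount * Unsafe.SizeOf<T>());

    // Doing the arithmetic entirely in the native-width unsigned integer keeps it
    // unchecked: on x86 this is a plain 32-bit multiply with no range check.
    internal static nuint ByteCountUnchecked<T>(nuint elementCount)
        => elementCount * (nuint)Unsafe.SizeOf<T>();
}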

@GrabYourPitchforks (Member, Author) commented Jan 26, 2018

With this PR, here's the final x64 codegen for a call to Span.CopyTo or Span.TryCopyTo where T is a blittable type:

; edi := src.Length
; ebx := dst.Length
; rbp := dst.Pointer
; rsi := src.Pointer
00007ff7`cffc0cb9 3bfb            cmp     edi,ebx
00007ff7`cffc0cbb 0f879d000000    ja      ERROR_DEST_TOO_SMALL
00007ff7`cffc0cc1 4c63c7          movsxd  r8,edi
00007ff7`cffc0cc4 49c1e002        shl     r8,2 ; I'm using a Span<int> for this demo
00007ff7`cffc0cc8 488bcd          mov     rcx,rbp
00007ff7`cffc0ccb 488bd6          mov     rdx,rsi
00007ff7`cffc0cce e87d82345f      call    System_Private_CoreLib!System.Buffer.Memmove(System.ByReference`1<Byte>, ...)

All that happens at the call site is that the length is checked, and if the check passes the registers are lined up correctly and a call is made directly to the new Buffer.Memmove implementation. All intermediate methods have been forcibly or automatically inlined since they were just performing glorified argument forwarding.

For non-blittable types, the codegen is a little longer, but it's not too bad. We still get wins from removing all of the overhead that used to exist.

; rsp+78h := dest span
; rsp+88h := src span
00007ff7`cb750d69 488b4c2478      mov     rcx,qword ptr [rsp+78h]
00007ff7`cb750d6e 448b842480000000 mov     r8d,dword ptr [rsp+80h]
00007ff7`cb750d76 4439842490000000 cmp     dword ptr [rsp+90h],r8d
00007ff7`cb750d7e 0f87b3000000    ja      ERROR_DEST_TOO_SMALL
00007ff7`cb750d84 488b942488000000 mov     rdx,qword ptr [rsp+88h]
00007ff7`cb750d8c 448b842490000000 mov     r8d,dword ptr [rsp+90h]
00007ff7`cb750d94 4d63c0          movsxd  r8,r8d
00007ff7`cb750d97 483bca          cmp     rcx,rdx ; bail if the buffers are identical
00007ff7`cb750d9a 740e            je      AFTER_CALL
00007ff7`cb750d9c 4d85c0          test    r8,r8 ; bail if src is empty
00007ff7`cb750d9f 7409            je      AFTER_CALL
00007ff7`cb750da1 49c1e003        shl     r8,3 ; I'm using Span<object> where references are 8 bytes
00007ff7`cb750da5 e8562ec25f      call    coreclr!MemoryNative::BulkMoveWithWriteBarrier (00007ff8`2b373c00)

The codegen for x86 and x64 is similar at the call site to Span.CopyTo/TryCopyTo. I've left separate comments detailing the codegen for the Memmove implementation.

@GrabYourPitchforks (Member, Author)

BTW one thing to consider is that if an exception is thrown (when using normal CopyTo and the destination is too short), the stack trace will no longer show the Span.CopyTo frame since it got inlined into the caller.

Unhandled Exception: System.ArgumentException: Destination is too short.
   at Program.Main(String[] args) in C:\blahblah\Program.cs:line 31

Because I have PDBs available the stack trace correctly shows my calling routine, and line 31 is indeed where I have written span.CopyTo(destSpanWhichIsTooShort);. But this may negatively impact ease of debugging if PDBs aren't available.

@jkotas (Member) commented Jan 26, 2018

> For non-blittable types, the codegen is a little longer

Is there a good reason for that? It should be pretty straightforward to move the special case handling to BulkMoveWithWriteBarrier.

// check, and one for the result of TryCopyTo. Since these checks are equivalent,
// we can optimize by performing the check once ourselves then calling Memmove directly.

if ((uint)_length <= (uint)destination.Length)
Member

You cannot use nuint for the portable span today. Anything that uses nuint should be under #if !FEATURE_PORTABLE_SPAN or in .Fast.cs files. You may want to copy the shared directory over to corefx and fix it up to make it build, so that it is prepared for mirroring.

Member Author

Copying the shared directory over still builds correctly on corefx, but that's because corefx has its own implementation of Span separate from the shared directory implementation. I opened https://github.com/dotnet/corefx/issues/26611 to track making changes to corefx to get it to consume the shared implementation.

In the meantime per our separate conversation we shouldn't let this block this PR.

// Non-inlinable wrapper around the QCall that avoids polluting the fast path

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static void Memmove<T>(ref T destination, ref T source, int elementCount)
Member

I would avoid this intermediate layer and move the nuint -> int cast to the caller.

Note that the JIT is not able to inline generic methods called from other generic methods in shared generic code today. This will avoid some of that problem.

Member Author

Addressed in 39abfd0.

@jkotas (Member) commented Jan 26, 2018

LGTM otherwise. Thanks!

@GrabYourPitchforks (Member, Author)

> Is there a good reason for that? It should be pretty straightforward to move the special case handling to BulkMoveWithWriteBarrier.

It's not the special-case handling. For some reason my Span<int> instances are being enregistered, but when I change the code to use Span<object> instead, things spill over to the stack. I think it's the codegen of the caller - not the callee - that is causing this inefficiency. Need to come up with a minimal repro though.

This should've set the value at the referenced address, not changed the reference
Take advantage of the fact that we're already computing srcEnd and destEnd to try to short-circuit the main overlap / disjoint check
@GrabYourPitchforks (Member, Author) commented Jan 26, 2018

More commits for reviewers:

dac2cbd fixes a logic error where we're sometimes forgetting to copy a single byte from src to dest.
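
In ByReference<T> terms, the distinction the commit message above draws ("should've set the value at the referenced address, not changed the reference") looks roughly like this (illustrative code, not the actual diff):

static void CopySingleByte(ByReference<byte> dest, ByReference<byte> src)
{
    // Wrong: re-points the local 'dest' wrapper at the source; no byte is written.
    //   dest = new ByReference<byte>(ref src.Value);

    // Right: writes the source byte through the reference 'dest' currently holds.
    dest.Value = src.Value;
}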

ac116f2 improves performance of Memmove by using existing computations to short-circuit the "do these buffers overlap?" check. This change results in a 5 - 10% throughput gain in my microbenchmarks while copying small buffers. This likely regresses performance in the case where the buffers really do overlap since it introduces additional branching before we make the final "yup, these overlap!" determination. Will be reverted.

if (!TryCopyTo(destination))
// Using "if (!TryCopyTo(...))" results in two branches: one for the length
// check, and one for the result of TryCopyTo. Since these checks are equivalent,
// we can optimize by performing the check once ourselves then calling Memmove directly.
@stephentoub (Member) commented Jan 26, 2018

Semi-related to this PR, since you've been looking at the perf a lot, I've had a bunch of cases where I need to use {Try}CopyTo and I know from previous checks that the destination is long enough. Given that, I generally preferred to use TryCopyTo and then either ignore the return value or assert that it's true, but that was when CopyTo called TryCopyTo and I could avoid the same extra branch you mention in the comment. Thoughts on the right approach here? Any sense of how much the length check costs, and whether there's any benefit in exposing an "unsafe" variant that skips the check entirely? Maybe for the uses in corelib I should just call the new memmove directly?
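
For concreteness, the pattern being described might look like this (illustrative code, not from the PR):

using System;
using System.Diagnostics;

static class CopyExample
{
    static void CopyKnownToFit(ReadOnlySpan<byte> source, Span<byte> destination)
    {
        // The caller has already ensured the destination is long enough, so the
        // result of TryCopyTo is only asserted rather than branched on.
        Debug.Assert(destination.Length >= source.Length);

        bool copied = source.TryCopyTo(destination);
        Debug.Assert(copied); // cannot fail given the check above
    }
}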

Member

The length checks are cheap. I do not think we want to be avoiding them, even for corelib.

If there are situations where they are obviously redundant, we should make sure that the JIT is optimizing the redundancy out.

Member Author

Opened https://github.com/dotnet/coreclr/issues/16042 to track removing redundant comparisons.

The proposed replacement logic didn't perform any better against a more realistic test battery
@GrabYourPitchforks merged commit e07292d into dotnet:master on Jan 27, 2018
dotnet-bot and jkotas pushed commits referencing this pull request to dotnet/corert and dotnet/corefx on Jan 27, 2018:

* Introduce a ref-based version of Buffer.Memmove
* Remove pinning logic from Span.CopyTo
* Tweak flow graph of Span.CopyTo / TryCopyTo in order to encourage better codegen
* Push some uncommon logic (one-element buffers, perfectly overlapping buffers) down to Memmove
@GrabYourPitchforks deleted the levib/improve_span_2 branch on January 29, 2018