[ARM64] Use SHRN in SpanHelpers.CountValueType #89632
neon-sunset wants to merge 4 commits into dotnet:main
Conversation
|
Tagging subscribers to this area: @dotnet/area-system-memory

Issue Details: This change avoids using movemask emulation on Arm64 and uses SHRN instead. Numbers:
Benchmark:

```csharp
class Counting
{
    private U8String ThirdPartyNotices;

    [Params(100, 1000, int.MaxValue)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        var notices = new HttpClient()
            .GetU8StringAsync("https://raw.githubusercontent.com/dotnet/runtime/main/THIRD-PARTY-NOTICES.TXT")
            .GetAwaiter()
            .GetResult();

        ThirdPartyNotices = Length is int.MaxValue ? notices : notices[..Length];
    }

    // Both variants just end up calling old/new MemoryExtensions.Count(span, (byte)'\n')
    [Benchmark]
    public int CountLines() => ThirdPartyNotices.Lines.Count;

    [Benchmark]
    public int CountLinesNew() => ThirdPartyNotices.Lines.CountNew();
}
```

I need to fix the setup for benchmarking against two proper coreroots on my machine, but hopefully this is sufficient.
|
    1 => 2,
    2 => 3,
    4 => 4,
    _ => 5
I could not come up with a better way to write this.
```csharp
count += (int)((uint)BitOperations.PopCount(matches) / (4 * sizeof(T)));
```
Thanks, it seems the JIT can't completely fold this:

```csharp
[MethodImpl(MethodImplOptions.AggressiveOptimization | MethodImplOptions.NoInlining)]
static int CountAdvSimd<T>(ReadOnlySpan<T> span, T value)
{
    var eqmask = Vector128.Equals(Vector128.Create(span), Vector128.Create(value));
    var matches = AdvSimd.ShiftRightLogicalNarrowingLower(eqmask.AsUInt16(), 4).AsUInt64().ToScalar();

    // or (int)((uint)BitOperations.PopCount(matches) / (4 * sizeof(T)));
    return BitOperations.PopCount(matches) >> (Unsafe.SizeOf<T>() switch
    {
        1 => 2,
        2 => 3,
        4 => 4,
        _ => 5
    });
}
```

Diff where T is ushort: https://www.diffchecker.com/hzMEBfhM/
I don't see any codegen differences locally
The version by @MihaZupan is faster 🤯
diff:

```diff
G_M000_IG04:        ;; offset=0028H
        ldr     q17, [x0]
        cmeq    v17.16b, v17.16b, v16.16b
        shrn    v17.8b, v17.8h, #4
        umov    x5, v17.d[0]
        mov     v17.d[0], x5
        cnt     v17.8b, v17.8b
        addv    b17, v17.8b
        umov    w5, v17.b[0]
-       add     w3, w3, w5, LSR #2
+       mov     w5, w5
+       lsr     x5, x5, #2
+       add     w3, w5, w3
        add     x0, x0, #16
        cmp     x0, x2
        bls     G_M000_IG04
```

I think I know why. According to https://dougallj.github.io/applecpu/firestorm-int.html, ADD (register, lsr, 32-bit) has a reciprocal throughput of 0.333 per cycle with 2-cycle latency, while both ADD (register, 32-bit) and LSR (immediate, 32-bit) have a reciprocal throughput of 0.111 with 1-cycle latency. This looks like a stall on the backwards jump, because if I forcibly use the V256 (V128x2) version it is marginally faster too, despite the codegen.
According to napkin math, the slower version handles around 5.8-6 bytes per cycle while the faster one handles ~8. If the assumption above is correct, that would explain the difference.
All in all, this was very counterintuitive. TIL.
|
We really want to try and balance the code here. Maintaining explicitly separate code paths for different platforms is error prone and adds a lot of complexity to the managed code-base. In this particular case, the main reason it's faster is because scalar […]. But we could also add the minimal pattern recognition to the JIT instead, so that the common pattern of […] |
|
@tannergooding I see. Given the prevalence of doing eq + movemask + tzcnt/popcnt everywhere, would it make sense to resurrect #83010 (or submit a new one), since […]? To me it seems easier to do than to teach the JIT to recognize all the shapes `.ExtractMostSignificantBits()` + SomeOp can take. |
|
We're ultimately going to want/need both. I'll be opening a proposal covering some […] |
|
Thank you. As for this PR, what should I do with it? Apply the suggestion from @MihaZupan, keep it as is, or change it to a draft and just let it sit until there are changes in the runtime? |
|
I think in this case we want to leave it as is and track doing the JIT pattern matching instead. I wouldn't, however, be opposed to an internal […] |
|
It's already tracked; the trick used here is from the Arm blog, and we have #76047 to track it in the JIT. It's something we definitely want to fix eventually.
I'll submit a separate PR for this which introduces an internal […]
Somewhat related to #76047