-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Single pass Vectorize Span<byte>.IndexOfAny(..., ReadOnlySpan<byte>) for 2,3 lengths #20738
Conversation
|
Could extend this for @ahsonkhan does the Utf8 Json use 4, 5 lengths (i.e. is it worth doing?) |
|
We already have good tests covering all of these lengths? |
Yes, for the specialized overloads that are called; some for these lengths with the |
|
Added tests dotnet/corefx#33185 |
|
@dotnet-bot test Windows_NT x64 Checked corefx_baseline |
|
Can you share perf results? |
I don't currently have a concrete scenario where I need it, especially since I wrote my own custom IndexOfAny with "less than" support (at least for the reader), but I am sure one will come up. I will let you know if I see it. |
|
@dotnet-bot test |
I would be interested in seeing the impact on small spans. Wouldn't these length checks regress performance there? Do we care? |
You end up pushing it upstream without it, like dotnet/corefx#33288 (review) |
|
It would still be good to see the perf results. |
Agreed :) |
f5600d9 to
85d0687
Compare
|
Not sure I'm running the benchmarks correctly since it should be processing 32 bytes at a time vs 1 byte at a time |
|
Was using |
|
Manual inlined if can be faster though varies https://gist.github.com/benaadams/cb059cda0c745fff079ba99d2db7f1be Method | Toolchain | Length | Items | Mean |
------------------ |---------- |------- |------ |--------------:|
- IndexOfAny | Current | 2 | 2 | 35.46 ns |
+ IndexOfAny | PR | 2 | 2 | 20.44 ns |
IndexOfAny_Manual | PR | 2 | 2 | 20.27 ns |
| | | | |
- IndexOfAny | Current | 7 | 2 | 117.00 ns |
+ IndexOfAny | PR | 7 | 2 | 66.73 ns |
IndexOfAny_Manual | PR | 7 | 2 | 56.13 ns |
| | | | |
- IndexOfAny | Current | 8 | 2 | 124.61 ns |
+ IndexOfAny | PR | 8 | 2 | 64.09 ns |
IndexOfAny_Manual | PR | 8 | 2 | 60.59 ns |
| | | | |
- IndexOfAny | Current | 15 | 2 | 272.92 ns |
+ IndexOfAny | PR | 15 | 2 | 156.86 ns |
IndexOfAny_Manual | PR | 15 | 2 | 125.27 ns |
| | | | |
- IndexOfAny | Current | 16 | 2 | 286.71 ns |
+ IndexOfAny | PR | 16 | 2 | 154.91 ns |
IndexOfAny_Manual | PR | 16 | 2 | 133.83 ns |
| | | | |
- IndexOfAny | Current | 31 | 2 | 737.20 ns |
+ IndexOfAny | PR | 31 | 2 | 418.17 ns |
IndexOfAny_Manual | PR | 31 | 2 | 323.98 ns |
| | | | |
- IndexOfAny | Current | 32 | 2 | 745.88 ns |
+ IndexOfAny | PR | 32 | 2 | 419.49 ns |
IndexOfAny_Manual | PR | 32 | 2 | 338.77 ns |
| | | | |
- IndexOfAny | Current | 63 | 2 | 2,223.70 ns |
+ IndexOfAny | PR | 63 | 2 | 1,328.81 ns |
IndexOfAny_Manual | PR | 63 | 2 | 978.05 ns |
| | | | |
- IndexOfAny | Current | 64 | 2 | 1,204.10 ns |
+ IndexOfAny | PR | 64 | 2 | 588.10 ns |
IndexOfAny_Manual | PR | 64 | 2 | 697.03 ns |
| | | | |
- IndexOfAny | Current | 127 | 2 | 3,836.02 ns |
+ IndexOfAny | PR | 127 | 2 | 2,333.05 ns |
IndexOfAny_Manual | PR | 127 | 2 | 1,579.54 ns |
| | | | |
- IndexOfAny | Current | 128 | 2 | 2,658.81 ns |
+ IndexOfAny | PR | 128 | 2 | 1,347.84 ns |
IndexOfAny_Manual | PR | 128 | 2 | 1,513.46 ns |
| | | | |
- IndexOfAny | Current | 255 | 2 | 8,651.33 ns |
+ IndexOfAny | PR | 255 | 2 | 5,521.69 ns |
IndexOfAny_Manual | PR | 255 | 2 | 3,500.87 ns |
| | | | |
- IndexOfAny | Current | 256 | 2 | 6,292.78 ns |
+ IndexOfAny | PR | 256 | 2 | 3,550.50 ns |
IndexOfAny_Manual | PR | 256 | 2 | 3,458.94 ns |
| | | | |
- IndexOfAny | Current | 1023 | 2 | 67,384.01 ns |
+ IndexOfAny | PR | 1023 | 2 | 42,926.52 ns |
IndexOfAny_Manual | PR | 1023 | 2 | 26,403.50 ns |
| | | | |
- IndexOfAny | Current | 1024 | 2 | 66,452.48 ns |
+ IndexOfAny | PR | 1024 | 2 | 42,186.38 ns |
IndexOfAny_Manual | PR | 1024 | 2 | 27,256.97 ns | Method | Toolchain | Length | Items | Mean |
------------------ |---------- |------- |------ |--------------:|
- IndexOfAny | Current | 2 | 3 | 42.34 ns |
+ IndexOfAny | PR | 2 | 3 | 19.67 ns |
IndexOfAny_Manual | PR | 2 | 3 | 22.60 ns |
| | | | |
- IndexOfAny | Current | 7 | 3 | 155.78 ns |
+ IndexOfAny | PR | 7 | 3 | 66.80 ns |
IndexOfAny_Manual | PR | 7 | 3 | 65.85 ns |
| | | | |
- IndexOfAny | Current | 8 | 3 | 162.62 ns |
+ IndexOfAny | PR | 8 | 3 | 64.12 ns |
IndexOfAny_Manual | PR | 8 | 3 | 72.50 ns |
| | | | |
- IndexOfAny | Current | 15 | 3 | 369.95 ns |
+ IndexOfAny | PR | 15 | 3 | 156.88 ns |
IndexOfAny_Manual | PR | 15 | 3 | 150.53 ns |
| | | | |
- IndexOfAny | Current | 16 | 3 | 395.62 ns |
+ IndexOfAny | PR | 16 | 3 | 154.56 ns |
IndexOfAny_Manual | PR | 16 | 3 | 164.60 ns |
| | | | |
- IndexOfAny | Current | 31 | 3 | 1,061.43 ns |
+ IndexOfAny | PR | 31 | 3 | 417.60 ns |
IndexOfAny_Manual | PR | 31 | 3 | 414.74 ns |
| | | | |
- IndexOfAny | Current | 32 | 3 | 1,097.19 ns |
+ IndexOfAny | PR | 32 | 3 | 419.65 ns |
IndexOfAny_Manual | PR | 32 | 3 | 436.95 ns |
| | | | |
- IndexOfAny | Current | 63 | 3 | 3,426.07 ns |
+ IndexOfAny | PR | 63 | 3 | 1,327.05 ns |
IndexOfAny_Manual | PR | 63 | 3 | 1,292.54 ns |
| | | | |
- IndexOfAny | Current | 64 | 3 | 1,536.31 ns |
+ IndexOfAny | PR | 64 | 3 | 590.88 ns |
IndexOfAny_Manual | PR | 64 | 3 | 806.08 ns |
| | | | |
- IndexOfAny | Current | 128 | 3 | 3,502.00 ns |
+ IndexOfAny | PR | 128 | 3 | 1,348.12 ns |
IndexOfAny_Manual | PR | 128 | 3 | 1,731.29 ns |
| | | | |
- IndexOfAny | Current | 255 | 3 | 13,177.27 ns |
+ IndexOfAny | PR | 255 | 3 | 5,524.36 ns |
IndexOfAny_Manual | PR | 255 | 3 | 4,174.94 ns |
| | | | |
- IndexOfAny | Current | 256 | 3 | 8,624.20 ns |
+ IndexOfAny | PR | 256 | 3 | 3,551.49 ns |
IndexOfAny_Manual | PR | 256 | 3 | 4,060.20 ns |
| | | | |
- IndexOfAny | Current | 1023 | 3 | 103,632.91 ns |
+ IndexOfAny | PR | 1023 | 3 | 42,987.17 ns |
IndexOfAny_Manual | PR | 1023 | 3 | 34,636.19 ns |
| | | | |
- IndexOfAny | Current | 1024 | 3 | 99,208.55 ns |
+ IndexOfAny | PR | 1024 | 3 | 42,090.74 ns |
IndexOfAny_Manual | PR | 1024 | 3 | 34,160.30 ns | |
|
Thanks, @benaadams. It's good to see it improves the cases that were meant to be improved... but what about the cases of > 3 search items that the change theoretically harms? Is the impact there negligible, even for small data lengths? |
Mostly noise except for longer lengths where the change to use Method | Length | Items | Mean |
----------- |------- |------ |--------------:|
- IndexOfAny | 2 | 4 | 59.78 ns |
+ IndexOfAny | 2 | 4 | 54.72 ns |
- IndexOfAny | 7 | 4 | 227.85 ns |
+ IndexOfAny | 7 | 4 | 198.02 ns |
- IndexOfAny | 8 | 4 | 212.65 ns |
+ IndexOfAny | 8 | 4 | 210.93 ns |
- IndexOfAny | 15 | 4 | 549.28 ns |
+ IndexOfAny | 15 | 4 | 505.15 ns |
- IndexOfAny | 16 | 4 | 514.03 ns |
+ IndexOfAny | 16 | 4 | 528.20 ns |
- IndexOfAny | 31 | 4 | 1,453.29 ns |
+ IndexOfAny | 31 | 4 | 1,487.49 ns |
- IndexOfAny | 32 | 4 | 1,427.88 ns |
+ IndexOfAny | 32 | 4 | 1,472.47 ns |
- IndexOfAny | 63 | 4 | 4,507.29 ns |
+ IndexOfAny | 63 | 4 | 4,679.04 ns |
- IndexOfAny | 64 | 4 | 3,644.58 ns |
+ IndexOfAny | 64 | 4 | 3,901.28 ns |
- IndexOfAny | 127 | 4 | 8,546.02 ns |
+ IndexOfAny | 127 | 4 | 8,306.65 ns |
- IndexOfAny | 128 | 4 | 5,041.45 ns |
+ IndexOfAny | 128 | 4 | 4,462.34 ns |
- IndexOfAny | 255 | 4 | 20,853.73 ns |
+ IndexOfAny | 255 | 4 | 19,632.51 ns |
- IndexOfAny | 256 | 4 | 13,268.79 ns |
+ IndexOfAny | 256 | 4 | 11,142.29 ns |
- IndexOfAny | 1023 | 4 | 166,646.23 ns |
+ IndexOfAny | 1023 | 4 | 142,920.65 ns |
- IndexOfAny | 1024 | 4 | 171,990.28 ns |
+ IndexOfAny | 1024 | 4 | 142,627.71 ns | |
|
Can you make the Bmi1 change separately and measure this one in its own? |
Margin of error Method | Length | Items | Mean |
----------- |------- |------ |--------------:|
- IndexOfAny | 2 | 4 | 59.78 ns |
+ IndexOfAny | 2 | 4 | 58.20 ns |
- IndexOfAny | 7 | 4 | 227.85 ns |
+ IndexOfAny | 7 | 4 | 216.25 ns |
- IndexOfAny | 8 | 4 | 212.65 ns |
+ IndexOfAny | 8 | 4 | 209.11 ns |
- IndexOfAny | 15 | 4 | 549.28 ns |
+ IndexOfAny | 15 | 4 | 524.19 ns |
- IndexOfAny | 16 | 4 | 514.03 ns |
+ IndexOfAny | 16 | 4 | 522.20 ns |
- IndexOfAny | 31 | 4 | 1,453.29 ns |
+ IndexOfAny | 31 | 4 | 1,438.64 ns |
- IndexOfAny | 32 | 4 | 1,427.88 ns |
+ IndexOfAny | 32 | 4 | 1,486.44 ns |
- IndexOfAny | 63 | 4 | 4,507.29 ns |
+ IndexOfAny | 63 | 4 | 4,694.68 ns |
- IndexOfAny | 64 | 4 | 3,644.58 ns |
+ IndexOfAny | 64 | 4 | 3,753.87 ns |
- IndexOfAny | 127 | 4 | 8,546.02 ns |
+ IndexOfAny | 127 | 4 | 8,526.17 ns |
- IndexOfAny | 128 | 4 | 5,041.45 ns |
+ IndexOfAny | 128 | 4 | 4,940.50 ns |
- IndexOfAny | 255 | 4 | 20,853.73 ns |
+ IndexOfAny | 255 | 4 | 20,703.20 ns |
- IndexOfAny | 256 | 4 | 13,268.79 ns |
+ IndexOfAny | 256 | 4 | 13,011.07 ns |
- IndexOfAny | 1023 | 4 | 166,646.23 ns |
+ IndexOfAny | 1023 | 4 | 166,523.84 ns |
- IndexOfAny | 1024 | 4 | 171,990.28 ns |
+ IndexOfAny | 1024 | 4 | 166,031.17 ns | |
|
|
Still improved Method | Length | Items | Mean |
------------------ |------- |------ |--------------:|
- IndexOfAny | 2 | 2 | 39.51 ns |
+ IndexOfAny | 2 | 2 | 21.58 ns |
IndexOfAny_Manual | 2 | 2 | 20.24 ns |
- IndexOfAny | 7 | 2 | 136.04 ns |
+ IndexOfAny | 7 | 2 | 62.31 ns |
IndexOfAny_Manual | 7 | 2 | 55.72 ns |
- IndexOfAny | 8 | 2 | 128.29 ns |
+ IndexOfAny | 8 | 2 | 67.50 ns |
IndexOfAny_Manual | 8 | 2 | 60.37 ns |
- IndexOfAny | 15 | 2 | 307.02 ns |
+ IndexOfAny | 15 | 2 | 137.70 ns |
IndexOfAny_Manual | 15 | 2 | 125.55 ns |
- IndexOfAny | 16 | 2 | 287.40 ns |
+ IndexOfAny | 16 | 2 | 146.18 ns |
IndexOfAny_Manual | 16 | 2 | 133.20 ns |
- IndexOfAny | 31 | 2 | 763.51 ns |
+ IndexOfAny | 31 | 2 | 351.90 ns |
IndexOfAny_Manual | 31 | 2 | 326.99 ns |
- IndexOfAny | 32 | 2 | 747.68 ns |
+ IndexOfAny | 32 | 2 | 365.57 ns |
IndexOfAny_Manual | 32 | 2 | 338.81 ns |
- IndexOfAny | 63 | 2 | 2,183.88 ns |
+ IndexOfAny | 63 | 2 | 1,025.29 ns |
IndexOfAny_Manual | 63 | 2 | 972.57 ns |
- IndexOfAny | 64 | 2 | 1,345.44 ns |
+ IndexOfAny | 64 | 2 | 774.74 ns |
IndexOfAny_Manual | 64 | 2 | 712.65 ns |
- IndexOfAny | 127 | 2 | 4,211.72 ns |
+ IndexOfAny | 127 | 2 | 1,719.49 ns |
IndexOfAny_Manual | 127 | 2 | 1,576.10 ns |
- IndexOfAny | 128 | 2 | 2,964.58 ns |
+ IndexOfAny | 128 | 2 | 1,629.05 ns |
IndexOfAny_Manual | 128 | 2 | 1,515.89 ns |
- IndexOfAny | 255 | 2 | 9,374.84 ns |
+ IndexOfAny | 255 | 2 | 3,771.19 ns |
IndexOfAny_Manual | 255 | 2 | 3,456.91 ns |
- IndexOfAny | 256 | 2 | 7,395.26 ns |
+ IndexOfAny | 256 | 2 | 3,655.45 ns |
IndexOfAny_Manual | 256 | 2 | 3,411.30 ns |
- IndexOfAny | 1023 | 2 | 78,092.99 ns |
+ IndexOfAny | 1023 | 2 | 32,838.62 ns |
IndexOfAny_Manual | 1023 | 2 | 26,750.79 ns |
- IndexOfAny | 1024 | 2 | 76,674.96 ns |
+ IndexOfAny | 1024 | 2 | 32,486.86 ns |
IndexOfAny_Manual | 1024 | 2 | 26,418.94 ns | Method | Length | Items | Mean |
------------------ |------- |------ |--------------:|
- IndexOfAny | 2 | 3 | 50.09 ns |
+ IndexOfAny | 2 | 3 | 23.88 ns |
IndexOfAny_Manual | 2 | 3 | 21.74 ns |
- IndexOfAny | 7 | 3 | 180.93 ns |
+ IndexOfAny | 7 | 3 | 69.78 ns |
IndexOfAny_Manual | 7 | 3 | 62.39 ns |
- IndexOfAny | 8 | 3 | 170.01 ns |
+ IndexOfAny | 8 | 3 | 78.10 ns |
IndexOfAny_Manual | 8 | 3 | 70.88 ns |
- IndexOfAny | 15 | 3 | 425.32 ns |
+ IndexOfAny | 15 | 3 | 167.11 ns |
IndexOfAny_Manual | 15 | 3 | 150.56 ns |
- IndexOfAny | 16 | 3 | 403.85 ns |
+ IndexOfAny | 16 | 3 | 174.06 ns |
IndexOfAny_Manual | 16 | 3 | 160.84 ns |
- IndexOfAny | 31 | 3 | 1,109.71 ns |
+ IndexOfAny | 31 | 3 | 441.54 ns |
IndexOfAny_Manual | 31 | 3 | 411.79 ns |
- IndexOfAny | 32 | 3 | 1,086.02 ns |
+ IndexOfAny | 32 | 3 | 463.34 ns |
IndexOfAny_Manual | 32 | 3 | 429.43 ns |
- IndexOfAny | 63 | 3 | 3,374.42 ns |
+ IndexOfAny | 63 | 3 | 1,344.50 ns |
IndexOfAny_Manual | 63 | 3 | 1,287.77 ns |
- IndexOfAny | 64 | 3 | 1,771.31 ns |
+ IndexOfAny | 64 | 3 | 846.24 ns |
IndexOfAny_Manual | 64 | 3 | 796.44 ns |
- IndexOfAny | 127 | 3 | 6,118.98 ns |
+ IndexOfAny | 127 | 3 | 1,945.70 ns |
IndexOfAny_Manual | 127 | 3 | 1,870.04 ns |
- IndexOfAny | 128 | 3 | 4,058.16 ns |
+ IndexOfAny | 128 | 3 | 1,835.06 ns |
IndexOfAny_Manual | 128 | 3 | 1,698.14 ns |
- IndexOfAny | 255 | 3 | 14,491.22 ns |
+ IndexOfAny | 255 | 3 | 4,417.77 ns |
IndexOfAny_Manual | 255 | 3 | 4,220.07 ns |
- IndexOfAny | 256 | 3 | 10,220.72 ns |
+ IndexOfAny | 256 | 3 | 4,275.11 ns |
IndexOfAny_Manual | 256 | 3 | 3,998.18 ns |
- IndexOfAny | 1023 | 3 | 121,068.50 ns |
+ IndexOfAny | 1023 | 3 | 41,598.90 ns |
IndexOfAny_Manual | 1023 | 3 | 34,565.33 ns |
- IndexOfAny | 1024 | 3 | 114,939.50 ns |
+ IndexOfAny | 1024 | 3 | 41,817.49 ns |
IndexOfAny_Manual | 1024 | 3 | 34,435.45 ns | |
stephentoub
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
| ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(values)), | ||
| values.Length); | ||
| ref byte valueRef = ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(values)); | ||
| if (values.Length == 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a values.Length == 1 check and call IndexOf or is it not worth penalizing the common cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had it at one point #20738 (comment)
Its still goes a Vectorized route and is still single pass (as there is only one item); it has a little bit extra overhead from an unnecessary loop of one iteration; so its only about fixed costs rather than a different technique being used.
The caller has already added overhead by creating a ReadOnlySpan or array of a single element to call this overhead; rather than just call .IndexOf, am not sure in what circumstances you'd do this?
…for 2,3 lengths (dotnet/coreclr#20738) Commit migrated from dotnet/coreclr@d76d97f

up to x2.3 times faster for 2 element
valuesup to x3.2 times faster for 3 element
valuesBenchmarks #20738 (comment)
Span<byte>.IndexOfAny(..., ReadOnlySpan<byte> values)is vectorized for each element invalueshowever it iterates over each one individually.This adds the faster paths used by
IndexOf,IndexOfAny(.., byte, byte)andIndexOfAny(.., byte, byte, byte)for thevalueslengths 2,3 for/cc @ahsonkhan @jkotas @stephentoub @JeremyKuhne