-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Improve performance of String.Equals(..., OrdinalIgnoreCase) #20734
Improve performance of String.Equals(..., OrdinalIgnoreCase) #20734
Conversation
A 15% boost to Equals OrdinalIgnoreCase is worth some code complexity, IMO. Especially as 64 bit is the 90% case right now. Recall that for Bing, improvements to Equals and IndexOf (operations only a little more primitive) were the single biggest contributor to their performance gains on .NET Core. |
|
I'll investigate the test failures. It's possible I ran the tests locally against the wrong version of the DLL. |
src/System.Private.CoreLib/shared/System/Globalization/CompareInfo.cs
Outdated
Show resolved
Hide resolved
danmoseley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clever. Did you come up with this idea? Someone else should probably have a look.
|
Given the test failures, apparently I overlooked a mistake. |
| // performing a proper deep equality check of the string contents. We want to optimize for | ||
| // the case where the equality check is likely to succeed, which means that we want to avoid | ||
| // branching within this loop unless we're about to exit the loop, either due to failure or | ||
| // due to us running out of input data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The perf results you posted appear to only cover cases where the strings are considered equal. What does perf look like for inputs of the same length that are not equal, for inputs of different lengths, etc.? I agree that dictionary-like data structures are a very common case, but there are plenty of others where there isn't such a first-pass check. Search in src\System.Net.Http\src for ".Equals(", for example, and you'll find lots of OrdinalIgnoreCase checks. Path has several uses without such checks. TimeZoneInfo. X509Certificates. And so on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inputs of different length never hit this method because they're always kicked out by the caller. For inputs that truly aren't equal, the performance is still better than baseline, but not quite as good as if they are truly equal. I can provide benchmark numbers with those tests if you'd like. It'll take a while to rerun all of the benchmarks if we expand the battery to include these, but that's what "go make dinner and let this run in the background" is for. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For inputs that truly aren't equal, the performance is still better than baseline, but not quite as good as if they are truly equal.
Great. I'd be happy with not regressing those cases.
| } | ||
|
|
||
| return CompareInfo.EqualsOrdinalIgnoreCase(ref x.GetRawStringData(), ref y.GetRawStringData(), x.Length); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is effectively just manually inlining the previous string.Equals call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, yeah. It's also skipping the unnecessary overhead that String.Equals(...) performs re: length checking, referential equality, and validation of StringComparison parameter, because the caller knows these checks are irrelevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is checking reference equality.
Why isn't the length check beneficial here?
For validation of the StringComparison, since it was a constant here I would have hoped if it actually was inlined (I realize it wouldn't be without nudging) the JIT would then be able to prune those checks away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, that's an oversight. It looks like the unit tests are passing regardless, so I'll add a test for this condition to check for future regressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So at this point it's saving a non-virtual method call and a switch. Given all the other work being done, does that end up being measurable? If so, ok. If not, it'd be better to avoid the code duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll rerun a test with only this change and get new numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Method | Toolchain | Mean | Error | StdDev | Scaled |
|---|---|---|---|---|---|
| EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase | modified comparer | 5.017 ms | 0.0637 ms | 0.0532 ms | 0.75 |
| EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase | original comparer | 6.675 ms | 0.1057 ms | 0.0989 ms | 1.00 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the perf boost is coming from a combination of not having to perform register shuffling (since the StringComparer.Equals signature is different from the String.Equals signature) and providing the ability for the JIT to tail-call into the workhorse method due to no shuffling being required with the modified comparer.
src/System.Private.CoreLib/shared/System/Globalization/CompareInfo.cs
Outdated
Show resolved
Hide resolved
|
|
||
| Debug.Assert(!GlobalizationMode.Invariant); | ||
| NonAscii: | ||
| return CompareStringOrdinalIgnoreCase(ref Unsafe.AddByteOffset(ref charA, byteOffset), length, ref Unsafe.AddByteOffset(ref charB, byteOffset), length) == 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The next iteration will have this moved to a different stub method. Turns out this helps with getting the JIT to emit better prolog in the caller.
src/System.Private.CoreLib/shared/System/Globalization/CompareInfo.cs
Outdated
Show resolved
Hide resolved
| /// This is a branchless implementation. | ||
| /// </remarks> | ||
| [MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
| internal static bool UInt64OrdinalIgnoreCaseAscii(ulong valueA, ulong valueB) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method has a slightly different implementation than the 32-bit version due to the fact that 32-bit operations and 64-bit operations are represented differently by the processor. For example, operations like test and or can take imm32 values but cannot take imm64 values - those values must first be bounced through a temporary register. The 64-bit version of this method is slightly reordered to account for these effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method name should have Equals in it as it doesn't specify what its doing currently
|
Updated perf numbers from the overnight benchmark:
|
|
@GrabYourPitchforks are your numbers only on 64 bit? Did you spot check 32 bit also? |
|
These are 64-bit only, since that's what our benchmarks are set up for. I don't know how to run 32-bit benchmarks but can follow up offline with folks. |
a2d0a95 to
e35dca3
Compare
|
Most recent iteration has been rebased on latest master to resolve the unit test failures. |
|
Here's the benchmark for the 32-bit run. tl;dr is that the gain is approximately on par for most cases, but the non-ASCII case is regressed. (This regression doesn't exist on x64.) It's not ideal, but I think the 99% usage for
Any other objections or feedback before I hit the shiny green button? |
|
Will revisit #19436 after this :) |
|
I'll commit this now since it seems to be good and it'll unblock people like Ben who are waiting. If there's any additional feedback file a separate issue. Thanks all for the review! |
| /// This is a branchless implementation. | ||
| /// </remarks> | ||
| [MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
| internal static bool UInt32OrdinalIgnoreCaseAscii(uint valueA, uint valueB) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method name should have Equals in it as it doesn't specify what its doing currently
Maybe you could add a note to the doc about how to run 32 bit? Was it just |
…20734) - Tries to consume multiple chars in parallel when possible - Didn't vectorize because inputs to this function are generally fairly small - Moved static GlobalizationMode lookup out of hot path - Removed indirection so that StringComparer now calls directly into workhorse routine
…coreclr#20734) - Tries to consume multiple chars in parallel when possible - Didn't vectorize because inputs to this function are generally fairly small - Moved static GlobalizationMode lookup out of hot path - Removed indirection so that StringComparer now calls directly into workhorse routine Commit migrated from dotnet/coreclr@dd6c690
This improves the throughput of
OrdinalIgnoreCasestring comparisons, especially for smaller inputs.(Edit: Removed the earlier posted benchmark numbers since they're outdated. See updated numbers at #20734 (comment).)
The performance gains come primarily from two changes. First, the inner loop is now checking 2 characters at a time instead of one character at a time. Second, the internal branching within the inner loop has been eliminated, which helps avoid branch mispredictions when mixed-case input is provided. The source code comment in the inner loop explains this in more detail.
This code isn't vectorized because strings being compared for equality tend to be small - dictionary keys, identifiers, and the like. I also experimented with performing operations on 4 chars (64 bits) at a time on 64-bit platforms. It provides for inputs of 4+ chars an approx. 15% throughput boost above and beyond what this PR provides. Ultimately I didn't include the 64-bit code because I thought it would complicate the method for too little gain. If consensus is that the gain is worth it I can reintroduce the 64-bit-specific code paths.