Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@GrabYourPitchforks
Copy link
Member

@GrabYourPitchforks GrabYourPitchforks commented Oct 31, 2018

This improves the throughput of OrdinalIgnoreCase string comparisons, especially for smaller inputs.

(Edit: Removed the earlier posted benchmark numbers since they're outdated. See updated numbers at #20734 (comment).)

The performance gains come primarily from two changes. First, the inner loop is now checking 2 characters at a time instead of one character at a time. Second, the internal branching within the inner loop has been eliminated, which helps avoid branch mispredictions when mixed-case input is provided. The source code comment in the inner loop explains this in more detail.

This code isn't vectorized because strings being compared for equality tend to be small - dictionary keys, identifiers, and the like. I also experimented with performing operations on 4 chars (64 bits) at a time on 64-bit platforms. It provides for inputs of 4+ chars an approx. 15% throughput boost above and beyond what this PR provides. Ultimately I didn't include the 64-bit code because I thought it would complicate the method for too little gain. If consensus is that the gain is worth it I can reintroduce the 64-bit-specific code paths.

@danmoseley
Copy link
Member

danmoseley commented Nov 1, 2018

It provides for inputs of 4+ chars an approx. 15% throughput boost above and beyond what this PR provides. Ultimately I didn't include the 64-bit code because I thought it would complicate the method for too little gain. If consensus is that the gain is worth it I can reintroduce the 64-bit-specific code paths.

A 15% boost to Equals OrdinalIgnoreCase is worth some code complexity, IMO. Especially as 64 bit is the 90% case right now.

Recall that for Bing, improvements to Equals and IndexOf (operations only a little more primitive) were the single biggest contributor to their performance gains on .NET Core.

@GrabYourPitchforks
Copy link
Member Author

I'll investigate the test failures. It's possible I ran the tests locally against the wrong version of the DLL.

Copy link
Member

@danmoseley danmoseley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever. Did you come up with this idea? Someone else should probably have a look.

@danmoseley
Copy link
Member

Given the test failures, apparently I overlooked a mistake.

// performing a proper deep equality check of the string contents. We want to optimize for
// the case where the equality check is likely to succeed, which means that we want to avoid
// branching within this loop unless we're about to exit the loop, either due to failure or
// due to us running out of input data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The perf results you posted appear to only cover cases where the strings are considered equal. What does perf look like for inputs of the same length that are not equal, for inputs of different lengths, etc.? I agree that dictionary-like data structures are a very common case, but there are plenty of others where there isn't such a first-pass check. Search in src\System.Net.Http\src for ".Equals(", for example, and you'll find lots of OrdinalIgnoreCase checks. Path has several uses without such checks. TimeZoneInfo. X509Certificates. And so on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inputs of different length never hit this method because they're always kicked out by the caller. For inputs that truly aren't equal, the performance is still better than baseline, but not quite as good as if they are truly equal. I can provide benchmark numbers with those tests if you'd like. It'll take a while to rerun all of the benchmarks if we expand the battery to include these, but that's what "go make dinner and let this run in the background" is for. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For inputs that truly aren't equal, the performance is still better than baseline, but not quite as good as if they are truly equal.

Great. I'd be happy with not regressing those cases.

}

return CompareInfo.EqualsOrdinalIgnoreCase(ref x.GetRawStringData(), ref y.GetRawStringData(), x.Length);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is effectively just manually inlining the previous string.Equals call?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, yeah. It's also skipping the unnecessary overhead that String.Equals(...) performs re: length checking, referential equality, and validation of StringComparison parameter, because the caller knows these checks are irrelevant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is checking reference equality.

Why isn't the length check beneficial here?

For validation of the StringComparison, since it was a constant here I would have hoped if it actually was inlined (I realize it wouldn't be without nudging) the JIT would then be able to prune those checks away.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, that's an oversight. It looks like the unit tests are passing regardless, so I'll add a test for this condition to check for future regressions.

Copy link
Member

@stephentoub stephentoub Nov 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So at this point it's saving a non-virtual method call and a switch. Given all the other work being done, does that end up being measurable? If so, ok. If not, it'd be better to avoid the code duplication.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rerun a test with only this change and get new numbers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method Toolchain Mean Error StdDev Scaled
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase modified comparer 5.017 ms 0.0637 ms 0.0532 ms 0.75
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase original comparer 6.675 ms 0.1057 ms 0.0989 ms 1.00

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the perf boost is coming from a combination of not having to perform register shuffling (since the StringComparer.Equals signature is different from the String.Equals signature) and providing the ability for the JIT to tail-call into the workhorse method due to no shuffling being required with the modified comparer.


Debug.Assert(!GlobalizationMode.Invariant);
NonAscii:
return CompareStringOrdinalIgnoreCase(ref Unsafe.AddByteOffset(ref charA, byteOffset), length, ref Unsafe.AddByteOffset(ref charB, byteOffset), length) == 0;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The next iteration will have this moved to a different stub method. Turns out this helps with getting the JIT to emit better prolog in the caller.

/// This is a branchless implementation.
/// </remarks>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool UInt64OrdinalIgnoreCaseAscii(ulong valueA, ulong valueB)
Copy link
Member Author

@GrabYourPitchforks GrabYourPitchforks Nov 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method has a slightly different implementation than the 32-bit version due to the fact that 32-bit operations and 64-bit operations are represented differently by the processor. For example, operations like test and or can take imm32 values but cannot take imm64 values - those values must first be bounced through a temporary register. The 64-bit version of this method is slightly reordered to account for these effects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method name should have Equals in it as it doesn't specify what its doing currently

@GrabYourPitchforks
Copy link
Member Author

Updated perf numbers from the overnight benchmark:

Method Toolchain StringLength Mean Error StdDev Scaled ScaledSD
EqualsOrdinalIgnoreCase_StringsAreIdentical experimental 1 4.991 ms 0.0972 ms 0.1081 ms 0.52 0.01
EqualsOrdinalIgnoreCase_StringsAreIdentical preview1-27030 1 9.638 ms 0.0718 ms 0.0637 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase experimental 1 5.115 ms 0.0153 ms 0.0136 ms 0.59 0.00
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase preview1-27030 1 8.729 ms 0.0384 ms 0.0359 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotEqual experimental 1 4.814 ms 0.0333 ms 0.0295 ms 0.62 0.01
EqualsOrdinalIgnoreCase_StringsNotEqual preview1-27030 1 7.736 ms 0.0622 ms 0.0519 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotAscii experimental 1 19.880 ms 0.2286 ms 0.2138 ms 0.97 0.01
EqualsOrdinalIgnoreCase_StringsNotAscii preview1-27030 1 20.523 ms 0.1394 ms 0.1304 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsAreIdentical experimental 2 4.885 ms 0.0185 ms 0.0164 ms 0.49 0.00
EqualsOrdinalIgnoreCase_StringsAreIdentical preview1-27030 2 9.953 ms 0.0405 ms 0.0359 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase experimental 2 5.030 ms 0.0993 ms 0.1326 ms 0.53 0.02
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase preview1-27030 2 9.568 ms 0.1713 ms 0.1519 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotEqual experimental 2 4.808 ms 0.0196 ms 0.0174 ms 0.60 0.00
EqualsOrdinalIgnoreCase_StringsNotEqual preview1-27030 2 8.044 ms 0.0367 ms 0.0343 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotAscii experimental 2 21.707 ms 0.4292 ms 0.4408 ms 0.96 0.02
EqualsOrdinalIgnoreCase_StringsNotAscii preview1-27030 2 22.603 ms 0.2804 ms 0.2341 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsAreIdentical experimental 3 5.165 ms 0.1091 ms 0.1299 ms 0.50 0.01
EqualsOrdinalIgnoreCase_StringsAreIdentical preview1-27030 3 10.383 ms 0.1615 ms 0.1432 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase experimental 3 5.028 ms 0.0991 ms 0.0879 ms 0.46 0.01
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase preview1-27030 3 10.936 ms 0.0607 ms 0.0568 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotEqual experimental 3 4.818 ms 0.0181 ms 0.0151 ms 0.62 0.00
EqualsOrdinalIgnoreCase_StringsNotEqual preview1-27030 3 7.739 ms 0.0471 ms 0.0441 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotAscii experimental 3 24.301 ms 0.4373 ms 0.3877 ms 0.97 0.02
EqualsOrdinalIgnoreCase_StringsNotAscii preview1-27030 3 24.962 ms 0.0758 ms 0.0672 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsAreIdentical experimental 4 5.004 ms 0.0515 ms 0.0430 ms 0.45 0.00
EqualsOrdinalIgnoreCase_StringsAreIdentical preview1-27030 4 11.079 ms 0.0481 ms 0.0450 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase experimental 4 5.156 ms 0.0939 ms 0.0878 ms 0.41 0.01
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase preview1-27030 4 12.724 ms 0.2465 ms 0.3536 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotEqual experimental 4 4.961 ms 0.0258 ms 0.0216 ms 0.63 0.01
EqualsOrdinalIgnoreCase_StringsNotEqual preview1-27030 4 7.837 ms 0.1643 ms 0.1536 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotAscii experimental 4 26.428 ms 0.4068 ms 0.3397 ms 0.97 0.01
EqualsOrdinalIgnoreCase_StringsNotAscii preview1-27030 4 27.238 ms 0.1066 ms 0.0997 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsAreIdentical experimental 6 6.025 ms 0.0295 ms 0.0262 ms 0.48 0.01
EqualsOrdinalIgnoreCase_StringsAreIdentical preview1-27030 6 12.497 ms 0.2416 ms 0.2141 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase experimental 6 6.064 ms 0.0661 ms 0.0552 ms 0.47 0.00
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase preview1-27030 6 12.938 ms 0.0544 ms 0.0454 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotEqual experimental 6 4.949 ms 0.0136 ms 0.0127 ms 0.63 0.02
EqualsOrdinalIgnoreCase_StringsNotEqual preview1-27030 6 7.881 ms 0.1666 ms 0.2107 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotAscii experimental 6 31.130 ms 0.1876 ms 0.1755 ms 0.93 0.01
EqualsOrdinalIgnoreCase_StringsNotAscii preview1-27030 6 33.327 ms 0.1734 ms 0.1622 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsAreIdentical experimental 8 7.088 ms 0.0380 ms 0.0317 ms 0.47 0.00
EqualsOrdinalIgnoreCase_StringsAreIdentical preview1-27030 8 15.067 ms 0.0814 ms 0.0762 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase experimental 8 7.076 ms 0.0327 ms 0.0290 ms 0.48 0.00
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase preview1-27030 8 14.674 ms 0.0402 ms 0.0314 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotEqual experimental 8 4.959 ms 0.0237 ms 0.0222 ms 0.59 0.00
EqualsOrdinalIgnoreCase_StringsNotEqual preview1-27030 8 8.352 ms 0.0378 ms 0.0316 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotAscii experimental 8 34.226 ms 0.1941 ms 0.1720 ms 0.98 0.01
EqualsOrdinalIgnoreCase_StringsNotAscii preview1-27030 8 34.845 ms 0.1563 ms 0.1462 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsAreIdentical experimental 12 8.767 ms 0.1754 ms 0.1722 ms 0.49 0.01
EqualsOrdinalIgnoreCase_StringsAreIdentical preview1-27030 12 18.058 ms 0.3583 ms 0.3983 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase experimental 12 9.007 ms 0.0462 ms 0.0410 ms 0.48 0.00
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase preview1-27030 12 18.843 ms 0.1103 ms 0.1032 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotEqual experimental 12 4.964 ms 0.0243 ms 0.0203 ms 0.59 0.01
EqualsOrdinalIgnoreCase_StringsNotEqual preview1-27030 12 8.396 ms 0.1596 ms 0.1493 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotAscii experimental 12 42.862 ms 0.2054 ms 0.1821 ms 0.95 0.01
EqualsOrdinalIgnoreCase_StringsNotAscii preview1-27030 12 45.005 ms 0.7340 ms 0.6507 ms 1.00 0.00

@danmoseley
Copy link
Member

@GrabYourPitchforks are your numbers only on 64 bit? Did you spot check 32 bit also?

@GrabYourPitchforks
Copy link
Member Author

These are 64-bit only, since that's what our benchmarks are set up for. I don't know how to run 32-bit benchmarks but can follow up offline with folks.

@GrabYourPitchforks GrabYourPitchforks force-pushed the string_equals_ordinalignorecase branch from a2d0a95 to e35dca3 Compare November 1, 2018 21:17
@GrabYourPitchforks
Copy link
Member Author

Most recent iteration has been rebased on latest master to resolve the unit test failures.

@GrabYourPitchforks
Copy link
Member Author

Here's the benchmark for the 32-bit run. tl;dr is that the gain is approximately on par for most cases, but the non-ASCII case is regressed. (This regression doesn't exist on x64.) It's not ideal, but I think the 99% usage for OrdinalIgnoreCase is over ASCII data, so overall it's probably still worth it in the end.

Method Toolchain StringLength Mean Error StdDev Median Ratio RatioSD
EqualsOrdinalIgnoreCase_StringsAreIdentical baseline 6 14.581 ms 0.3183 ms 0.5900 ms 14.312 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsAreIdentical experimental 6 8.914 ms 0.1441 ms 0.1277 ms 8.871 ms 0.62 0.02
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase baseline 6 14.987 ms 0.2948 ms 0.4133 ms 15.053 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsDifferOnlyInCase experimental 6 8.865 ms 0.1948 ms 0.2975 ms 8.932 ms 0.59 0.03
EqualsOrdinalIgnoreCase_StringsNotEqual baseline 6 9.870 ms 0.1960 ms 0.1738 ms 9.819 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotEqual experimental 6 5.885 ms 0.1303 ms 0.1219 ms 5.828 ms 0.60 0.02
EqualsOrdinalIgnoreCase_StringsNotAscii baseline 6 42.196 ms 0.8339 ms 0.7800 ms 42.044 ms 1.00 0.00
EqualsOrdinalIgnoreCase_StringsNotAscii experimental 6 50.593 ms 1.5708 ms 4.6314 ms 50.632 ms 1.13 0.12

Any other objections or feedback before I hit the shiny green button?

@benaadams
Copy link
Member

Will revisit #19436 after this :)

@GrabYourPitchforks
Copy link
Member Author

I'll commit this now since it seems to be good and it'll unblock people like Ben who are waiting. If there's any additional feedback file a separate issue. Thanks all for the review!

@GrabYourPitchforks GrabYourPitchforks merged commit dd6c690 into dotnet:master Nov 3, 2018
/// This is a branchless implementation.
/// </remarks>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool UInt32OrdinalIgnoreCaseAscii(uint valueA, uint valueB)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method name should have Equals in it as it doesn't specify what its doing currently

@danmoseley
Copy link
Member

Here's the benchmark for the 32-bit run

Maybe you could add a note to the doc about how to run 32 bit? Was it just /p:archgroup=x86?
Doc: https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/performance-tests.md

@GrabYourPitchforks GrabYourPitchforks deleted the string_equals_ordinalignorecase branch November 5, 2018 19:17
A-And pushed a commit to A-And/coreclr that referenced this pull request Nov 20, 2018
…20734)

- Tries to consume multiple chars in parallel when possible
- Didn't vectorize because inputs to this function are generally fairly small
- Moved static GlobalizationMode lookup out of hot path
- Removed indirection so that StringComparer now calls directly into workhorse routine
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
…coreclr#20734)

- Tries to consume multiple chars in parallel when possible
- Didn't vectorize because inputs to this function are generally fairly small
- Moved static GlobalizationMode lookup out of hot path
- Removed indirection so that StringComparer now calls directly into workhorse routine

Commit migrated from dotnet/coreclr@dd6c690
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants