This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@VSadov
Member

@VSadov VSadov commented Mar 29, 2019

Fixes: https://github.com/dotnet/coreclr/issues/603

==== Before merging

  • Remove OBJECTREF field from CastCache and pass the ref as a parameter in all usages.
    Makes it clear(er) that it does not need protection.
  • Put the special treatment of T -> Nullable<T> back into CanCastTo. But do not let it into the cache!!!
    The reasons for this are a bit tricky. The cache stores “compatible-with” castability: basically, whether assignment between boxed types is safe. These are the rules used by castclass and isinst.
    Generic constraints are subtly different. It looks like, intentionally or not, T does not satisfy the Nullable<T> constraint.
    That may look strange, since the boxed forms have the same representation and are thus assignable.
    However, when constrained calls are considered, there is a problem: for example, a boxed int still does not have a HasValue method.
    In any case, we are not deciding here how Nullable<T> constraints should work.
    Even if the current behavior were somehow undesirable, there is no advantage whatsoever to changing it in this particular PR.
  • Get rid of NoGC helpers.
    Our premise is that a single cast, even if relatively expensive, is still fairly cheap. Only repeated casts cause grief, and the cache fixes that. We expect, and see, high hit rates. As a result, noGC helpers have little chance of helping.
    The unframed portion should contain only trivial checks (null checks, identity) and a cache lookup. After that we just call the framed helpers.
  • Handle “failed PublishType” case.
    There is a funny part in the type loader where an incomplete type may participate in cast analysis (and potentially be cached) before being rejected by the loader and deallocated, leading to ABA problems.
    Rejection is rare, but it can happen and can be nonfatal. We will only allow fully loaded types into the cache. This guarantees that the types cannot be "undone".
  • Add benchmarks for key scenarios to dotnet/performance.
    Added a few benchmarks for casting performance. Primarily dealing with variance. performance#922
  • Justify the bucket size (currently 8) by measurements in the case of max sized cache.
    I have measured "churning" scenario with various bucket sizes.
    Bucket sizes between 4 and 16 seem acceptable. The tradeoff is between typical table size and how bad the worst case may get.
    For now 8 seems to be a good balance between expected worst case performance and typical memory usage:
    . Bucket size 8 causes resizes at about 50% capacity and has about 2x perf difference between fast/lucky and slow/unlucky casts.
    . Bucket size 4 causes resizes at 25%, which results in 2x more memory consumed when below max.
    . Bucket size 16 results in 4x difference between fast/lucky and slow/unlucky casts.
    It is possible that we can later improve the tradeoff by adjusting preemption policy and perhaps move to a different bucket size. That would be up to further tuning.
  • Confirm expected performance on ARM64 and ARM32
    We use two load barriers in a hot path. There was a concern that it could be noticeable on weak memory models.
    To validate this, I tried removing the barriers and running the same microbenchmark. That would generally be unsafe, but the benchmark is single-threaded so it is ok for the purpose of measuring.
    . ARM64 performance did not change, meaning the load barriers have very little impact in this scenario.
    . ARM32 did benefit from removing the barriers, but there is still a considerable gain from caching even with the barriers present. It may be possible to change the implementation just for ARM32 to avoid barriers, but it is not clear whether the benefits would outweigh the added complexity.
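The bucketed lookup described above (trivial checks, then a bounded probe before falling back to the framed helpers) can be sketched roughly as follows. This is a minimal single-threaded C++ sketch with hypothetical names (`CastCache`, `TryGet`, `TryAdd`); the real cache additionally versions entries, uses barriers on weak memory models, and resizes at about 50% capacity, all of which are omitted here:

```cpp
// Hypothetical sketch of a bucketed cast cache, not the actual CoreCLR code.
// Keys are (source type, target type) pointer pairs; lookups probe at most
// BUCKET_SIZE slots, so a lookup has a hard upper bound on its cost.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct CastCache {
    static constexpr size_t BUCKET_SIZE = 8;   // probe limit per lookup (see measurements above)
    struct Entry { uintptr_t src = 0, dst = 0; bool castable = false; };

    explicit CastCache(size_t slots) : table(slots) {}

    static size_t Hash(uintptr_t src, uintptr_t dst) {
        // cheap pointer-pair mix; the real cache uses a stronger hash
        return static_cast<size_t>((src * 0x9E3779B97F4A7C15ull) ^ (dst >> 4));
    }

    // Returns the cached answer, or nullopt on a miss (caller then runs the
    // full cast analysis and calls TryAdd with the result).
    std::optional<bool> TryGet(uintptr_t src, uintptr_t dst) const {
        size_t i = Hash(src, dst);
        for (size_t d = 0; d < BUCKET_SIZE; ++d) {          // bounded probe
            const Entry& e = table[(i + d) % table.size()];
            if (e.src == src && e.dst == dst) return e.castable;
        }
        return std::nullopt;
    }

    bool TryAdd(uintptr_t src, uintptr_t dst, bool castable) {
        size_t i = Hash(src, dst);
        for (size_t d = 0; d < BUCKET_SIZE; ++d) {
            Entry& e = table[(i + d) % table.size()];
            if (e.src == 0 || (e.src == src && e.dst == dst)) {
                e = {src, dst, castable};
                return true;
            }
        }
        return false;  // bucket full: the real cache would preempt or resize
    }

    std::vector<Entry> table;
};
```

The bounded probe is what makes the "2x difference between fast/lucky and slow/unlucky casts" figure above concrete: a lucky key is found at probe distance 0, an unlucky one at distance 7.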

@VSadov VSadov changed the title Easy out for same types. Faster conversions Mar 29, 2019
@VSadov VSadov force-pushed the ConvI branch 9 times, most recently from 225e709 to 2ef1f1b on April 4, 2019 23:46

@juliusfriedman juliusfriedman left a comment


+1 For effort

@VSadov VSadov force-pushed the ConvI branch 10 times, most recently from 18e8f72 to cf56af8 on April 13, 2019 23:49
@VSadov VSadov closed this Apr 14, 2019
@VSadov VSadov reopened this Apr 14, 2019
@VSadov VSadov force-pushed the ConvI branch 2 times, most recently from f6d2e28 to e80f054 on April 15, 2019 15:04
@jkotas
Member

jkotas commented Apr 16, 2019

It would be better to have the cache use managed GC-collected memory, so that flushes do not contribute to GC pauses, there is no need to worry about leaks or dangling pointers under various race conditions, and no need to worry about fragmentation. #Resolved

VSadov added 23 commits October 25, 2019 10:21
- comments
- remove now redundant code from ObjIsInstanceOfCore
== use one version field instead of two.
 - one version field is sufficient, since this is a cache and we do not expect heavy reader/writer contention.
 - makes tables 25% smaller on 32-bit.

== track the lookup distance of cache entries and allow replacing entries with newer ones if they have a shorter distance.
 - slightly improves average lookup lengths at the cost of occasionally needing to recompute and re-add old entries to their new locations (re-adding is capped by the bucket limit and eventual resize).
 - on a cast-stressing microbenchmark, ~10% improvement in throughput is observed.
- completely suppress caching of `T-->Nullable<T>` casts, assert that in TryAdd.
- make CanCastTo behave as before with `T-->Nullable<T>`
- tests for the above
- use EX_TRY/EX_CATCH vs. C++
- do not allocate in FlushCurrentCache (when EE is suspended)
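The single-version-field scheme from the commit notes above is essentially a per-entry seqlock: writers bump the version to odd, mutate, then bump it back to even, while readers retry (or treat the entry as a miss) if they observe an odd or changed version. A rough C++ sketch with hypothetical names; ordering is simplified to seq_cst here, whereas the actual implementation uses the cheaper acquire barriers discussed elsewhere in this thread:

```cpp
// Hypothetical per-entry seqlock sketch, not the actual CoreCLR code.
#include <atomic>
#include <cstdint>
#include <optional>

struct VersionedEntry {
    std::atomic<uint32_t> version{0};   // odd while a writer is mid-update
    uintptr_t src = 0, dst = 0;
    bool castable = false;

    // Since this is a cache, a torn or in-progress read is simply a miss:
    // the caller falls back to the full cast analysis.
    std::optional<bool> Read(uintptr_t s, uintptr_t d) const {
        uint32_t v1 = version.load(std::memory_order_seq_cst);
        if (v1 & 1) return std::nullopt;             // writer in progress
        uintptr_t es = src, ed = dst;
        bool c = castable;
        uint32_t v2 = version.load(std::memory_order_seq_cst);
        if (v1 != v2) return std::nullopt;           // entry changed under us
        if (es != s || ed != d) return std::nullopt; // different key: miss
        return c;
    }

    void Write(uintptr_t s, uintptr_t d, bool c) {
        version.fetch_add(1, std::memory_order_seq_cst);  // -> odd: readers bail
        src = s; dst = d; castable = c;
        version.fetch_add(1, std::memory_order_seq_cst);  // -> even: published
    }
};
```

One version field per entry suffices precisely because a stale read only costs a cache miss, never a wrong answer, which is why heavy reader/writer contention machinery is unnecessary here.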
@VSadov VSadov merged commit a55a7eb into dotnet:master Oct 26, 2019
@stephentoub
Member

@VSadov, can you please share perf test results?

@VSadov
Member Author

VSadov commented Oct 29, 2019

The first thing that we are fixing here is the observed complexity of casts.

The issue is that composite types like arrays, generic interfaces, and delegates require the analysis to go into constituent parts (element types, type arguments, their constraints, and so on), often on both sides of the cast. That can be very involved.
What is worse is that through nesting the complexity can be made arbitrarily high. Fundamentally this is hard to avoid.

With this change there is an upper bound on the cost of repeated casts. When you see the same cast again, it is extremely likely that you will just get it from the cache. The cache does not dig through types at all and is pretty fast.

There are still ways in which casts can be improved (we have further plans), but the main issue of some casts being a lot more expensive than others is fixed.
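The effect described above, that an arbitrarily expensive analysis becomes cheap on repetition once its result is cached, can be illustrated with a toy memoized predicate. This is hypothetical illustration code, not CoreCLR's; the "depth" parameter stands in for the nesting that makes real cast analysis arbitrarily costly:

```cpp
// Toy illustration: an expensive recursive check, memoized.
#include <map>

// Stand-in for the recursive cast analysis: cost grows with nesting depth.
static long analysisCalls = 0;
bool SlowCanCast(int depth) {
    ++analysisCalls;
    return depth == 0 ? true : SlowCanCast(depth - 1);
}

// First call for a given key pays the full cost; repeats hit the cache and
// never re-enter the deep analysis, bounding the cost of repeated casts.
bool CachedCanCast(int depth) {
    static std::map<int, bool> cache;   // keyed by a (source, target) pair in reality
    auto it = cache.find(depth);
    if (it != cache.end()) return it->second;
    bool result = SlowCanCast(depth);
    cache.emplace(depth, result);
    return result;
}
```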

@VSadov
Member Author

VSadov commented Oct 29, 2019

The results of original repro from #603 (basically timing 200000000 casts in a for loop):

=== before (just a Ctrl-F5 with whatever coreclr/sdk I have installed):

Empty: 233ms
List<int> to ICollection<int>: 653ms
List<int> to ICollection: 834ms
List<double> to ICollection<int>: 796ms
Thread to IReadOnlyCollection<int>: 4670ms
List<int> to IReadOnlyCollection<int>: 9173ms
List<string> to IReadOnlyCollection<object>: 10544ms
List<double> to IReadOnlyCollection<int>: 10683ms
string[] to IReadOnlyCollection<object>: 3694ms

=== after (with my privately built coreclr)

Empty: 291ms
List<int> to ICollection<int>: 610ms
List<int> to ICollection: 700ms
List<double> to ICollection<int>: 867ms
Thread to IReadOnlyCollection<int>: 975ms
List<int> to IReadOnlyCollection<int>: 1187ms
List<string> to IReadOnlyCollection<object>: 1002ms
List<double> to IReadOnlyCollection<int>: 1182ms
string[] to IReadOnlyCollection<object>: 1062ms

Note: the first 3 cases are "easy" cases handled entirely in assembly helpers.
Nothing changed for these cases since they do not use the cache.

@VSadov
Member Author

VSadov commented Oct 29, 2019

I have run casting benchmarks from https://github.com/dotnet/performance locally.

https://gist.github.com/VSadov/663a141ee142613d2c56c1716f8ac8d8

The relevant scenarios there (at the bottom of the table) are measured slightly differently: just a single cast per invocation, so there is some noise and overhead.
There are good improvements though.



7 participants