This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@VSadov
Member

@VSadov VSadov commented Mar 29, 2019

Fixes: https://github.com/dotnet/coreclr/issues/603

==== Before merging

  • Remove OBJECTREF field from CastCache and pass the ref as a parameter in all usages.
    Makes it clear(er) that it does not need protection.
  • Put the special treatment of T -> Nullable<T> back into CanCastTo. But do not let it into the cache!!!
    The reasons for this are a bit tricky. The cache stores “compatible-with” castability: basically, whether assignment between boxed types is safe. These are the rules used by castclass and isinst.
    Generic constraints are subtly different. It looks like, intentionally or not, T does not satisfy the Nullable<T> constraint.
    That may look strange, since the boxed forms have the same representation and are thus assignable.
    However, when constrained calls are considered, there is a problem: for example, a boxed int still does not have a HasValue method.
    In any case, we are not deciding here how Nullable<T> constraints should work.
    Even if the current behavior were somehow undesirable, there is no advantage whatsoever to changing it in this particular PR.
  • Get rid of NoGC helpers.
    Our premise is that a single cast, even if relatively expensive, is still fairly cheap. Only repeated casts cause grief, and the cache fixes that. We expect, and see, high hit rates. As a result, noGC helpers have little chance of helping.
    The unframed portion should contain only trivial checks (null checks, identity) and a cache lookup. After that we just call the framed helpers.
  • Handle “failed PublishType” case.
    There is a funny part in the type loader where an incomplete type may participate in cast analysis (and potentially be cached) before being rejected by the loader and deallocated, leading to ABA problems.
    Rejection is rare, but it can happen and can be nonfatal. We will only allow fully loaded types into the cache. This guarantees that the types cannot be "undone".
  • Add benchmarks for key scenarios to dotnet/performance.
    Added a few benchmarks for casting performance. Primarily dealing with variance. performance#922
  • Justify the bucket size (currently 8) by measurements in the case of max sized cache.
    I have measured "churning" scenario with various bucket sizes.
    Bucket sizes between 4 and 16 seem acceptable. The tradeoff is between typical table size and how bad the worst case may get.
    For now 8 seems to be a good balance between expected worst case performance and typical memory usage:
    . Bucket size 8 causes resizes at about 50% capacity and has about 2x perf difference between fast/lucky and slow/unlucky casts.
    . Bucket size 4 causes resizes at 25%, which results in 2x more memory consumed when below max.
    . Bucket size 16 results in 4x difference between fast/lucky and slow/unlucky casts.
    It is possible that we can later improve the tradeoff by adjusting preemption policy and perhaps move to a different bucket size. That would be up to further tuning.
  • Confirm expected performance on ARM64 and ARM32
    We use two load barriers in a hot path. There was a concern that it could be noticeable on weak memory models.
    To validate this, I tried removing the barriers and running the same microbenchmark. That would generally be unsafe, but the benchmark is single-threaded so it is ok for the purpose of measuring.
    . ARM64 performance did not change, meaning the load barriers have very little impact in this scenario.
    . ARM32 did benefit from removing the barriers, but there is still a considerable gain from caching even with the barriers present. It may be possible to change the implementation just for ARM32 to avoid barriers, but it is not clear whether the benefits would outweigh the added complexity.
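The bucketed lookup described above (trivial checks, then a bounded probe before falling back to the framed helpers) can be sketched roughly as follows. This is a minimal single-threaded C++ sketch with hypothetical names (`CastCache`, `TryGet`, `TryAdd`); the real cache additionally versions entries, uses barriers on weak memory models, and resizes at about 50% capacity, all of which are omitted here:

```cpp
// Hypothetical sketch of a bucketed cast cache, not the actual CoreCLR code.
// Keys are (source type, target type) pointer pairs; lookups probe at most
// BUCKET_SIZE slots, so a lookup has a hard upper bound on its cost.
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct CastCache {
    static constexpr size_t BUCKET_SIZE = 8;   // probe limit per lookup (see measurements above)
    struct Entry { uintptr_t src = 0, dst = 0; bool castable = false; };

    explicit CastCache(size_t slots) : table(slots) {}

    static size_t Hash(uintptr_t src, uintptr_t dst) {
        // cheap pointer-pair mix; the real cache uses a stronger hash
        return static_cast<size_t>((src * 0x9E3779B97F4A7C15ull) ^ (dst >> 4));
    }

    // Returns the cached answer, or nullopt on a miss (caller then runs the
    // full cast analysis and calls TryAdd with the result).
    std::optional<bool> TryGet(uintptr_t src, uintptr_t dst) const {
        size_t i = Hash(src, dst);
        for (size_t d = 0; d < BUCKET_SIZE; ++d) {          // bounded probe
            const Entry& e = table[(i + d) % table.size()];
            if (e.src == src && e.dst == dst) return e.castable;
        }
        return std::nullopt;
    }

    bool TryAdd(uintptr_t src, uintptr_t dst, bool castable) {
        size_t i = Hash(src, dst);
        for (size_t d = 0; d < BUCKET_SIZE; ++d) {
            Entry& e = table[(i + d) % table.size()];
            if (e.src == 0 || (e.src == src && e.dst == dst)) {
                e = {src, dst, castable};
                return true;
            }
        }
        return false;  // bucket full: the real cache would preempt or resize
    }

    std::vector<Entry> table;
};
```

The bounded probe is what makes the "2x difference between fast/lucky and slow/unlucky casts" figure above concrete: a lucky key is found at probe distance 0, an unlucky one at distance 7.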

@VSadov VSadov changed the title Easy out for same types. Faster conversions Mar 29, 2019
@VSadov VSadov force-pushed the ConvI branch 9 times, most recently from 225e709 to 2ef1f1b on April 4, 2019 23:46

@juliusfriedman juliusfriedman left a comment


+1 For effort

@VSadov VSadov force-pushed the ConvI branch 10 times, most recently from 18e8f72 to cf56af8 on April 13, 2019 23:49
@VSadov VSadov closed this Apr 14, 2019
@VSadov VSadov reopened this Apr 14, 2019
@VSadov VSadov force-pushed the ConvI branch 2 times, most recently from f6d2e28 to e80f054 on April 15, 2019 15:04
@jkotas
Member

jkotas commented Apr 16, 2019

It would be better to have the cache use managed GC-collected memory, so that flushes do not contribute to GC pauses, there is no need to worry about leaks or dangling pointers under various race conditions, and no need to worry about fragmentation. #Resolved

VSadov added 23 commits October 25, 2019 10:21
- comments
- remove now redundant code from ObjIsInstanceOfCore
== use one version field instead of two.
 - one version field is sufficient, since this is a cache and we do not expect heavy reader/writer contention.
 - makes tables 25% smaller on 32-bit.

== track the lookup distance of cache entries and allow replacing entries with newer ones if they have a shorter distance.
 - slightly improves average lookup lengths at the cost of occasionally needing to recompute and re-add old entries to their new locations (re-adding is capped by the bucket limit and eventual resize).
 - on a cast-stressing microbenchmark, ~10% improvement in throughput is observed.
- completely suppress caching of `T-->Nullable<T>` casts, assert that in TryAdd.
- make CanCastTo behave as before with `T-->Nullable<T>`
- tests for the above
- use EX_TRY/EX_CATCH vs. C++
- do not allocate in FlushCurrentCache (when EE is suspended)
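The single-version-field scheme from the commit notes above is essentially a per-entry seqlock: writers bump the version to odd, mutate, then bump it back to even, while readers retry (or treat the entry as a miss) if they observe an odd or changed version. A rough C++ sketch with hypothetical names; ordering is simplified to seq_cst here, whereas the actual implementation uses the cheaper acquire barriers discussed elsewhere in this thread:

```cpp
// Hypothetical per-entry seqlock sketch, not the actual CoreCLR code.
#include <atomic>
#include <cstdint>
#include <optional>

struct VersionedEntry {
    std::atomic<uint32_t> version{0};   // odd while a writer is mid-update
    uintptr_t src = 0, dst = 0;
    bool castable = false;

    // Since this is a cache, a torn or in-progress read is simply a miss:
    // the caller falls back to the full cast analysis.
    std::optional<bool> Read(uintptr_t s, uintptr_t d) const {
        uint32_t v1 = version.load(std::memory_order_seq_cst);
        if (v1 & 1) return std::nullopt;             // writer in progress
        uintptr_t es = src, ed = dst;
        bool c = castable;
        uint32_t v2 = version.load(std::memory_order_seq_cst);
        if (v1 != v2) return std::nullopt;           // entry changed under us
        if (es != s || ed != d) return std::nullopt; // different key: miss
        return c;
    }

    void Write(uintptr_t s, uintptr_t d, bool c) {
        version.fetch_add(1, std::memory_order_seq_cst);  // -> odd: readers bail
        src = s; dst = d; castable = c;
        version.fetch_add(1, std::memory_order_seq_cst);  // -> even: published
    }
};
```

One version field per entry suffices precisely because a stale read only costs a cache miss, never a wrong answer, which is why heavy reader/writer contention machinery is unnecessary here.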
@VSadov VSadov merged commit a55a7eb into dotnet:master Oct 26, 2019
@stephentoub
Member

@VSadov, can you please share perf test results?

@VSadov
Member Author

VSadov commented Oct 29, 2019

The first thing that we are fixing here is the observed complexity of casts.

The issue is that composite types like arrays, generic interfaces, and delegates require the analysis to go into constituent parts (element types, type arguments, their constraints, and so on), often on both sides of the cast. That can be very involved.
What is worse is that through nesting the complexity can be made arbitrarily high. Fundamentally this is hard to avoid.

With this change there is an upper bound on the cost of repeated casts. When you see the same cast again, it is extremely likely that you will just get it from the cache. The cache does not dig through types at all and is pretty fast.

There are still ways in which casts can be improved (we have further plans), but the main issue of some casts being a lot more expensive than others is fixed.
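The effect described above, that an arbitrarily expensive analysis becomes cheap on repetition once its result is cached, can be illustrated with a toy memoized predicate. This is hypothetical illustration code, not CoreCLR's; the "depth" parameter stands in for the nesting that makes real cast analysis arbitrarily costly:

```cpp
// Toy illustration: an expensive recursive check, memoized.
#include <map>

// Stand-in for the recursive cast analysis: cost grows with nesting depth.
static long analysisCalls = 0;
bool SlowCanCast(int depth) {
    ++analysisCalls;
    return depth == 0 ? true : SlowCanCast(depth - 1);
}

// First call for a given key pays the full cost; repeats hit the cache and
// never re-enter the deep analysis, bounding the cost of repeated casts.
bool CachedCanCast(int depth) {
    static std::map<int, bool> cache;   // keyed by a (source, target) pair in reality
    auto it = cache.find(depth);
    if (it != cache.end()) return it->second;
    bool result = SlowCanCast(depth);
    cache.emplace(depth, result);
    return result;
}
```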

@VSadov
Member Author

VSadov commented Oct 29, 2019

The results of original repro from #603 (basically timing 200000000 casts in a for loop):

=== before (just a Ctrl-F5 with whatever coreclr/sdk I have installed):

Empty: 233ms
List<int> to ICollection<int>: 653ms
List<int> to ICollection: 834ms
List<double> to ICollection<int>: 796ms
Thread to IReadOnlyCollection<int>: 4670ms
List<int> to IReadOnlyCollection<int>: 9173ms
List<string> to IReadOnlyCollection<object>: 10544ms
List<double> to IReadOnlyCollection<int>: 10683ms
string[] to IReadOnlyCollection<object>: 3694ms

=== after (with my privately built coreclr)

Empty: 291ms
List<int> to ICollection<int>: 610ms
List<int> to ICollection: 700ms
List<double> to ICollection<int>: 867ms
Thread to IReadOnlyCollection<int>: 975ms
List<int> to IReadOnlyCollection<int>: 1187ms
List<string> to IReadOnlyCollection<object>: 1002ms
List<double> to IReadOnlyCollection<int>: 1182ms
string[] to IReadOnlyCollection<object>: 1062ms

Note: the first 3 cases are "easy" cases handled entirely in assembly helpers.
Nothing changed for these cases since they do not use the cache.

@VSadov
Member Author

VSadov commented Oct 29, 2019

I have run casting benchmarks from https://github.com/dotnet/performance locally.

https://gist.github.com/VSadov/663a141ee142613d2c56c1716f8ac8d8

The relevant scenarios there (at the bottom of the table) are measured slightly differently: just a single cast per invocation, so there is some noise and overhead.
There are good improvements though.



7 participants