Faster conversions #23548
Conversation
Force-pushed from 225e709 to 2ef1f1b
juliusfriedman left a comment:
+1 For effort
Force-pushed from 18e8f72 to cf56af8
Force-pushed from f6d2e28 to e80f054
It would be better to have the cache use managed, GC-collected memory so that flushes do not contribute to GC pauses, there is no need to worry about leaks or dangling pointers under various race conditions, and no need to worry about fragmentation. #Resolved
- comments
- remove now redundant code from ObjIsInstanceOfCore
Removed comments about encoded pointers.
== use one version field instead of two.
- one version field is sufficient since this is a cache and we do not expect heavy reader/writer contention.
- makes tables 25% smaller on 32-bit.

== track the lookup distance of cache entries and allow replacing entries with newer ones if they have a shorter distance.
- slightly improves average lookup lengths at the cost of occasionally needing to recompute and re-add old entries to their new locations. (re-adding is capped by the bucket limit and eventual resize)
- on a cast-stressing microbenchmark, a ~10% improvement in throughput is observed.
- completely suppress caching of `T --> Nullable<T>` casts; assert that in TryAdd.
- make CanCastTo behave as before with `T --> Nullable<T>`
- tests for the above
- use EX_TRY/EX_CATCH vs. C++
- do not allocate in FlushCurrentCache (when the EE is suspended)
@VSadov, can you please share perf test results?
The first thing we are fixing here is the observed complexity of casts. The issue is that composite types like arrays, generic interfaces, and delegates require that the analysis go into constituent parts - element types, type arguments, their constraints, and so on - often on both sides of the cast. That can be very involved. With this change there is an upper bound on the cost of repeated casts: when you see the same cast again, it is extremely likely that you will just get the result from the cache. The cache does not dig through types at all and is pretty fast. There are still ways in which casts can be improved (we have further plans), but the main issue of some casts being a lot more expensive than others is fixed.
The results of the original repro from #603 (basically timing 200000000 casts in a for loop):

=== before (just a Ctrl-F5 with whatever coreclr/sdk I have installed):

=== after (with my privately built coreclr)

Note: the first 3 cases are "easy" cases handled entirely in assembly helpers.
I have run casting benchmarks from https://github.com/dotnet/performance locally: https://gist.github.com/VSadov/663a141ee142613d2c56c1716f8ac8d8. The relevant scenarios there (at the bottom of the table) are measured slightly differently - just a single cast per invocation - so there is some noise and overhead.
Fixes: https://github.com/dotnet/coreclr/issues/603
==== Before merging
Makes it clear(er) that it does not need protection.
Put `T -> Nullable<T>` back into CanCastTo. But do not let it into the cache! The reasons for this are a bit tricky.

The cache stores "compatible-with" castability - basically whether assignment between boxed types is safe - the rules used by `castclass` and `isinst`. Generic constraints are subtly different. It looks like, intentionally or not, `T` does not satisfy a `Nullable<T>` constraint. That may look strange, since the boxed forms have the same representation and are thus assignable. However, when constrained calls are considered, there is a problem: for example, a boxed int still does not have a `HasValue` method.

Anyway, we are not deciding here how `Nullable<T>` constraints should work. Even if the current behavior were somehow undesirable, there is no advantage whatsoever to changing it in this particular PR.
Our premise is that a single cast, even if relatively expensive, is still fairly cheap. Only in the repeated case does it cause grief, and the cache fixes that. We do expect and see high hit rates. As a result, noGC helpers really have little chance of helping.
The unframed portion should contain only trivial checks (null checks, identity) and a cache lookup. Then we just call framed helpers.
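The split described above might look roughly like this. The names (`IsInstanceFastPath`, `SlowPathCanCast`, `CacheTryGet`) are illustrative stand-ins, not the runtime's actual helpers; the point is only the shape: trivial checks and one cache probe on the hot path, everything else behind the framed call.

```cpp
#include <cassert>
#include <cstdint>

// Hedged sketch of the fast/slow split: the unframed hot path does only a
// null check, an identity check, and a cache probe; any real type analysis
// happens in a framed helper that may trigger GC, throw, etc.
bool SlowPathCanCast(uintptr_t srcType, uintptr_t dstType);            // framed helper (stub below)
bool CacheTryGet(uintptr_t srcType, uintptr_t dstType, bool& result);  // cache probe (stub below)

bool IsInstanceFastPath(uintptr_t objType, uintptr_t dstType) {
    if (objType == 0) return false;             // null check: trivial, no frame needed
    if (objType == dstType) return true;        // exact-type identity check
    bool cached;
    if (CacheTryGet(objType, dstType, cached))  // one cache probe, no type walking
        return cached;
    return SlowPathCanCast(objType, dstType);   // rare: full analysis under a frame
}

// Trivial stand-ins so the sketch is self-contained; a real implementation
// would consult the cast cache and the full type-compatibility rules.
bool SlowPathCanCast(uintptr_t, uintptr_t) { return false; }
bool CacheTryGet(uintptr_t, uintptr_t, bool&) { return false; }
```

With high hit rates, almost all casts resolve in the first three lines plus the probe, which is why the framed portion's cost matters so little in steady state.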
There is a funny part in the type loader where an incomplete type may participate in cast analysis (and potentially be cached) before being rejected by the loader and deallocated, leading to ABA problems. Rejection is rare, but it can happen and can be nonfatal. We will only allow fully loaded types in the cache. This guarantees that the types cannot be "undone".
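The guard described above amounts to a filter on the add path. The sketch below is hypothetical (the names `IsFullyLoaded`, `CacheTryAdd`, and the map/set stand-ins are assumptions); the real check would consult the loader's own load-level state. Skipping the add is always safe because a cache miss just means the cast is recomputed.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <utility>

// Illustrative stand-ins for the loader's state and the cast cache.
static std::set<uintptr_t> g_fullyLoaded;
static std::map<std::pair<uintptr_t, uintptr_t>, bool> g_cache;

bool IsFullyLoaded(uintptr_t typeHandle) {
    return g_fullyLoaded.count(typeHandle) != 0;
}

void CacheTryAdd(uintptr_t src, uintptr_t dst, bool canCast) {
    // Refuse to publish a result involving a type that the loader could still
    // reject and deallocate; a dropped add costs only a recomputation later.
    if (!IsFullyLoaded(src) || !IsFullyLoaded(dst)) return;
    g_cache[{src, dst}] = canCast;
}
```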
Added a few benchmarks for casting performance. Primarily dealing with variance. performance#922
I have measured "churning" scenario with various bucket sizes.
Bucket sizes between 4 and 16 seem acceptable. The tradeoff is between typical table size and how bad the worst-case scenario may get.
For now 8 seems to be a good balance between expected worst case performance and typical memory usage:
- Bucket size 8 causes resizes at about 50% capacity and has about a 2x perf difference between fast/lucky and slow/unlucky casts.
- Bucket size 4 causes resizes at 25%, which results in 2x more memory consumed when below max.
- Bucket size 16 results in a 4x difference between fast/lucky and slow/unlucky casts.
It is possible that we can later improve the tradeoff by adjusting preemption policy and perhaps move to a different bucket size. That would be up to further tuning.
We use two load barriers in a hot path. There was a concern that it could be noticeable on weak memory models.
To validate this, I tried removing the barriers and running the same microbenchmark. That would generally be unsafe, but the benchmark is single-threaded so it is ok for the purpose of measuring.
- ARM64 performance did not change, meaning that the load barriers have very little impact on this scenario.
- ARM32 did benefit from removing the barriers, but there is still a considerable gain from caching even with the barriers present. It may be possible to change the implementation just for ARM32 to avoid barriers, but it is not clear whether the benefits would outweigh the added complexity.
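The two load barriers being discussed are the pair of ordered version reads in a seqlock-style entry read. The sketch below is illustrative only - the entry layout and names are assumptions, not the runtime's actual code, and the writer is simplified for a single-threaded demonstration (a production writer needs care around the odd/even transitions and non-atomic field writes).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Versioned entry: the version is odd while a writer is mid-update, and a
// reader discards any read where the version changed underneath it.
struct Entry {
    std::atomic<uint32_t> version{0};
    uintptr_t source = 0;
    uintptr_t target = 0;
    bool result = false;
};

bool TryReadEntry(const Entry& e, uintptr_t src, uintptr_t dst, bool& result) {
    uint32_t v1 = e.version.load(std::memory_order_acquire);  // load barrier #1
    if (v1 & 1) return false;                                 // update in progress
    uintptr_t s = e.source;
    uintptr_t t = e.target;
    bool r = e.result;
    std::atomic_thread_fence(std::memory_order_acquire);      // load barrier #2:
    uint32_t v2 = e.version.load(std::memory_order_relaxed);  // re-check the version
    if (v1 != v2) return false;                               // torn read: treat as a miss
    if (s != src || t != dst) return false;                   // not the pair we wanted
    result = r;
    return true;
}

void WriteEntry(Entry& e, uintptr_t src, uintptr_t dst, bool result) {
    uint32_t v = e.version.load(std::memory_order_relaxed);
    e.version.store(v + 1, std::memory_order_relaxed);  // odd: writer in progress
    e.source = src;
    e.target = dst;
    e.result = result;
    e.version.store(v + 2, std::memory_order_release);  // even: publish the entry
}
```

On x86 the acquire loads compile to plain loads, which is why the cost only shows up on weak memory models like ARM; removing the barriers (as in the experiment above) turns both version reads into ordinary loads.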