Tweak VecCache to improve performance#138405
Conversation
|
r? @Noratrieb rustbot has assigned @Noratrieb. Use |
|
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
Tweak `VecCache` to improve performance This has some tweaks to `VecCache` to improve performance. - It saves a `compare_exchange` in `complete` using the new `put_unique` function. - It removes bound checks on entries. These are instead checked in the `slot_index_exhaustive` test. - `initialize_bucket` is outlined and tuned for that. cc `@Mark-Simulacrum`
|
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
| #[cold] | ||
| fn initialize_bucket<V>(&self, bucket: &AtomicPtr<Slot<V>>) -> *mut Slot<V> { | ||
| #[inline(never)] | ||
| fn initialize_bucket<V>(bucket: &AtomicPtr<Slot<V>>, bucket_idx: usize) -> *mut Slot<V> { |
There was a problem hiding this comment.
It avoids Self needing it exist when not inlined, as the needed information can be passed in registers instead.
|
Finished benchmarking commit (0b0612c): comparison URL. Overall result: ❌✅ regressions and improvements - please read the text belowBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage)Results (primary -2.0%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResults (primary 4.1%, secondary -3.1%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 779.667s -> 779.823s (0.02%) |
|
Local results:
Looks like something might be up with |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I can't reproduce the
|
|||||||||||||||||||||||||||||||||||||||||||||||||
|
Cycles are noisy, although this was high above the threshold. Let's try again, just in case it was a fluke. @bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
Tweak `VecCache` to improve performance This has some tweaks to `VecCache` to improve performance. - It saves a `compare_exchange` in `complete` using the new `put_unique` function. - It removes bound checks on entries. These are instead checked in the `slot_index_exhaustive` test. - `initialize_bucket` is outlined and tuned for that. cc `@Mark-Simulacrum`
|
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (d06bb14): comparison URL. Overall result: ❌✅ regressions and improvements - please read the text belowBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never Instruction countThis is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage)This benchmark run did not return any relevant results for this metric. CyclesResults (primary -1.5%, secondary -2.1%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 779.458s -> 779.323s (-0.02%) |
|
Looks like it was a fluke after all. |
|
yes |
|
r? rust-lang/compiler |
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (f020bce): comparison URL. Overall result: ❌✅ regressions and improvements - please read the text belowBenchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf. Next Steps: If you can justify the regressions found in this try perf run, please do so in sufficient writing along with @bors rollup=never Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)Results (primary 1.7%, secondary 3.2%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (secondary -0.3%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeResults (primary -0.0%, secondary -0.0%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Bootstrap: 472.506s -> 470.841s (-0.35%) |
If the indices are unique, the location they operate on are also unique and won't cause a race.
It still uses a lock for the main slots, but this changes the secondary list of inserted keys to just insert without atomics as the length (used as the index) is unique for each |
|
@rustbot ready |
|
Not too comfortable reviewing this, so @rustbot reroll |
|
Also not comfortable, |
|
Can you update the PR description to reflect the more limited change being made here? r=me with that done. |
|
@Mark-Simulacrum Not sure what you mean, the description is still accurate? |
|
@bors r+ Ah, I misread the line about complete. Sorry about that. |
This comment has been minimized.
This comment has been minimized.
What is this?This is an experimental post-merge analysis report that shows differences in test outcomes between the merged PR and its parent PR.Comparing 8340622 (parent) -> 57d2fb1 (this PR) Test differencesShow 5 test diffsStage 0
Stage 1
Additionally, 3 doctest diffs were found. These are ignored, as they are noisy. Job group index Test dashboardRun cargo run --manifest-path src/ci/citool/Cargo.toml -- \
test-dashboard 57d2fb136650d05efb3ed3ea33b330bfc85844d5 --output-dir test-dashboardAnd then open Job duration changes
How to interpret the job duration changes?Job durations can vary a lot, based on the actual runner instance |
|
Finished benchmarking commit (57d2fb1): comparison URL. Overall result: ❌✅ regressions and improvements - please read the text belowOur benchmarks found a performance regression caused by this PR. Next Steps:
@rustbot label: +perf-regression Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)Results (primary -1.1%, secondary -3.6%)A less reliable metric. May be of interest, but not used to determine the overall result above.
CyclesResults (secondary 0.0%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 472.577s -> 473.934s (0.29%) |
|
Tiny regressions on a synthetic benchmark, and the overall benefits easily outweigh them. @rustbot label: +perf-regression-triaged |
This has some tweaks to
VecCacheto improve performance.compare_exchangeincompleteusing the newput_uniquefunction.slot_index_exhaustivetest.initialize_bucketis outlined and tuned for that.cc @Mark-Simulacrum