Add ArrayVec and AccumulateVec to reduce heap allocations during interning of slices#37270
Conversation
f963bbd to
beee655
Compare
|
cc @rust-lang/lang We'd like to have impls of |
|
☔ The latest upstream changes (presumably #37269) made this pull request unmergeable. Please resolve the merge conflicts. |
There was a problem hiding this comment.
Isn't this a good place to use from_raw_parts and skip the bounds check for the slice? Or does it not matter?
There was a problem hiding this comment.
Oh I see, didn't think of it like that. Yeah it would be slightly less cruft to optimize away.
There was a problem hiding this comment.
Since rustc devs are the UB police, is this guaranteed to be convertible to T, even without #[repr(C)]? (I'm reading this PR with interest for the future of the external arrayvec and smallvec crates of course.)
There was a problem hiding this comment.
It's technically in newtype form, so we can guarantee it at no cost. However, adding #[repr(C)] would be better here, for now.
There was a problem hiding this comment.
Adding #[repr(C)] would require that T is repr(C), I thought? Or is that just a lint, and we can allow it away?
There was a problem hiding this comment.
If you have DerefMut, this is just drop_in_place(&mut self[..])
There was a problem hiding this comment.
Adding DerefMut feels unsafe since we then have potential for safe mutation of the stack allocated buffer, which we probably don't want to allow?
There was a problem hiding this comment.
I don't see what's unsound about modifying the buffer, since it's only making a slice of the valid/initialized element range.
There was a problem hiding this comment.
It's perfectly safe, it can't do anything harmful, not even semantically.
80c2bcf to
bad8be6
Compare
There was a problem hiding this comment.
These two variables need to be in a SmallVec8 variable for unwinding to drop the right values.
There was a problem hiding this comment.
This should use field assignment, i.e .value = el.
Mark-Simulacrum
left a comment
There was a problem hiding this comment.
Leaving comments to myself, mainly, but a few questions for @eddyb as well.
src/librustc/ty/context.rs
Outdated
There was a problem hiding this comment.
This should probably take I: SliceInternable, but then the current code doesn't support passing it to iter::chain, since SliceInternable is not an Iterator. @eddyb Thoughts on what we should/could do here?
There was a problem hiding this comment.
Most users are genuinely slice though. Not high-priority.
src/librustc/ty/subst.rs
Outdated
There was a problem hiding this comment.
We want to avoid Iterator::cloned(), but the types don't match up here to allow that, since we're iterating over &Substs, not Substs. I'm not sure if there is/what the good solution here is; I believe this is semi-common though.
There was a problem hiding this comment.
You can't avoid an iterator here. .cloned() is suspicious when used on a slice that can be passed directly - there's no cloning cost, these are just pointer copies.
There was a problem hiding this comment.
Can you clarify as to if it's possible to avoid .cloned() here? The target_substs and (I believe) the slice into self are both &Kind instead of Kind. Or is it Ty?
There was a problem hiding this comment.
It's not but that's irrelevant here. cloned is not bad, it's just a sign of a slice iterator, which in some cases (not this) can be replaced with the slice itself.
There was a problem hiding this comment.
Adding #[repr(C)] would require that T is repr(C), I thought? Or is that just a lint, and we can allow it away?
There was a problem hiding this comment.
Adding DerefMut feels unsafe since we then have potential for safe mutation of the stack allocated buffer, which we probably don't want to allow?
src/librustc_mir/build/scope.rs
Outdated
There was a problem hiding this comment.
I feel like this might be easier to optimize if we skip the iter::once, since we're not joining it with anything anyway. Leaving this is a note to self, mostly, since I'm not sure we can omit the iterator with the current implementation.
src/librustc_mir/hair/cx/mod.rs
Outdated
There was a problem hiding this comment.
Should we be taking a params: SliceInternable<...> instead of &[...] here?
There was a problem hiding this comment.
I wish. Most of these take a slice though.
src/librustc_typeck/astconv.rs
Outdated
There was a problem hiding this comment.
Should collect into a MaybeSmallVec8.
There was a problem hiding this comment.
Can't this use an itera- ah shucks, the slice impl problem.
src/librustc_typeck/check/closure.rs
Outdated
There was a problem hiding this comment.
Should we be allocating a vec![...] here? I feel like we might be able to change the struct's type to a SmallVec8?
There was a problem hiding this comment.
The correct thing to do is to intern a slice, but that's a bit more complicated of a refactoring.
src/librustc_typeck/check/mod.rs
Outdated
There was a problem hiding this comment.
Don't collect into a Vec. (this requires slice impl)
There was a problem hiding this comment.
If you make this loop an Extend impl, you can then have fill_item take T: Extend<Kind> + Deref<Target=[Kind]>.
|
EDIT: Moved this to a PR comment to be able to link it. Oh, turns out That is, there is a "minimum struct/union" alignment (not sure where in the C standard, but LLVM supports it and so do we! I just checked and no built-in target specs have minimum ABI alignment for struct/union that's larger than one byte, just "preferred alignment"). The correct thing to do is Thanks to @ubsan and @retep998 on IRC for figuring this out! |
src/librustc/ty/subst.rs
Outdated
There was a problem hiding this comment.
Put the bounds in the where clause.
src/librustc_driver/test.rs
Outdated
There was a problem hiding this comment.
This can pass a slice to intern_substs.
src/librustc_driver/test.rs
Outdated
src/librustc_driver/test.rs
Outdated
f21961f to
a36e00c
Compare
|
☔ The latest upstream changes (presumably #37220) made this pull request unmergeable. Please resolve the merge conflicts. |
a36e00c to
1a3cb0f
Compare
src/librustc/ty/subst.rs
Outdated
There was a problem hiding this comment.
All of these arguments need to be reindented.
d39d8f4 to
25d3b4a
Compare
|
Performance, compared with current master, stage2 release builds (configured with no arguments): |
25d3b4a to
7517e34
Compare
|
@Mark-Simulacrum I'm surprised a bit, I would've expected more wins given how high allocation was on the profile. Maybe the |
|
OK, so, @rust-lang/lang: @eddyb and I had a short chat about fundamental traits on IRC. I am not comfortable with declaring However, I would be comfortable with doing a targeted extension to fundamental. Basically a mildly hacky negative impl. For example, we discussed allowing We would thus be committed to adding negative impls of How do people feel about that? (I'm happy to guide @Mark-Simulacrum through the steps to do that change.) |
|
@nikomatsakis first paragraph is how I felt when I saw @eddyb's question also. One of the biggest problems with
|
|
Do note that we don't care about |
I guess part of this is that, if I were happy with But yeah this does mean we are committing to either @eddyb tbh I meant |
|
IIRC, LLVM can optimize even the malloc case better if you have a fixed capacity (i.e. use |
src/librustc/ty/context.rs
Outdated
There was a problem hiding this comment.
This can be just Slice::empty() like the other one.
b49bfe9 to
d56f099
Compare
d56f099 to
c751299
Compare
c751299 to
a2efca9
Compare
There was a problem hiding this comment.
I'd like a slightly more expansive comment that mentions what the threshold is (8) and what happens when it's exceeded.
There was a problem hiding this comment.
Likewise, please explain what happens if 8 is exceeded.
a2efca9 to
2059fdc
Compare
AccumulateVec is generic over the Array trait, which is currently only implemented for [T; 8].
2059fdc to
989eba7
Compare
|
@bors r+ |
|
📌 Commit 989eba7 has been approved by |
|
⌛ Testing commit 989eba7 with merge a6b3b01... |
…ddyb Add ArrayVec and AccumulateVec to reduce heap allocations during interning of slices Updates `mk_tup`, `mk_type_list`, and `mk_substs` to allow interning directly from iterators. The previous PR, #37220, changed some of the calls to pass a borrowed slice from `Vec` instead of directly passing the iterator, and these changes further optimize that to avoid the allocation entirely. This change yields 50% less malloc calls in [some cases](https://pastebin.mozilla.org/8921686). It also yields decent, though not amazing, performance improvements: ``` futures-rs-test 4.091s vs 4.021s --> 1.017x faster (variance: 1.004x, 1.004x) helloworld 0.219s vs 0.220s --> 0.993x faster (variance: 1.010x, 1.018x) html5ever-2016- 3.805s vs 3.736s --> 1.018x faster (variance: 1.003x, 1.009x) hyper.0.5.0 4.609s vs 4.571s --> 1.008x faster (variance: 1.015x, 1.017x) inflate-0.1.0 3.864s vs 3.883s --> 0.995x faster (variance: 1.232x, 1.005x) issue-32062-equ 0.309s vs 0.299s --> 1.033x faster (variance: 1.014x, 1.003x) issue-32278-big 1.614s vs 1.594s --> 1.013x faster (variance: 1.007x, 1.004x) jld-day15-parse 1.390s vs 1.326s --> 1.049x faster (variance: 1.006x, 1.009x) piston-image-0. 10.930s vs 10.675s --> 1.024x faster (variance: 1.006x, 1.010x) reddit-stress 2.302s vs 2.261s --> 1.019x faster (variance: 1.010x, 1.026x) regex.0.1.30 2.250s vs 2.240s --> 1.005x faster (variance: 1.087x, 1.011x) rust-encoding-0 1.895s vs 1.887s --> 1.005x faster (variance: 1.005x, 1.018x) syntex-0.42.2 29.045s vs 28.663s --> 1.013x faster (variance: 1.004x, 1.006x) syntex-0.42.2-i 13.925s vs 13.868s --> 1.004x faster (variance: 1.022x, 1.007x) ``` We implement a small-size optimized vector, intended to be used primarily for collection of presumed to be short iterators. This vector cannot be "upsized/reallocated" into a heap-allocated vector, since that would require (slow) branching logic, but during the initial collection from an iterator heap-allocation is possible. We make the new `AccumulateVec` and `ArrayVec` generic over implementors of the `Array` trait, of which there is currently one, `[T; 8]`. In the future, this is likely to expand to other values of N. Huge thanks to @nnethercote for collecting the performance and other statistics mentioned above.
|
I did some malloc profiling on Oct 24 (before this landed) and again on Oct 27 (after this landed). I expect the vast majority of changes are due to this PR. Note that these are total/cumulative numbers, i.e. not the number of live blocks/bytes at any point. (The latter numbers barely changed because the allocations avoided by this PR are all short-lived.) |
Updates
mk_tup,mk_type_list, andmk_subststo allow interning directly from iterators. The previous PR, #37220, changed some of the calls to pass a borrowed slice fromVecinstead of directly passing the iterator, and these changes further optimize that to avoid the allocation entirely.This change yields 50% less malloc calls in some cases. It also yields decent, though not amazing, performance improvements:
We implement a small-size optimized vector, intended to be used primarily for collection of presumed to be short iterators. This vector cannot be "upsized/reallocated" into a heap-allocated vector, since that would require (slow) branching logic, but during the initial collection from an iterator heap-allocation is possible.
We make the new
AccumulateVecandArrayVecgeneric over implementors of theArraytrait, of which there is currently one,[T; 8]. In the future, this is likely to expand to other values of N.Huge thanks to @nnethercote for collecting the performance and other statistics mentioned above.