Improve Enumerable.ToArray to avoid excessive copying for lazy enumerables #11208
Conversation
@stephentoub can you PTAL? I admit that the added logic is a bit complex; however, this is a frequently-used path in cases like ….
I want to think about this some more. It's not clear to me yet that it's entirely a win, even if a microbenchmark for throughput and Gen0 GCs shows an improvement. It may be; I'm slightly concerned, though, that this will end up with more objects held onto for the duration of the operation (plus the few additional small objects that get allocated), etc. As you say, it's also a lot more code.
We want the implementation to work well for both JITed and AOT environments. This trick does not work for AOT - you will pay for the extra code there.
> This trick does not work for AOT - you will pay for the extra code there.
How much is the cost? Since the method is compiled in advance, there is no first-time overhead invoking the method like when it is jitted.
The cost is in binary size and compile time.
What is the improvement for this case, or for other common LINQ use cases? LINQ has ToArray optimized in the layer above via ….
If the number of customers is greater than 32, for example 50, then we exchange the …
Curious - why not new T[first.Length * 2]?
I didn't want to be overly optimistic with the allocations: having first.Length * 2 (which would be 64, as this PR is currently implemented) would lead us to allocate more than 2N elements for an N-length enumerable in some cases. For example, a 100-element enumerable would lead us to allocate space for 224 elements.
Makes sense. In a typical doubling algorithm, you'd have 32 elements and double the size of the array to 64, making room for another 32. Then double again to 128, making room for another 64. Etc. Since we're allocating an array that won't include the initial set of elements, we don't want to do that initial doubling. So we allocate an additional array of length 32 for 32 elements (64 total), then an additional array of length 64 for 64 elements (128 total), etc.
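To make the two schedules concrete, here is a quick sketch (Python for brevity; the function names are mine, and the 32-element initial buffer size is taken from the discussion above, not from the actual implementation):

```python
def doubling_buffers(n, initial=32):
    """Classic doubling: each extra buffer is twice the size of the
    previous one (first.Length * 2, then * 4, ...)."""
    sizes = [initial]
    while sum(sizes) < n:
        sizes.append(sizes[-1] * 2)
    return sizes

def catch_up_buffers(n, initial=32):
    """Scheme described above: each extra buffer equals the total
    number of elements buffered so far, so the total capacity still
    doubles with each new buffer, without the initial over-allocation."""
    sizes = [initial]
    while sum(sizes) < n:
        sizes.append(sum(sizes))
    return sizes
```

For a 100-element enumerable, doubling allocates buffers of [32, 64, 128] (224 slots, the over-allocation mentioned above), while the catch-up scheme allocates [32, 32, 64] (128 slots).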
I have an idea that should sometimes reduce the number of allocations/copying:
Allocate first.Length * 2 up front, but place the elements in the new array starting at first.Length. Once you reach the end of the new array, or the end of the iterator:
- If there are more items, put them at the start of the new array (you will have to reorder them when you de-fragment the array list)
- If there are no more items, place the elements from first in the new array, from 0 to first.Length
@NSIL That seems like it would increase allocations for e.g. lengths 33-64, since before we would allocate a T[32] and a T[32], and now we would allocate a T[32] and a T[64]. What would be an example in which it decreases allocations/copying?
If there are between 32 and 64 elements, you can just return the new array.
Edit: I don't know if resizing down allocates. If it does, then it doesn't help at all.
@NSIL Resizing down does allocate, unfortunately.
Never mind then :) I assumed it didn't; will check next time.
I like the change. It seems that for ….
LGTM |
@jamesqo - could you try benchmarking scenarios with T being a reference type (or a struct containing a reference type)?
stephentoub left a comment:
I don't love the complexity this adds, but I can see it being worth it for larger enumerables. Thanks for working on it.
We should just use Array.Resize / Array.Copy and let them employ appropriate optimizations.
IEnumerator.MoveNext is spec'd to return false on subsequent calls once it has returned false, but I'm concerned that not every IEnumerator implementation follows that, and this is introducing an additional MoveNext call after MoveNext may have already returned false in certain cases. Let's avoid that.
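As an illustration of the hazard (a Python analogue I constructed, not code from this PR): a conforming iterator keeps signaling exhaustion on every subsequent call, but nothing stops a hand-written one from misbehaving if you poke it again after the end.

```python
class ResettingIterator:
    """A hand-written iterator that violates the 'stay exhausted'
    contract: after signaling exhaustion once, the next call starts
    over -- analogous to MoveNext returning true after it already
    returned false."""
    def __init__(self, items):
        self._items = list(items)
        self._i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._i >= len(self._items):
            self._i = 0  # misbehaves: resets instead of staying done
            raise StopIteration
        item = self._items[self._i]
        self._i += 1
        return item

it = ResettingIterator([1, 2])
assert list(it) == [1, 2]
# Consuming the *same* iterator again yields elements a second time --
# exactly the kind of surprise an extra MoveNext call risks.
assert list(it) == [1, 2]
```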
If we've reached here, we must have exited the loop after calling get_Current. MoveNext is only called while index < buffer.Length; if they're equal, then we exited via break. (Perhaps I should add a comment clarifying this?)
A comment would be helpful. This is part of the complexity I was talking about.
Same comment as earlier. Please just use Array.Resize / Array.Copy.
We don't need this assert; this was just checked two lines earlier.
Should this be checked?
Will do
Rather than special-casing first, should we just put first into buffers at the very beginning and then treat it like any other array in the list? Then you can delete this copy. The one extra slot in the buffers list's underlying array of arrays should not matter.
It would cause a slight overhead for enumerables of length 33-64, since adding it to the list would cause the List to allocate one T[4][]. I wanted to defer that allocation as much as possible.
> I wanted to defer that allocation as much as possible.
I see. Ok.
Nit: could just foreach, e.g.
foreach (T[] buffer in buffers)
{
    Array.Copy(buffer, 0, result, copied, buffer.Length);
    copied += buffer.Length;
}
(I also changed previous to buffer here, as I'm not sure what previous is referring to here as a name.)
That will be slower; I checked some time ago and get_Item and set_Item of List are both inlined.
I'm not suggesting this because of performance. Does it make a measurable difference here?
Hmm... no. I'll change it to use foreach, then.
Do we need to carry over any of the logic from https://github.com/dotnet/corefx/pull/11208/files#diff-be1bf738e84b4f261cc144a01fdb196cL63 ? If not, the local isn't necessary and this could just be:
current = new T[current.Length * 2];
@stephentoub No, since if the enumerable is between 2^30 and 2^31 - 1 elements, we will not attempt to allocate an array over MaxArrayLength, unlike the old algorithm. Will change.
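The capping being referred to might look like this (a sketch under my assumptions: MAX_ARRAY_LENGTH stands in for the runtime's maximum array length, commonly 0x7FEFFFFF, and the growth step is simplified):

```python
MAX_ARRAY_LENGTH = 0x7FEFFFFF  # assumed value of the runtime's array-length cap

def next_capacity(current_length):
    """Double the buffer size, but clamp to the maximum array length,
    so an enumerable between 2^30 and 2^31 - 1 elements never requests
    an array larger than the runtime allows."""
    return min(current_length * 2, MAX_ARRAY_LENGTH)
```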
Thanks for your efforts on this, @jamesqo.
Improve Enumerable.ToArray to avoid excessive copying for lazy enumerables (dotnet/corefx#11208)
* Improve Enumerable.ToArray implementation to avoid excessive copying
Commit migrated from dotnet/corefx@c9c676e
The idea is that instead of resizing the array (i.e. throwing it out, allocating a new one, and re-copying the data) every time the sequence doesn't fit, we can keep the old array around, allocate a new one of size * 2, and just continue reading into the new array at index 0. This requires keeping around a list of the arrays we've filled up so far; however, the overhead of this list will be substantially less than the size of the data itself.
A visualization of what I'm talking about for a 300-length enumerable: [image omitted]
(I also put this in the comments to help future readers understand.)
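For readers without the image, the scheme can be sketched roughly as follows (Python for brevity; names like `filled` are mine, and the real implementation works on T[] buffers with Array.Copy rather than Python lists):

```python
def to_array(source, initial_capacity=32):
    """Buffer an iterable into one final list without re-copying
    earlier elements on every resize: completed buffers are kept in
    a list, and each new buffer doubles the total capacity."""
    it = iter(source)

    # Fill the first fixed-size buffer.
    first = []
    for item in it:
        first.append(item)
        if len(first) == initial_capacity:
            break
    else:
        # The source ran out before the first buffer filled.
        return list(first)

    filled = []                   # completed buffers, kept rather than re-copied
    current = []                  # buffer currently being filled
    capacity = initial_capacity   # target size of `current`
    count = len(first)            # elements buffered so far

    for item in it:
        if len(current) == capacity:
            filled.append(current)
            capacity = count      # next buffer == total so far => capacity doubles
            current = []
        current.append(item)
        count += 1

    # Single final copy: first, then each filled buffer, then the tail.
    result = []
    result.extend(first)
    for buf in filled:
        result.extend(buf)
    result.extend(current)
    return result
```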
Then when we need the final array to return from the method, we just calculate first.Length + list.Sum(a => a.Length) + indexWeFinishedReadingIntoInCurrentArray, pre-allocate an array of that size, and then copy each of the arrays into it.
I opted to only use this algorithm for enumerables of length > 32, since for smaller sizes it will lead to increased fragmentation, and the overhead of the list allocation will probably be noticeable in comparison to, say, 20 elements. ~32 is also the threshold at around which Array.Copy starts getting faster than a for-loop.
Benchmark results:
https://gist.github.com/jamesqo/399b72bf5de8e2cbd83d044836cbefa4 (includes results/source code)
You can see that the new implementation consistently has about 2/3 the gen0 collections as the old one.
The timings are somewhat inconsistent (disclaimer: I only did 100,000 iterations, since the tests were taking far too long to run), but you can see that the new implementation is generally faster than or the same speed as the old one. (I suspect the differences may be hard to measure due to all of the interface invocations going on, in addition to the covariant array type checks when T is a reference type.)
cc @stephentoub @VSadov @JonHanna