Avoid allocating HashSet in Distinct() for some counts#97845
Avoid allocating HashSet in Distinct() for some counts#97845stephentoub wants to merge 2 commits intodotnet:mainfrom
Conversation
If we can get the count for the underlying source and it's 0 or 1, we can avoid allocating the HashSet, as distinctness only matters when there are multiple elements.
|
Tagging subscribers to this area: @dotnet/area-system-linq Issue DetailsIf we can get the count for the underlying source and it's 0 or 1, we can avoid allocating the HashSet, as distinctness only matters when there are multiple elements.
|
|
I wonder if 2 elements can get similar optimization in a sense that you can just compare those without using Hashset and return either 1 or 2 elements. Or using something like ValueHashSet for some small count of values like 7 or less? Or are these an overreach? |
I considered that, but it would mean extra invocations of GetHashCode and Equals, which can be user-provided functions of arbitrary complexity, and while it might help the smaller cases, it would hurt the larger cases when that work then needed to be duplicated. Plus extra complexity here. |
| public TSource[] ToArray() => Enumerable.HashSetToArray(new HashSet<TSource>(_source, _comparer)); | ||
| public TSource[] ToArray() | ||
| { | ||
| if (TryGetNonEnumeratedCount(_source, out int count) && count < 2) |
There was a problem hiding this comment.
Optimizing for counted sources of size < 2 seems somewhat niche to me, is it worth the added type tests?
|
Just curious, because the iterator state count for Distinct was maxed at 2 for a long time, but is there a reason that a third state couldn't be added to defer an allocation in the enumerator? The allocation of HashSet could be deferred to state 2 if MoveNext has an additional item, and state 3 can be the final state. I think it comes out to be the same number of ops. |
|
I'm changing this code in another PR. Will close this for now and revisit. |
If we can get the count for the underlying source and it's 0 or 1, we can avoid allocating the HashSet, as distinctness only matters when there are multiple elements.