Avoid allocating HashSet in Distinct() for some counts#97845

stephentoub · 2024-02-02T02:07:16Z

If we can get the count for the underlying source and it's 0 or 1, we can avoid allocating the HashSet, as distinctness only matters when there are multiple elements.

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Tests).Assembly).Run(args);

[MemoryDiagnoser(false)]
public class Tests
{
    private List<int> _values;

    [Params(0, 1, 2)]
    public int Count { get; set; }

    [GlobalSetup]
    public void Setup() => _values = Enumerable.Range(0, Count).ToList();

    [Benchmark]
    public int[] DistinctToArray() => _values.Distinct().ToArray();

    [Benchmark]
    public List<int> DistinctToList() => _values.Distinct().ToList();

    [Benchmark]
    public int DistinctCount() => _values.Distinct().Count();
}

| Method | Toolchain | Count | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|
| DistinctToArray | \main\corerun.exe | 0 | 34.89 ns | 1.00 | 152 B | 1.00 |
| DistinctToArray | \pr\corerun.exe | 0 | 15.71 ns | 0.45 | 64 B | 0.42 |
| DistinctToList | \main\corerun.exe | 0 | 37.08 ns | 1.00 | 160 B | 1.00 |
| DistinctToList | \pr\corerun.exe | 0 | 21.97 ns | 0.59 | 96 B | 0.60 |
| DistinctCount | \main\corerun.exe | 0 | 29.41 ns | 1.00 | 128 B | 1.00 |
| DistinctCount | \pr\corerun.exe | 0 | 23.39 ns | 0.83 | 64 B | 0.50 |
| DistinctToArray | \main\corerun.exe | 1 | 86.61 ns | 1.00 | 304 B | 1.00 |
| DistinctToArray | \pr\corerun.exe | 1 | 27.08 ns | 0.30 | 96 B | 0.32 |
| DistinctToList | \main\corerun.exe | 1 | 82.34 ns | 1.00 | 336 B | 1.00 |
| DistinctToList | \pr\corerun.exe | 1 | 35.33 ns | 0.50 | 128 B | 0.38 |
| DistinctCount | \main\corerun.exe | 1 | 65.99 ns | 1.00 | 272 B | 1.00 |
| DistinctCount | \pr\corerun.exe | 1 | 17.41 ns | 0.26 | 64 B | 0.24 |
| DistinctToArray | \main\corerun.exe | 2 | 81.79 ns | 1.00 | 304 B | 1.00 |
| DistinctToArray | \pr\corerun.exe | 2 | 86.24 ns | 1.06 | 304 B | 1.00 |
| DistinctToList | \main\corerun.exe | 2 | 90.39 ns | 1.00 | 336 B | 1.00 |
| DistinctToList | \pr\corerun.exe | 2 | 93.72 ns | 1.04 | 336 B | 1.00 |
| DistinctCount | \main\corerun.exe | 2 | 72.62 ns | 1.00 | 272 B | 1.00 |
| DistinctCount | \pr\corerun.exe | 2 | 75.89 ns | 1.05 | 272 B | 1.00 |
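
For readers who want to see the shape of the fast path outside the runtime, here is a minimal sketch. It is not the PR's actual change: `DistinctSketch` and its `DistinctToArray` helper are hypothetical names, and the only framework API it relies on for the cheap count is the public `Enumerable.TryGetNonEnumeratedCount`.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class DistinctSketch
{
    // Hypothetical helper illustrating the idea; not the code in Distinct.SpeedOpt.cs.
    public static T[] DistinctToArray<T>(IEnumerable<T> source, IEqualityComparer<T>? comparer = null)
    {
        // If the source can report its count without being enumerated and that
        // count is 0 or 1, every element is trivially distinct: no HashSet needed.
        if (source.TryGetNonEnumeratedCount(out int count) && count < 2)
        {
            return source.ToArray();
        }

        // Otherwise fall back to the usual HashSet-based de-duplication.
        var set = new HashSet<T>(source, comparer);
        var result = new T[set.Count];
        set.CopyTo(result);
        return result;
    }
}
```

For sources with two or more elements, or with no cheap count, the sketch does the same work as before plus the count check, which is consistent with the roughly flat Count = 2 rows in the table.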


ghost · 2024-02-02T02:07:20Z

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Linq`
Milestone:	-

En3Tho · 2024-02-02T05:57:12Z

I wonder if 2 elements can get similar optimization in a sense that you can just compare those without using Hashset and return either 1 or 2 elements.

Or using something like ValueHashSet for some small count of values like 7 or less?

Or are these an overreach?

stephentoub · 2024-02-02T12:08:33Z

> I wonder if 2 elements can get similar optimization in a sense that you can just compare those without using Hashset and return either 1 or 2 elements.
>
> Or using something like ValueHashSet for some small count of values like 7 or less?
>
> Or are these an overreach?

I considered that, but it would mean extra invocations of GetHashCode and Equals, which can be user-provided functions of arbitrary complexity, and while it might help the smaller cases, it would hurt the larger cases when that work then needed to be duplicated. Plus extra complexity here.
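
One way to make the cost of comparer calls visible is to wrap the comparer and count invocations. The sketch below is illustrative only: `CountingComparer<T>` and `ComparerCost` are hypothetical names, and the pairwise scan is just one possible stand-in for a "small count" strategy, not something proposed in this PR.

```csharp
using System.Collections.Generic;

// Hypothetical wrapper that counts how often Equals and GetHashCode are invoked.
public sealed class CountingComparer<T> : IEqualityComparer<T>
{
    private readonly IEqualityComparer<T> _inner = EqualityComparer<T>.Default;

    public int EqualsCalls { get; private set; }
    public int GetHashCodeCalls { get; private set; }

    public bool Equals(T? x, T? y)
    {
        EqualsCalls++;
        return _inner.Equals(x, y);
    }

    public int GetHashCode(T obj)
    {
        GetHashCodeCalls++;
        return obj is null ? 0 : _inner.GetHashCode(obj);
    }
}

public static class ComparerCost
{
    // Pairwise scan: no hashing, but each new item is compared with every item kept so far.
    public static int CountDistinctLinear<T>(IReadOnlyList<T> items, IEqualityComparer<T> comparer)
    {
        var seen = new List<T>();
        foreach (T item in items)
        {
            bool duplicate = false;
            foreach (T kept in seen)
            {
                if (comparer.Equals(kept, item)) { duplicate = true; break; }
            }
            if (!duplicate) seen.Add(item);
        }
        return seen.Count;
    }

    // HashSet-based count: hashes each item, calling Equals only for candidate matches.
    public static int CountDistinctHashed<T>(IReadOnlyList<T> items, IEqualityComparer<T> comparer) =>
        new HashSet<T>(items, comparer).Count;
}
```

Running both methods over the same input with separate `CountingComparer<T>` instances shows how many user-provided calls each strategy makes. Any scheme that tries one strategy and then falls back to the other pays for both sets of calls.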

eiriktsarpalis · 2024-02-05T14:23:02Z

src/libraries/System.Linq/src/System/Linq/Distinct.SpeedOpt.cs

-            public TSource[] ToArray() => Enumerable.HashSetToArray(new HashSet<TSource>(_source, _comparer));
+            public TSource[] ToArray()
+            {
+                if (TryGetNonEnumeratedCount(_source, out int count) && count < 2)


Optimizing for counted sources of size < 2 seems somewhat niche to me, is it worth the added type tests?
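
To make "counted sources" concrete, the following sketch probes which inputs can report a count without being enumerated. `CountProbe`, `Lazy`, and `HasCheapCount` are hypothetical names used for illustration; `Enumerable.TryGetNonEnumeratedCount` is the public API (available since .NET 6).

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CountProbe
{
    // A compiler-generated iterator: it implements no collection interface,
    // so its count cannot be known without enumerating it.
    public static IEnumerable<int> Lazy()
    {
        yield return 1;
    }

    // True when the source can report its count without being enumerated.
    public static bool HasCheapCount<T>(IEnumerable<T> source) =>
        source.TryGetNonEnumeratedCount(out _);
}
```

`CountProbe.HasCheapCount(new List<int> { 1 })` returns true because `List<T>` implements `ICollection<T>`, while `CountProbe.HasCheapCount(CountProbe.Lazy())` returns false. The fast path under review would only apply to the first kind of source.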

TheCodingOwl · 2024-02-11T20:41:06Z

Just curious, because the iterator state count for Distinct was maxed at 2 for a long time, but is there a reason that a third state couldn't be added to defer an allocation in the enumerator? The allocation of HashSet could be deferred to state 2 if MoveNext has an additional item, and state 3 can be the final state. I think it comes out to be the same number of ops.
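
The deferral described here can be sketched with a plain C# iterator method rather than the hand-written state machine in System.Linq. This is an assumption-laden illustration, not the runtime's iterator: `DeferredDistinct` is a hypothetical name, and the compiler-generated state machine stands in for the extra state.

```csharp
using System.Collections.Generic;

public static class DeferredDistinct
{
    // Hypothetical sketch: yields the first element without allocating a HashSet
    // and only allocates one if the source turns out to have a second element.
    public static IEnumerable<T> Distinct<T>(IEnumerable<T> source, IEqualityComparer<T>? comparer = null)
    {
        using IEnumerator<T> e = source.GetEnumerator();

        // Empty source: nothing to yield, nothing allocated.
        if (!e.MoveNext()) yield break;

        // The first element is always distinct.
        T first = e.Current;
        yield return first;

        // Single-element source: still no HashSet.
        if (!e.MoveNext()) yield break;

        // A second element exists, so de-duplication is now needed.
        var set = new HashSet<T>(comparer);
        set.Add(first);
        do
        {
            T item = e.Current;
            if (set.Add(item)) yield return item;
        }
        while (e.MoveNext());
    }
}
```

Unlike the count-based check, this form does not depend on the source exposing a count. It covers only lazy enumeration, though, not the `ToArray`, `ToList`, and `Count` paths measured in the benchmark.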

stephentoub · 2024-02-27T01:59:47Z

I'm changing this code in another PR. Will close this for now and revisit.

Avoid allocating HashSet in Distinct() for some counts

e025c73

If we can get the count for the underlying source and it's 0 or 1, we can avoid allocating the HashSet, as distinctness only matters when there are multiple elements.

stephentoub added the area-System.Linq label Feb 2, 2024

ghost assigned stephentoub Feb 2, 2024

build-analysis bot mentioned this pull request Feb 3, 2024

[browser][MT] WebWorkerTest.WaitAssertsOnJSInteropThreads #97914

Closed

Merge branch 'main' into distinctcount

65c4117

eiriktsarpalis reviewed Feb 5, 2024

View reviewed changes

stephentoub closed this Feb 27, 2024

stephentoub deleted the distinctcount branch March 25, 2024 20:54

github-actions bot locked and limited conversation to collaborators Apr 25, 2024

stephentoub wants to merge 2 commits into dotnet:main from stephentoub:distinctcount